scrapy-plugins / scrapy-splash

Scrapy+Splash for JavaScript integration
BSD 3-Clause "New" or "Revised" License

Proxies not set per request? #211

Closed stever123 closed 4 years ago

stever123 commented 5 years ago

I have been using proxies for a while (with scrapy-splash). Until now that was a single static proxy; the problem I have now concerns rotating proxies. I have the following code:

proxies = ['82.209.49.196:8080', '217.9.91.88:8080', '85.142.158.45:8080', '134.209.115.223:3128']
for i in range(0, 3):
    yield SplashRequest(
        callback=self.parse, endpoint='execute', meta={'dont_retry': False},
        args={'lua_source': self.luaScripts['checkIP'],
              'proxy': 'http://' + proxies[i], 'timeout': 90},
        dont_filter=True,
    )

This is what is in my Lua script:

function main(splash, args)
  assert(splash:go('https://httpbin.org/ip'))
  local _linksToBeFixed = 0
  return {mypng = splash:png(),}
end

The proxy used is always the first proxy specified (in this case, 82.209.49.196:8080). This seems quite strange to me. Note that self.luaScripts['checkIP'] is a lua script that goes to https://httpbin.org/ip.

Why is it only the first proxy specified that is used in ALL your SplashRequests? How can you specify different proxies per request as with Scrapy requests (i.e. meta['proxy'])?

Even request.body has different proxies (as set per request) - so this only makes it even stranger that only the one set in the first request is being used for all future SplashRequests...

It also happens if I use splash:set_proxy in the script.

JavierRuano commented 5 years ago

You could use scrapy-proxies (https://github.com/aivarsk/scrapy-proxies/blob/master/README.md) and configure settings.py:

# Proxy mode
# 0 = Every request has a different proxy
# 1 = Take only one proxy from the list and assign it to every request
# 2 = Put a custom proxy to use in the settings
PROXY_MODE = 0


stever123 commented 5 years ago

Thanks man! But the problem is with Splash: the proxy is set at the first SplashRequest (at least in the Splash server; request.body is updated correctly, as I mentioned above). "scrapy-proxies" seems to be about Scrapy only. The problem does not happen with normal Scrapy requests, and I do know how to use different proxies with those, so I do not see how that would solve it :/ Could you please elaborate?

So "scrapy-proxies" works by setting request.meta['proxy'], and this is exactly how to do it with Scrapy requests. However, setting meta['proxy'] will not affect SplashRequests (i.e. Splash) :/ You might then suggest changing request.meta['proxy'] to request.meta['splash']['args']['proxy'] to actually set the proxy through the HTTP API. Unfortunately, this is exactly one of the things I tried, as shown in the OP 😞

Just tested it - as expected, it did not work. The reason is the paragraph above.
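
To make the distinction in this comment concrete, here is a minimal sketch of the two places a proxy can live in a request's meta. The helper names are hypothetical and only for illustration; the meta['splash'] layout (an endpoint plus args) follows the scrapy-splash README. Whether Splash actually honours a different proxy for each concurrent request is the open question of this thread, so this only shows what gets sent, not what Splash does with it.

```python
def scrapy_proxy_meta(proxy):
    """Meta for a plain Scrapy request: Scrapy itself connects through the proxy."""
    return {'proxy': proxy}


def splash_proxy_meta(proxy, lua_source, timeout=90):
    """Meta for a Splash request: the proxy travels to Splash as a render
    argument, so Splash (not Scrapy) is expected to apply it."""
    return {
        'splash': {
            'endpoint': 'execute',
            'args': {
                'lua_source': lua_source,
                'proxy': proxy,
                'timeout': timeout,
            },
        },
    }


# Hypothetical usage: one meta dict per proxy, each carrying its own proxy.
proxies = ['82.209.49.196:8080', '217.9.91.88:8080']
metas = [splash_proxy_meta('http://' + p, lua_source='-- checkIP script here')
         for p in proxies]
```

Each dict in metas carries its own proxy under ['splash']['args']['proxy'], which matches the observation that request.body differs per request.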

JavierRuano commented 5 years ago

https://github.com/TeamHG-Memex/scrapy-rotating-proxies/issues/4


stever123 commented 5 years ago

Yes, I have read that (always do your research before asking for anybody's time :-) ), but it offers no solution. I have tried using request.meta['splash']['args']['proxy'] instead of request.meta['proxy'], and I have also tried the suggestion that "instead of using this proxy for requests to Splash it may pass it as an argument to Splash". As I said, it also happens if I use splash:set_proxy in the script; I would of course do this by passing hosts and ports as arguments to the Splash script. But again, the main struggle is that Splash seems to allow only one proxy per "session", which is also what it looks like when you check its implementation: https://github.com/scrapinghub/splash/blob/3cba485e603998e6325277d1298e98ace1378f09/splash/proxy.py

And if it is true that Splash sets the proxy only at initialization (e.g. at the first SplashRequest), then I think it should be fixed - but I might be wrong of course :)))

JavierRuano commented 5 years ago

Sure, you are researching, but perhaps it is better to read about Lua: https://www.lua.org/docs.html

https://github.com/scrapy-plugins/scrapy-splash/issues/88

stever123 commented 5 years ago

I have read two whole books about just Lua. That is indeed not the problem - I have been working with it for many years. I do not see why the problem would be related to Lua in my case anyway... When setting request.meta['splash']['args']['proxy'], no additional code is needed in your script; it is sent through the HTTP API and handled by Splash.

Of course, I have no right to ask you to try it yourself. But I am sure that if you tried to make 2+ SplashRequests with different proxies, only the first proxy would be used for all of them...

I have now updated the OP such that the content of my Lua script is shown :)

stever123 commented 5 years ago

So the reason it did not work was that I was sending multiple SplashRequests concurrently with different proxies and, for some reason, it seems that Splash cannot handle that. It works when completing one request at a time. Unfortunately, this solution does not scale very well, so if anyone has a solution to this "concurrency-proxies" problem, please let me know :)

Gallaecio commented 5 years ago

You could use multiple Splash instances, and use one proxy with each, for concurrency.

Nonetheless, since concurrency was the problem, could you close this issue?
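
The multiple-instances suggestion could be sketched roughly as follows. This is only an illustration under assumptions: the instance URLs and ports, the SPLASH_INSTANCES list, and the plan_requests helper are all hypothetical, and the sketch just pairs each URL with one instance round-robin so that a given Splash instance only ever sees its own proxy.

```python
from itertools import cycle

# Hypothetical deployment: each Splash instance is dedicated to one proxy.
SPLASH_INSTANCES = [
    {'splash_url': 'http://localhost:8050', 'proxy': 'http://82.209.49.196:8080'},
    {'splash_url': 'http://localhost:8051', 'proxy': 'http://217.9.91.88:8080'},
]


def plan_requests(urls, instances=SPLASH_INSTANCES):
    """Pair each URL with one Splash instance, round-robin.

    Returns keyword-argument dicts that could be passed on to SplashRequest;
    a given instance only ever receives its own proxy.
    """
    plans = []
    for url, inst in zip(urls, cycle(instances)):
        plans.append({
            'url': url,
            'splash_url': inst['splash_url'],
            'args': {'proxy': inst['proxy'], 'timeout': 90},
        })
    return plans


plans = plan_requests(['https://httpbin.org/ip'] * 3)
```

In a spider, each plan could then become something like SplashRequest(callback=self.parse, endpoint='execute', **plan), with lua_source added to args; splash_url is the per-request override of the SPLASH_URL setting described in the scrapy-splash README.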

cscervantes commented 2 years ago

I encountered the same issue and was able to use one unique proxy per request. Here is my code:

function main(splash, args)
  -- Apply the proxy passed in via args to every request Splash makes
  splash:on_request(function(request)
    request:set_proxy{
      host = args.host,
      port = args.port,
      type = 'HTTPS',
    }
  end)
end
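
Since the script above reads the proxy from args.host and args.port, the Python side has to supply those two values with each request. A minimal sketch, assuming proxies are written as 'host:port' strings like the ones in the original post; proxy_args is a hypothetical helper, not part of scrapy-splash.

```python
from urllib.parse import urlsplit


def proxy_args(proxy):
    """Split 'host:port' (optionally prefixed with a scheme) into the
    host and port arguments the Lua script expects."""
    if '://' not in proxy:
        proxy = 'http://' + proxy  # urlsplit needs a scheme to find the host
    parts = urlsplit(proxy)
    return {'host': parts.hostname, 'port': parts.port}


# Hypothetical usage: build the args dict for one request.
args = proxy_args('134.209.115.223:3128')
args['lua_source'] = '-- the on_request script goes here'
```

The resulting dict could be passed as args= to a SplashRequest with endpoint='execute', so that each request carries its own host and port.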