Closed stever123 closed 4 years ago
You could use scrapy-proxies
https://github.com/aivarsk/scrapy-proxies/blob/master/README.md and configure settings.py:
PROXY_MODE = 0
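For context, a fuller settings.py fragment for scrapy-proxies might look like the sketch below. The middleware paths and priorities are reproduced from memory of the linked README and should be verified against it; the proxy list path is a placeholder.

```python
# settings.py (sketch; verify every name against the scrapy-proxies README)

# Retry failed requests, since free proxies fail often.
RETRY_TIMES = 10
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": 90,
    "scrapy_proxies.RandomProxy": 100,
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 110,
}

# Placeholder path: a text file with one proxy URL per line.
PROXY_LIST = "/path/to/proxy/list.txt"

# 0 = pick a different random proxy for every request.
PROXY_MODE = 0
```

Note that this only rotates the proxy on plain Scrapy requests via request.meta['proxy'].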
On Mon, Mar 25, 2019, 17:55, stever123 notifications@github.com wrote:
I have been using proxies for a while (with scrapy-splash). That was a static one; the problem now concerns rotating proxies. I have the following code:
    proxies = ['82.209.49.196:8080', '217.9.91.88:8080', '85.142.158.45:8080', '134.209.115.223:3128']
    for i in range(0, 3):
        yield SplashRequest(
            callback=self.parse,
            endpoint='execute',
            meta={'reqid': 'ID{0}'.format(i), 'download_slot': '{0}'.format(i), 'dont_retry': False},
            args={'lua_source': self.luaScripts['checkIP'], 'proxy': 'http://' + proxies[i], 'timeout': 90},
            dont_filter=True)
The proxy used is always the first proxy specified (in this case, 82.209.49.196:8080). This seems quite strange to me. Note that self.luaScripts['checkIP'] is a Lua script that goes to https://httpbin.org/ip.
Why is it only the first proxy specified that is used in ALL your SplashRequests? How can you specify different proxies per request as with Scrapy requests (i.e. meta['proxy'])?
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/scrapy-plugins/scrapy-splash/issues/211, or mute the thread https://github.com/notifications/unsubscribe-auth/Agwyu0_v5jZQ4GVXtfWHHQ7yr9d94djWks5vaP9xgaJpZM4cHYOy .
Thanks man! But the problem is with Splash: the proxy is set at the first SplashRequest (at least in the Splash server; request.body is updated correctly, as I mentioned above). I do not see how that would change it; "scrapy-proxies" seems to be about Scrapy only. It does not happen with normal Scrapy requests, so I do know how to use different proxies with those requests... Therefore, I do not see how that would solve it :/ Could you please elaborate?
So "scrapy-proxies" works by setting request.meta['proxy'], and this is exactly how to do it with Scrapy requests. However, setting meta['proxy'] will not affect SplashRequests (i.e. Splash) :/ Then you might suggest just changing request.meta['proxy'] to request.meta['splash']['args']['proxy'] to actually set the proxy through the HTTP API. Unfortunately, this is exactly one of the things I tried, as shown in the OP 😞
Just tested it - as expected, it did not work. The reason is given in the paragraph above.
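To make the distinction concrete, here is a minimal sketch of the two meta shapes being discussed. The helper names are hypothetical; the meta['proxy'] and meta['splash']['args'] keys are the ones mentioned in this thread.

```python
def scrapy_proxy_meta(proxy):
    """Meta for a plain Scrapy request: the proxy is read from meta['proxy']."""
    return {"proxy": "http://" + proxy}


def splash_proxy_meta(proxy, lua_source):
    """Meta for a Splash request: the proxy is sent as a Splash HTTP API argument."""
    return {
        "splash": {
            "endpoint": "execute",
            "args": {"lua_source": lua_source, "proxy": "http://" + proxy},
        }
    }


# Hypothetical usage with one of the proxies from the OP.
plain = scrapy_proxy_meta("217.9.91.88:8080")
rendered = splash_proxy_meta("217.9.91.88:8080", "function main(splash) end")
print(plain["proxy"])
print(rendered["splash"]["args"]["proxy"])
```

In the first case the proxy sits between Scrapy and the target (or between Scrapy and Splash); in the second it is meant to sit between Splash and the target site.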
https://github.com/TeamHG-Memex/scrapy-rotating-proxies/issues/4
Yes, I have read that (always do your research before asking for anybody's time :-) ), but I found no solution there. I have tried "request.meta['splash']['args']['proxy'] instead of request.meta['proxy']", and I have also tried "instead of using this proxy for requests to Splash it may pass it as an argument to Splash". As I said: "It also happens if I use splash:set_proxy in the script"; I would of course do this by passing ports and hosts as arguments to the Splash script. But again, the main struggle is that Splash seems to allow only one proxy per "session", which is also what it looks like when you check its implementation, like this: https://github.com/scrapinghub/splash/blob/3cba485e603998e6325277d1298e98ace1378f09/splash/proxy.py
And if it is true that Splash is setting the proxy only at initialization (e.g. at the first SplashRequest), then I think it should be fixed - but I might be wrong of course :)))
Sure, you are researching, but perhaps it is better to read about Lua. https://www.lua.org/docs.html
https://github.com/scrapy-plugins/scrapy-splash/issues/88
On Mon, Mar 25, 2019, 23:03, stever123 notifications@github.com wrote:
Yes I have read that (always do your research before asking for anybody's time :-) ) but no solution is found. And "request.meta['splash']['args']['proxy'] instead of request.meta['proxy']" is tried and " instead of using this proxy for requests to Splash it may pass it as an argument to Splash" is also tried. As I said: "It also happens if I use splash:set_proxy in the script"; I would of course do this by passing ports and hosts as arguments to the Splash script. But again, the main struggle is that it seems like Splash is only allowing one proxy per "session" - which is also looks like when you check its implementation, like this https://github.com/scrapinghub/splash/blob/3cba485e603998e6325277d1298e98ace1378f09/splash/proxy.py .
I have read two whole books about just Lua. That is indeed not the problem - I have been working with it for many years. I do not see why the problem would be related to Lua in my case anyway... When setting request.meta['splash']['args']['proxy'], no additional code is needed in your script; it is sent through the HTTP API and handled by Splash.
Of course, I have no right to ask you to try it yourself. But I am sure, if you tried to make 2+ SplashRequests with different proxies - only the first will be used in both Splash requests...
I have now updated the OP such that the content of my Lua script is shown :)
So the reason it did not work was that I was sending multiple SplashRequests concurrently with different proxies, and for some reason it seems like Splash cannot handle that. It works when completing one request at a time. Unfortunately, this solution does not scale very well, so if anyone has a solution to this "concurrency-proxies" problem, please let me know :)
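The thread does not say how the requests were serialized. One way to get "one request at a time" in Scrapy, as a sketch, is to cap global concurrency in settings.py with the standard CONCURRENT_REQUESTS setting:

```python
# settings.py (sketch)

# Allow only one in-flight request at a time, so that Splash never
# processes two requests with different proxies concurrently.
# This is an assumption about how to reproduce the workaround above,
# not necessarily what the author did.
CONCURRENT_REQUESTS = 1
```

A global cap is used here because the code in the OP assigns each request its own download_slot, so per-slot limits alone would not serialize them.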
You could use multiple Splash instances, and use one proxy with each, for concurrency.
Nonetheless, since concurrency was the problem, could you close this issue?
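The "one proxy per Splash instance" idea could be sketched as follows. The Splash URLs are hypothetical (for example, two containers on different ports), and the helper names are made up for illustration. The sketch relies on scrapy-splash letting meta['splash']['splash_url'] override the Splash URL per request.

```python
from itertools import cycle

# Hypothetical Splash instances, each paired with exactly one proxy.
SPLASH_URLS = ["http://localhost:8050", "http://localhost:8051"]
PROXIES = ["82.209.49.196:8080", "217.9.91.88:8080"]


def pinned_splash_meta(splash_url, proxy, lua_source):
    """Build request meta that targets one Splash instance with its own proxy."""
    return {
        "splash": {
            "splash_url": splash_url,
            "endpoint": "execute",
            "args": {"lua_source": lua_source, "proxy": "http://" + proxy},
        }
    }


def assign_instances(n_requests, lua_source):
    """Distribute n_requests round-robin over the (instance, proxy) pairs."""
    pairs = cycle(zip(SPLASH_URLS, PROXIES))
    return [
        pinned_splash_meta(url, proxy, lua_source)
        for (url, proxy), _ in zip(pairs, range(n_requests))
    ]


metas = assign_instances(3, "function main(splash) end")
for meta in metas:
    print(meta["splash"]["splash_url"], meta["splash"]["args"]["proxy"])
```

Each resulting dict would be passed as the meta of a scrapy.Request, so that a given Splash instance only ever sees a single proxy.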
I encountered the same issue and was able to use one unique proxy per request. Here is my code:
    function main(splash, args)
        splash:on_request(function(request)
            -- host and port are passed in as script arguments
            request:set_proxy{host = args.host, port = args.port, type = 'HTTPS'}
        end)
    end
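On the Scrapy side, the host and port would have to be passed as arguments to that script. A minimal sketch, assuming proxies are stored as 'host:port' strings as in the OP (the helper name is hypothetical):

```python
def proxy_script_args(proxy, lua_source):
    """Split a 'host:port' proxy string into the args the Lua script reads."""
    host, _, port = proxy.rpartition(":")
    return {"lua_source": lua_source, "host": host, "port": int(port)}


# Hypothetical usage: these would be passed as args= to a SplashRequest
# with endpoint='execute', making args.host and args.port available in Lua.
args = proxy_script_args("134.209.115.223:3128", "function main(splash, args) end")
print(args["host"], args["port"])
```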
I have been using proxies for a while (with scrapy-splash). That was a static one; the problem now concerns rotating proxies. I have the following code:
This is what is in my Lua script:
The proxy used is always the first proxy specified (in this case, 82.209.49.196:8080). This seems quite strange to me. Note that self.luaScripts['checkIP'] is a Lua script that goes to https://httpbin.org/ip.
Why is it only the first proxy specified that is used in ALL your SplashRequests? How can you specify different proxies per request as with Scrapy requests (i.e. meta['proxy'])?
Even request.body has different proxies (as set per request), so this only makes it even stranger that only the one set in the first request is being used for all future SplashRequests... It also happens if I use splash:set_proxy in the script.
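A check like the request.body one described above could be sketched as follows, assuming the Splash arguments are sent as a JSON-encoded POST body (the scrapy-splash default, as far as I know). The sample body is a stand-in, not captured output.

```python
import json


def proxy_in_body(body):
    """Return the 'proxy' argument from a JSON-encoded Splash request body."""
    return json.loads(body.decode("utf-8")).get("proxy")


# Stand-in for request.body of an already processed SplashRequest.
sample_body = json.dumps(
    {"lua_source": "...", "proxy": "http://217.9.91.88:8080", "timeout": 90}
).encode("utf-8")

print(proxy_in_body(sample_body))
```

Comparing this value with the IP reported by https://httpbin.org/ip shows whether Splash actually used the proxy that Scrapy sent.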