The /execute endpoint doesn't handle proxy profiles automatically, nor most other render.xxx arguments. Values of the arguments are available in the splash.args table; the idea is to use Splash scripting features to handle them.
With /execute you can use splash:on_request to set a proxy - see example 5. There is no need to attach volumes and create proxy profiles if you use /execute endpoint.
Note that currently Crawlera doesn't play well with Splash - it uses large delays for each requested resource (including js, css, etc. files) and sends them through different IPs. So the full request needs much more time to finish, and the requesting behaviour doesn't look natural. We're working on a solution now.
In the meantime you can try using Crawlera only for the first request; depending on the web site it could work. Write this logic in a splash:on_request handler.
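A minimal sketch of that approach (untested; the host, port and credentials below are placeholders):

```lua
function main(splash)
  local first_request = true
  splash:on_request(function(request)
    -- send only the first request through Crawlera
    if first_request then
      request:set_proxy{
        host = "proxy.crawlera.com",  -- placeholder
        port = 8010,
        username = "user",            -- placeholder
        password = "pass",            -- placeholder
      }
      first_request = false
    end
  end)
  assert(splash:go(splash.args.url))
  return splash:html()
end
```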
Thanks for the quick reply @kmike,
I'm getting the same error with the /render.html endpoint:
http://splash:8050/render.html?url=http://www.whatismyip.com&proxy=crawlera
I've tried the splash:on_request handler and am getting a Lua scripting error even when copying example 5 from the docs.
I've tried a different proxy service as well and am getting the same results....
> The /execute endpoint doesn't handle proxy profiles automatically, nor most other render.xxx arguments. Values of the arguments are available in the splash.args table; the idea is to use Splash scripting features to handle them.
To elaborate: suppose you pass 'wait=1' and 'url=http://example.com' arguments to the /execute endpoint. By default this does nothing, but you can write a script like this to get behaviour similar to render.html:
```lua
function main(splash)
  assert(splash:go(splash.args.url))
  assert(splash:wait(splash.args.wait))
  return splash:html()
end
```
> I've tried the splash:on_request handler and am getting a Lua scripting error even when copying example 5 from the docs.
What is the error message?
In example 5 you need to pass 'username' and 'password' arguments to the /execute endpoint.
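For example, a sketch that reads the credentials from splash.args (the placeholder host and the url argument are assumptions; the username/password arguments follow example 5):

```lua
function main(splash)
  splash:on_request(function(request)
    request:set_proxy{
      host = "proxy.crawlera.com",  -- placeholder
      port = 8010,
      -- credentials arrive as GET arguments to /execute
      username = splash.args.username,
      password = splash.args.password,
    }
  end)
  assert(splash:go(splash.args.url))
  return splash:html()
end
```

called with something like http://splash:8050/execute?lua_source=...&url=http://example.com&username=xxx&password=xxx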
Therein lies the problem - I'm not getting an error in the logs; it's just as if every site were blacklisted from using the proxy. (If I intentionally misformat the .ini file I do get an error, but otherwise it looks like it is recognizing the file.)
Re: example 5 - yes, I had manually hardcoded those into the Lua script.
Some more details..
```lua
function main(splash)
  splash:on_request(function(request)
    request:set_proxy{
      host = "mydomain.crawlera.com",
      port = 8010,
      username = "myuser",
      password = "mypassword",
    }
  end)
  assert(splash:go("https://www.google.com/"))
  assert(splash:wait(0.5))
  return {
    html = splash:html(),
    png = splash:png(),
    har = splash:har(),
  }
end
```
returns:

```
unhandled Lua error: [string "function main(splash)..."]:2: attempt to call method 'on_request' (a nil value)
```
What is your Splash version? splash:on_request is a new feature in Splash 1.6.
Splash 1.6
I'm pulling and building from the latest docker build
Try updating your scrapinghub/splash image - likely it is not Splash 1.6. Just tried it; for me both 1.6 and 'latest' work:

```
docker run -it -p 8050:8050 scrapinghub/splash:1.6
```

If I'm not mistaken, `docker pull scrapinghub/splash` should update the image.
hmm.. still no luck.
I'm putting

```
http://splash:8050/execute?lua_source=function+main%28splash%29%0A%09splash%3Aon_request%28function%28request%29%0A++++%09request%3Aset_proxy%7B%0A++++%09%09host+%3D+%22domain.crawlera.com%22%2C%0A++++%09%09port+%3D+8010%0A++%09%09%7D%0A%09end%29%0A++%0A++assert%28splash%3Ago%28%22https%3A%2F%2Fwww.google.com%2F%22%29%29%0A++assert%28splash%3Await%280.5%29%29%0A++return+%7B%0A++++html+%3D+splash%3Ahtml%28%29%2C%0A++++png+%3D+splash%3Apng%28%29%2C%0A++++har+%3D+splash%3Ahar%28%29%2C%0A++%7D%0Aend
```

into the browser.
It's this script:

```lua
function main(splash)
  splash:on_request(function(request)
    request:set_proxy{
      host = "domain.crawlera.com",
      port = 8010
    }
  end)
  assert(splash:go("https://www.google.com/"))
  assert(splash:wait(0.5))
  return {
    html = splash:html(),
    png = splash:png(),
    har = splash:har(),
  }
end
```
and still getting:
```
unhandled Lua error: [string "function main(splash)..."]:2: attempt to call method 'on_request' (a nil value)
```
Sorry, I still don't understand how that can happen in Splash 1.6. The error `attempt to call method 'on_request' (a nil value)` means there is no on_request method, but it is present in 1.6 - it works for me in the scrapinghub/splash:latest and scrapinghub/splash:1.6 docker images and when executed locally, and tests pass on Travis for the 1.6 branch and for master.
Try visiting http://splash:8050/ - which version number is displayed?
Ok - I am indeed using 1.6 - I just rebooted the entire instance and pulled a fresh docker image... very strange..
The problem seems to be with authentication. I'm getting:
```
unhandled Lua error: [string "function main(splash)..."]:6: http407
```
with
```lua
splash:on_request(function(request)
  request:set_proxy{'xxx.crawlera.com', 8010, username = 'xxx', password = 'xxx'}
end)
```
I am, however, able to use a proxy that doesn't require authentication...
I think the type of authentication Crawlera requires is not being passed through properly - but otherwise it seems to work now.
Thanks, I'll check it.
@kmike thanks for your help with this. Depending on my environment, it alternates between the proxy authentication error (407) and timing out. I'm now getting:

```
Timeout exceeded rendering page
```

Perhaps the timeouts have something to do with the redirects/multiple proxies that Crawlera uses?
I haven't started to check/debug 407 errors yet. Do you see a pattern?
Regarding timeouts - see the comments above: https://github.com/scrapinghub/splash/issues/242#issuecomment-113255174 and https://github.com/scrapinghub/splash/issues/242#issuecomment-113255495. Crawlera uses long delays between requests, but these delays are not required for Splash-like workloads; we're working on a fix. In the meantime you can try using Crawlera only for the first request (by writing some ifs in the splash:on_request handler) and increasing timeouts - pass a larger timeout GET argument to /execute, and start Splash with a larger --max-timeout value if 60s timeouts are still too restrictive.
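For instance (the values below are illustrative):

```
# per-request: pass a larger timeout (in seconds) to /execute
curl "http://splash:8050/execute?timeout=300&lua_source=..."

# server-side: raise the cap on allowed timeout values when starting Splash
docker run -it -p 8050:8050 scrapinghub/splash:1.6 --max-timeout 600
```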
So far, I seem to be getting a timeout far more often than a 407 - 60s is too restrictive, so I'll try a longer timeout...
I'm doing something like this:

```lua
local flag = 1  -- set before installing the handler
splash:on_request(function(request)
  if flag == 1 then
    request:set_proxy{"xxx.crawlera.com", 2010, username="xxx", password="xxx"}
    flag = 0
  end
end)
```

to proxy only the first request...
Looks like even when I set max-timeout to 5 minutes, I get a timeout on a different front:
```
2015-06-19 00:41:11.837282 [render] [28935520] loadFinished: RenderErrorInfo(type='Network', code=4, text=u'Socket operation timed out', url=u'http://google.com/')
```
@AlexIzydorczyk did you try anything other than Google? google.com is known to redirect to country-specific Google domains:
```
~ $ curl -v http://google.com
* Rebuilt URL to: http://google.com/
*   Trying 188.43.66.99...
* Connected to google.com (188.43.66.99) port 80 (#0)
> GET / HTTP/1.1
> Host: google.com
> User-Agent: curl/7.42.1
> Accept: */*
>
< HTTP/1.1 302 Found
< Cache-Control: private
< Content-Type: text/html; charset=UTF-8
< Location: http://www.google.ru/?gfe_rd=cr&ei=PLSDVbC5Go2DZKmFgeAC
< Content-Length: 256
< Date: Fri, 19 Jun 2015 06:18:36 GMT
< Server: GFE/2.0
< Alternate-Protocol: 80:quic,p=0
<
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>302 Moved</TITLE></HEAD><BODY>
<H1>302 Moved</H1>
The document has moved
<A HREF="http://www.google.ru/?gfe_rd=cr&ei=PLSDVbC5Go2DZKmFgeAC">here</A>.
</BODY></HTML>
* Connection #0 to host google.com left intact
```
That could probably get in the way. Could you try requesting something simpler with no redirects, e.g. http://httpbin.org/get ?
@qrilka,
Looks like I'm still getting the same problem behind the Crawlera proxy, even at httpbin. I'm using mydomain.crawlera.com as the host, 8010 as the port, and the right credentials (they work elsewhere).
@kmike, Splash-with-Crawlera support is a planned future update, right?
@kmike , @qrilka - by the way, does Splash respect system environment variables? That is, if I set http_proxy, would Splash route requests through that?
Hey @AlexIzydorczyk - yes, Splash-with-Crawlera support is a planned future update.
Splash doesn't respect system environment variables.
@kmike thanks, just curious - is the Crawlera/Splash fix something that can be done on the Splash side or is it a fix on the Crawlera side?
If it's on the Splash side, mind giving me a rough idea of what the upgrade would entail? So far, I've been getting around it by just making the timeout very very long.
@AlexIzydorczyk it will require some changes to Crawlera, but not in Splash. We're going to provide a Lua module to enable Crawlera in Splash.
@kmike thanks, makes sense.
In the meantime, I've managed to actually get Crawlera to work by using Squid3 as a pass-through proxy between Crawlera and Splash. It occasionally times out, but when using a cluster of docker containers running Splash it's usable, and the qt5 branch seems to be much more performant.
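Roughly, the Squid side looks like this (an untested squid.conf sketch; the port, host and login values are placeholders):

```
# squid.conf - forward everything to Crawlera, never go direct
http_port 3128
cache_peer mydomain.crawlera.com parent 8010 0 no-query login=user:pass
never_direct allow all
http_access allow all
```

Splash then uses the local Squid (the http_port above) as its proxy, without authentication, which sidesteps the 407 issue.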
Your earlier comment about the queue makes a lot of sense now - the biggest bottleneck is efficiently allocating requests between Splash instances without having them time out too much, while also not underusing resources (Splash seems to be both memory- and CPU-intensive).
Perhaps I need to be looking at longer Lua scripts that do multiple pages with one Splash request, rather than making separate requests to Splash (which I presume has overhead from creating new browser objects each time).
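Something like this sketch, say (the URL list and waits are illustrative):

```lua
function main(splash)
  local urls = {"http://httpbin.org/get", "http://example.com/"}
  local results = {}
  for _, url in ipairs(urls) do
    -- reuse the same browser instance for every page
    assert(splash:go(url))
    assert(splash:wait(0.5))
    results[url] = splash:html()
  end
  return results
end
```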
Scrapy Splash + Crawlera is working now, except for JSP pages. Why is that? They also show a timeout error.
Hi all,
I am running Splash in a docker container on Ubuntu 12.04.5 LTS and am having trouble getting proxy-profiles to work.
I have this in my /etc/splash/proxy-profiles/crawlera.ini file:
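It has the standard proxy-profile shape (the actual values are replaced with placeholders here):

```ini
[proxy]
host=mydomain.crawlera.com
port=8010
username=myuser
password=mypassword
```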
and I start the docker container mapping that volume to its equivalent: `-v /etc/splash/proxy-profiles/:/etc/splash/proxy-profiles/`. It appears that by default Splash is launched in the docker container with a flag that tells it where to look for proxy profiles. And when I pass the &proxy=crawlera parameter into the typical splash:8050/render.html?ur... url, it does not throw an error (if I pass a nonexistent proxy profile it shows "proxy profile not found"), so I am confident it is finding the profile. In the logs, I can see the proxy argument being picked up - so the proxy parameter is definitely there and recognized... but it doesn't do anything. Visiting http://www.whatismyip.com yields the same IP whether or not I have the proxy parameter on.
Any ideas? Or thoughts on how to better diagnose the issue?