scrapinghub / splash

Lightweight, scriptable browser as a service with an HTTP API

Splash Ignoring Proxy #242

Closed AlexIzydorczyk closed 9 years ago

AlexIzydorczyk commented 9 years ago

Hi all,

I am running Splash in a docker container on Ubuntu 12.04.5 LTS and am having trouble getting proxy-profiles to work.

I have this in my /etc/splash/proxy-profiles/crawlera.ini file:

[proxy]
host=<mydomain>.crawlera.com
port=8010

; optional, default is no auth
username=<user>
password=<pass>

and I start the docker container with that volume mapped to its container-side equivalent: -v /etc/splash/proxy-profiles/:/etc/splash/proxy-profiles/. It appears that, by default, Splash is launched in the docker container with a flag telling it where to look for proxy profiles.

When I pass the &proxy=crawlera parameter into the typical splash:8050/render.html?url=... URL, it does not throw an error (if I pass a nonexistent proxy profile it shows "proxy profile not found"), so I am confident it is finding the profile.

In the logs, I am actually seeing:

2015-06-18 17:33:46.661570 [stats] {"maxrss": 148776, "load": [0.0, 0.01, 0.05], "fds": 50, "qsize": 0, "rendertime": 1.3779900074005127, "active": 0, "path": "/execute", "args": {"lua_source": ["function main(splash)\r\n  local url = splash.args.url\r\n  splash.images_enabled = false\r\n  assert(splash:go(url))\r\n  assert(splash:wait(0.5))\r\n  return {\r\n    html = splash:html(),\r\n    png = splash:png(),\r\n    har = splash:har(),\r\n  }\r\nend"], "url": ["http://www.whatismyip.com"], "proxy": ["crawlera"], "images": ["1"], "expand": ["1"], "wait": ["0.5"]}, "_id": 91663608}

So the proxy parameter is definitely there and recognized... but it doesn't do anything. Visiting http://www.whatismyip.com yields the same IP whether or not I have the proxy parameter on.

Any ideas? Or thoughts on how to better diagnose the issue?

kmike commented 9 years ago

The /execute endpoint doesn't handle proxy profiles automatically, nor most other render.xxx arguments. The argument values are available in the splash.args table; the idea is to use Splash's scripting features to handle them.

With /execute you can use splash:on_request to set a proxy - see example 5. There is no need to attach volumes and create proxy profiles if you use the /execute endpoint.
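
For reference, a minimal sketch of the idea (the host, port and credentials below are placeholders, not the exact code of example 5):

function main(splash)
    splash:on_request(function(request)
        -- route every request through the given proxy
        request:set_proxy{
            host = "proxy.example.com",  -- placeholder host
            port = 8010,
            username = "user",  -- optional
            password = "pass",
        }
    end)
    assert(splash:go(splash.args.url))
    assert(splash:wait(0.5))
    return splash:html()
end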

Note that currently Crawlera doesn't play well with Splash - it adds large delays for each requested resource (including js, css, etc. files) and sends them through different IPs. So the full render takes much more time to finish, and the request pattern doesn't look natural. We're working on a solution now.

kmike commented 9 years ago

In the meantime you can try using Crawlera only for the first request; depending on the website it could work. Write this logic in a splash:on_request handler, e.g. like the sketch below.
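
Roughly something like this inside main(splash) (the use_proxy flag and host are illustrative names, not tested code):

    local use_proxy = true  -- proxy only the very first request

    splash:on_request(function(request)
        if use_proxy then
            -- placeholder host/port; add username/password for Crawlera auth
            request:set_proxy{host = "proxy.example.com", port = 8010}
            use_proxy = false
        end
    end)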

AlexIzydorczyk commented 9 years ago

Thanks for the quick reply @kmike,

I'm having the same problem with the /render.html endpoint:

http://splash:8050/render.html?url=http://www.whatismyip.com&proxy=crawlera

I've tried the splash:on_request handler and am getting a Lua scripting error even when copying example 5 from the docs.

I've tried a different proxy service as well and am getting the same results....

kmike commented 9 years ago

> The /execute endpoint doesn't handle proxy profiles automatically, nor most other render.xxx arguments. The argument values are available in the splash.args table; the idea is to use Splash's scripting features to handle them.

To elaborate: suppose you pass 'wait=1' and 'url=http://example.com' arguments to the /execute endpoint. By default this does nothing, but you can write a script like this to get behaviour similar to render.html:

function main(splash)
    assert(splash:go(splash.args.url))
    assert(splash:wait(tonumber(splash.args.wait)))  -- GET arguments arrive as strings
    return splash:html()
end
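
It would then be called with the arguments in the query string, e.g. something like:

http://splash:8050/execute?lua_source=<url-encoded script>&url=http://example.com&wait=1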

> I've tried the splash:on_request handler and am getting a Lua scripting error even when copying example 5 from the docs.

What is the error message?

kmike commented 9 years ago

In example 5 you need to pass 'username' and 'password' arguments to the /execute endpoint.
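
In other words, read them from splash.args inside the handler (a sketch; the proxy host is a placeholder):

function main(splash)
    splash:on_request(function(request)
        request:set_proxy{
            host = "proxy.example.com",  -- placeholder host
            port = 8010,
            -- taken from the username/password GET arguments of /execute
            username = splash.args.username,
            password = splash.args.password,
        }
    end)
    assert(splash:go(splash.args.url))
    return splash:html()
end

and call /execute with &url=...&username=...&password=... appended to the query string.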

AlexIzydorczyk commented 9 years ago

Therein lies the problem - I'm not getting an error in the logs; it's just as if every site were blacklisted from using the proxy (if I intentionally misformat the .ini file I do get an error, but otherwise it looks like it is recognizing the file).

Re: example 5 - yes, I had manually hardcoded those into the Lua script.

AlexIzydorczyk commented 9 years ago

Some more details..

function main(splash)
    splash:on_request(function(request)
        request:set_proxy{
            host = "mydomain.crawlera.com",
            port = 8010,
            username = "myuser",
            password = "mypassword",
        }
    end)

    assert(splash:go("https://www.google.com/"))
    assert(splash:wait(0.5))
    return {
        html = splash:html(),
        png = splash:png(),
        har = splash:har(),
    }
end

returns: unhandled Lua error: [string "function main(splash)..."]:2: attempt to call method 'on_request' (a nil value)

kmike commented 9 years ago

What is your Splash version? splash:on_request is a new feature in Splash 1.6

AlexIzydorczyk commented 9 years ago

Splash 1.6

AlexIzydorczyk commented 9 years ago

I'm pulling and building from the latest Docker image.

kmike commented 9 years ago

Try updating your scrapinghub/splash image - likely it is not Splash 1.6. I just tried it; for me both the 1.6 and 'latest' images work:

docker run -it -p 8050:8050 scrapinghub/splash:1.6

If I'm not mistaken, docker pull scrapinghub/splash should update the image.

AlexIzydorczyk commented 9 years ago

hmm.. still no luck.

I am putting this URL into the browser:

http://splash:8050/execute?lua_source=function+main%28splash%29%0A%09splash%3Aon_request%28function%28request%29%0A++++%09request%3Aset_proxy%7B%0A++++%09%09host+%3D+%22domain.crawlera.com%22%2C%0A++++%09%09port+%3D+8010%0A++%09%09%7D%0A%09end%29%0A++%0A++assert%28splash%3Ago%28%22https%3A%2F%2Fwww.google.com%2F%22%29%29%0A++assert%28splash%3Await%280.5%29%29%0A++return+%7B%0A++++html+%3D+splash%3Ahtml%28%29%2C%0A++++png+%3D+splash%3Apng%28%29%2C%0A++++har+%3D+splash%3Ahar%28%29%2C%0A++%7D%0Aend

It's this script:

function main(splash)
    splash:on_request(function(request)
        request:set_proxy{
            host = "domain.crawlera.com",
            port = 8010
        }
    end)

    assert(splash:go("https://www.google.com/"))
    assert(splash:wait(0.5))
    return {
        html = splash:html(),
        png = splash:png(),
        har = splash:har(),
    }
end


and I'm still getting:

unhandled Lua error: [string "function main(splash)..."]:2: attempt to call method 'on_request' (a nil value)

kmike commented 9 years ago

Sorry, I still don't understand how this can happen in Splash 1.6. The error attempt to call method 'on_request' (a nil value) means there is no on_request method, but it is present in 1.6 - it works for me with the scrapinghub/splash:latest and scrapinghub/splash:1.6 docker images and when executed locally, and tests pass on Travis for both the 1.6 and master branches.

Try visiting http://splash:8050/ - which version number is displayed?

AlexIzydorczyk commented 9 years ago

Ok - I am indeed using 1.6 - I just rebooted the entire instance and pulled a fresh docker image... very strange.

The problem seems to be with authentication. I'm getting:

unhandled Lua error: [string "function main(splash)..."]:6: http407

with

    splash:on_request(function(request)
        request:set_proxy{'xxx.crawlera.com', 8010, username = 'xxx', password = 'xxx'}
    end)

I am, however, able to use a proxy that doesn't require authentication.

AlexIzydorczyk commented 9 years ago

I think the type of authentication Crawlera requires is not being passed through properly - but otherwise it seems to work now.

kmike commented 9 years ago

> I think the type of authentication Crawlera requires is not being passed through properly - but otherwise it seems to work now.

Thanks, I'll check it.

AlexIzydorczyk commented 9 years ago

@kmike thanks for your help with this. Depending on my environment, it alternates between the proxy authentication error (407) and timing out:

I'm now getting:

Timeout exceeded rendering page

Perhaps the timeouts have something to do with the redirects/multiple proxies that Crawlera uses?

kmike commented 9 years ago

I haven't started to check/debug the 407 errors yet. Do you see a pattern?

Regarding timeouts - see the comments above: https://github.com/scrapinghub/splash/issues/242#issuecomment-113255174 and https://github.com/scrapinghub/splash/issues/242#issuecomment-113255495. Crawlera uses long delays between requests, but these delays are not required for Splash-like workloads; we're working on a fix. In the meantime you can try using Crawlera only for the first request (by writing some ifs in the splash:on_request handler) and increase the timeouts - pass a larger timeout GET argument to /execute, and start Splash with a larger --max-timeout value if 60-second timeouts are still too restrictive.

AlexIzydorczyk commented 9 years ago

So far, I seem to be getting timeouts far more often than 407s - 60s is too restrictive, so I'll try a longer limit...

I'm doing something like this:

    local flag = 1  -- set before the handler so only the first request is proxied

    splash:on_request(function(request)
        if flag == 1 then
            request:set_proxy{"xxx.crawlera.com", 2010, username = "xxx", password = "xxx"}
            flag = 0
        end
    end)

to proxy only the first request...

AlexIzydorczyk commented 9 years ago

Looks like even when I set --max-timeout to 5 minutes, I get a timeout on a different front:

2015-06-19 00:41:11.837282 [render] [28935520] loadFinished: RenderErrorInfo(type='Network', code=4, text=u'Socket operation timed out', url=u'http://google.com/')

qrilka commented 9 years ago

@AlexIzydorczyk did you try something other than Google? google.com is known to redirect to country-local Google domains:

 ~ $ curl -v http://google.com
* Rebuilt URL to: http://google.com/
*   Trying 188.43.66.99...
* Connected to google.com (188.43.66.99) port 80 (#0)
> GET / HTTP/1.1
> Host: google.com
> User-Agent: curl/7.42.1
> Accept: */*
> 
< HTTP/1.1 302 Found
< Cache-Control: private
< Content-Type: text/html; charset=UTF-8
< Location: http://www.google.ru/?gfe_rd=cr&ei=PLSDVbC5Go2DZKmFgeAC
< Content-Length: 256
< Date: Fri, 19 Jun 2015 06:18:36 GMT
< Server: GFE/2.0
< Alternate-Protocol: 80:quic,p=0
< 
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>302 Moved</TITLE></HEAD><BODY>
<H1>302 Moved</H1>
The document has moved
<A HREF="http://www.google.ru/?gfe_rd=cr&amp;ei=PLSDVbC5Go2DZKmFgeAC">here</A>.
</BODY></HTML>
* Connection #0 to host google.com left intact

Probably that could get in the way. Could you try requesting something simpler with no redirects, e.g. http://httpbin.org/get ?

AlexIzydorczyk commented 9 years ago

@qrilka,

Looks like I'm still getting the same problem behind the Crawlera proxy, even with httpbin. I'm using mydomain.crawlera.com as the host, 8010 as the port, and the right credentials (they work elsewhere).

AlexIzydorczyk commented 9 years ago

@kmike, Crawlera support for Splash is a planned future update, right?

AlexIzydorczyk commented 9 years ago

@kmike, @qrilka - by the way, does Splash respect system environment variables? That is, if I set http_proxy, would Splash route requests through it?

kmike commented 9 years ago

Hey @AlexIzydorczyk - yes, Crawlera support for Splash is a planned future update.

Splash doesn't respect system environment variables.

AlexIzydorczyk commented 9 years ago

@kmike thanks, just curious - is the Crawlera/Splash fix something that will be done on the Splash side, or is it a fix on the Crawlera side?

If it's on the Splash side, would you mind giving me a rough idea of what the upgrade would entail? So far, I've been getting around it by making the timeout very, very long.

kmike commented 9 years ago

@AlexIzydorczyk it will require some changes to Crawlera, but not to Splash. We're going to provide a Lua module to enable Crawlera in Splash.

AlexIzydorczyk commented 9 years ago

@kmike thanks, makes sense.

In the meantime, I've managed to get Crawlera to work by using Squid3 as a pass-through proxy between Crawlera and Splash. It occasionally times out, but with a cluster of docker containers running Splash it's usable, and the qt5 branch seems to be much more performant.

Your earlier comment about the queue makes a lot of sense now - the biggest bottleneck is efficiently allocating requests between Splash instances without having them time out too often, while also avoiding underusing resources (Splash seems to be both memory- and CPU-intensive).

Perhaps I should be looking at longer Lua scripts that render multiple pages in one Splash request, rather than making separate requests to Splash (which I presume incur overhead from creating new browser objects each time).
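
For example, a rough sketch of that idea (the comma-separated urls argument is a convention I would define myself, not a built-in):

function main(splash)
    local results = {}
    -- "urls" is a hypothetical GET argument, e.g. urls=http://a.com,http://b.com
    for url in string.gmatch(splash.args.urls, "([^,]+)") do
        assert(splash:go(url))
        assert(splash:wait(0.5))
        results[url] = splash:html()
    end
    return results
end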

vionemc commented 6 years ago

Scrapy-Splash + Crawlera is now working except for JSP pages. Why is that? It's also showing a timeout error.