scrapinghub / splash

Lightweight, scriptable browser as a service with an HTTP API
BSD 3-Clause "New" or "Revised" License

Splash can't render javascript - heavy page #281

Closed andverb closed 9 years ago

andverb commented 9 years ago

I'm trying to scrape a JavaScript-heavy page using Scrapy + Splash, but Splash can't render it and shows only a few links and the top menu. The HTML it returns still contains template tags, so my guess is that not all of the JavaScript is being executed. The same happens in the web interface (I tried setting the wait timer to up to 30 seconds). I tried PhantomJS and Selenium; both work fine, but they are slow compared to Splash. Here is an example page I'm trying to scrape: http://profile.majorleaguegaming.com/crosswrecks/forums

Any idea what could cause this? I checked the docs and tried changing a few options, but with zero effect. Thanks.

sunu commented 9 years ago

Have you tried using the Qt5 branch? The master branch uses Qt4 and a relatively old version of WebKit, which sometimes results in inaccurate rendering.

andverb commented 9 years ago

Nope, I just discovered it exists. I guess it doesn't have a Docker image yet, right? I'd appreciate any hints about how to install it on Ubuntu 14.04. Thanks!

sunu commented 9 years ago

@andverb Take a look at https://github.com/sunu/splash/tree/py3. It has both a Dockerfile and manual installation instructions for 14.04 at https://github.com/sunu/splash/blob/py3/docs/install.rst. The PR is pending a review though. https://github.com/scrapinghub/splash/pull/251

If you try it, let me know if you run into any problem with the installation (either with docker or manual installation) so that I can fix it :)

andverb commented 9 years ago

@sunu I already have the regular version (https://github.com/scrapinghub/splash) installed both ways :) Running $ sudo docker pull sunu/splash doesn't work, and if I run $ sudo docker pull scrapinghub/splash, it will install the regular version too, or am I wrong? So the only option left is to clone your repo and install it manually, right? Sorry for so many questions, I'm a bit new to this.

andverb commented 9 years ago

Okay, so I cloned the repo https://github.com/sunu/splash/tree/py3 and ran this in dockerfiles/splash-jupyter/:

$ sudo docker build -t "splash/py3" .

It started building, but I got this:

E: Unable to locate package libzmq3

I installed it with

$ sudo apt-get install libzmq3-dev

and ran the build again, but received the same error.

sunu commented 9 years ago

To install Splash, you have to build the Dockerfile in the root directory of the cloned repo first. The one you're building now is for the IPython (Jupyter) interface. Since this branch isn't merged yet, the base image it expects doesn't exist, so you have to reference your locally built image manually in the splash-jupyter Dockerfile.

Hope that makes sense. I'm on mobile right now. I'll try to explain better later if you continue to have problems.

andverb commented 9 years ago

@sunu Thanks, I managed to build and run it, but it only shows "Initializing..." even when I try to open Google with it. This is the terminal output:

2015-08-27 19:42:02.269320 [-] "172.17.42.1" - - [27/Aug/2015:19:42:01 +0000] "GET / HTTP/1.1" 200 5392 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0"
2015-08-27 19:42:03.002353 [-] "172.17.42.1" - - [27/Aug/2015:19:42:02 +0000] "GET /favicon.ico HTTP/1.1" 404 153 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0"
2015-08-27 19:42:03.045668 [-] "172.17.42.1" - - [27/Aug/2015:19:42:02 +0000] "GET /favicon.ico HTTP/1.1" 404 153 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0"
2015-08-27 19:42:05.551677 [-] "172.17.42.1" - - [27/Aug/2015:19:42:04 +0000] "GET /info?wait=0.5&images=1&expand=1&url=http%3A%2F%2Fgoogle.com&lua_source=function+main%28splash%29%0D%0A++local+url+%3D+splash.args.url%0D%0A++assert%28splash%3Ago%28url%29%29%0D%0A++assert%28splash%3Await%280.5%29%29%0D%0A++return+%7B%0D%0A++++html+%3D+splash%3Ahtml%28%29%2C%0D%0A++++png+%3D+splash%3Apng%28%29%2C%0D%0A++++har+%3D+splash%3Ahar%28%29%2C%0D%0A++%7D%0D%0Aend HTTP/1.1" 200 10953 "http://localhost:8050/" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0"
2015-08-27 19:42:05.808849 [-] "172.17.42.1" - - [27/Aug/2015:19:42:04 +0000] "GET /_harviewer/css/harViewer.css HTTP/1.1" 404 145 "http://localhost:8050/info?wait=0.5&images=1&expand=1&url=http%3A%2F%2Fgoogle.com&lua_source=function+main%28splash%29%0D%0A++local+url+%3D+splash.args.url%0D%0A++assert%28splash%3Ago%28url%29%29%0D%0A++assert%28splash%3Await%280.5%29%29%0D%0A++return+%7B%0D%0A++++html+%3D+splash%3Ahtml%28%29%2C%0D%0A++++png+%3D+splash%3Apng%28%29%2C%0D%0A++++har+%3D+splash%3Ahar%28%29%2C%0D%0A++%7D%0D%0Aend" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0"
2015-08-27 19:42:05.810939 [-] "172.17.42.1" - - [27/Aug/2015:19:42:05 +0000] "GET /_harviewer/scripts/require.js HTTP/1.1" 404 145 "http://localhost:8050/info?wait=0.5&images=1&expand=1&url=http%3A%2F%2Fgoogle.com&lua_source=function+main%28splash%29%0D%0A++local+url+%3D+splash.args.url%0D%0A++assert%28splash%3Ago%28url%29%29%0D%0A++assert%28splash%3Await%280.5%29%29%0D%0A++return+%7B%0D%0A++++html+%3D+splash%3Ahtml%28%29%2C%0D%0A++++png+%3D+splash%3Apng%28%29%2C%0D%0A++++har+%3D+splash%3Ahar%28%29%2C%0D%0A++%7D%0D%0Aend" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0"
2015-08-27 19:42:06.044787 [-] "172.17.42.1" - - [27/Aug/2015:19:42:05 +0000] "GET /_harviewer/scripts/require.js HTTP/1.1" 404 145 "http://localhost:8050/info?wait=0.5&images=1&expand=1&url=http%3A%2F%2Fgoogle.com&lua_source=function+main%28splash%29%0D%0A++local+url+%3D+splash.args.url%0D%0A++assert%28splash%3Ago%28url%29%29%0D%0A++assert%28splash%3Await%280.5%29%29%0D%0A++return+%7B%0D%0A++++html+%3D+splash%3Ahtml%28%29%2C%0D%0A++++png+%3D+splash%3Apng%28%29%2C%0D%0A++++har+%3D+splash%3Ahar%28%29%2C%0D%0A++%7D%0D%0Aend" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0"

sunu commented 9 years ago

Hey @andverb, it looks like you don't have the git submodules cloned. Harviewer is a submodule under https://github.com/sunu/splash/tree/master/splash/vendor. So you should either clone the git repo recursively or run git submodule update --init --recursive, and then build the Docker image. Hope this helps.

andverb commented 9 years ago

@sunu Okay, I managed to build and run it, but it can't render the page I need. Same result as the base version: it loads only part of the webpage. Maybe it's errors in the page's JavaScript?

kmike commented 9 years ago

@andverb there is now a Qt5-based Docker image available; use

docker run -it -p 8050:8050 scrapinghub/splash:tmp-qt5

For me the following render script:

function main(splash)
  local url = splash.args.url
  assert(splash:go(url))
  assert(splash:wait(5))
  return {
    html = splash:html(),
    png = splash:png(),
    har = splash:har(),
  }
end

renders the following image when tmp-qt5 Splash is used: [image: index]

andverb commented 9 years ago

@kmike thank you, good sir, everything works like a charm now

kmike commented 9 years ago

@andverb - great to hear that!

codewithpatch commented 4 years ago

function main(splash)
  local url = splash.args.url
  assert(splash:go(url))
  assert(splash:wait(5))
  return {
    html = splash:html(),
    png = splash:png(),
    har = splash:har(),
  }
end

Hello, I am new to Scrapy and Splash. I'm wondering if this javascript code should be in the spider.py?
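
A note on the question above: the script quoted in this thread is Lua rather than JavaScript, and it runs inside Splash, not in the spider. One common way to use it is to keep it as a plain string in Python code and send it to Splash's `execute` HTTP endpoint in the `lua_source` parameter. The sketch below is only an illustration using the standard library; the helper names `build_execute_request` and `fetch_rendered` are hypothetical, and the base URL assumes Splash is listening on localhost:8050 as in the `docker run` command earlier in the thread.

```python
import json
import urllib.request

# The Lua render script from this thread, stored as an ordinary Python string.
LUA_SOURCE = """
function main(splash)
  local url = splash.args.url
  assert(splash:go(url))
  assert(splash:wait(5))
  return {
    html = splash:html(),
    png = splash:png(),
    har = splash:har(),
  }
end
"""


def build_execute_request(page_url, splash_base="http://localhost:8050"):
    """Build (but do not send) a POST request for Splash's execute endpoint.

    Hypothetical helper: `splash_base` is an assumption about where Splash runs.
    """
    # Extra keys such as "url" become available to the script as splash.args.url.
    payload = {"lua_source": LUA_SOURCE, "url": page_url}
    return urllib.request.Request(
        splash_base + "/execute",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_rendered(page_url, splash_base="http://localhost:8050"):
    """Send the request to a running Splash instance and decode the JSON reply."""
    request = build_execute_request(page_url, splash_base)
    with urllib.request.urlopen(request, timeout=90) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a Splash container running, calling `fetch_rendered("http://profile.majorleaguegaming.com/crosswrecks/forums")` should return a dict whose keys match the table returned by the Lua `main` function (`html`, `png`, `har`). If the scrapy-splash package is used instead, the same string is typically passed as the `lua_source` argument of a request to the `execute` endpoint, so the Lua code can live as a string constant in the spider module.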