Closed andverb closed 9 years ago
Have you tried using the Qt5 branch? The master branch uses Qt4 and a relatively old version of WebKit, which sometimes results in inaccurate rendering.
Nope, I just discovered it exists. I guess it doesn't have a Docker image yet, right? I'd appreciate any hints about how to install it on Ubuntu 14.04. Thanks!
@andverb Take a look at https://github.com/sunu/splash/tree/py3. It has both a Dockerfile and manual installation instructions for 14.04 at https://github.com/sunu/splash/blob/py3/docs/install.rst. The PR is pending a review though. https://github.com/scrapinghub/splash/pull/251
If you try it, let me know if you run into any problem with the installation (either with docker or manual installation) so that I can fix it :)
@sunu I already have the regular version https://github.com/scrapinghub/splash installed both ways :) $ sudo docker pull sunu/splash doesn't work, and if I do $ sudo docker pull scrapinghub/splash, it will install the regular version too, or am I wrong? So the only option left is to clone your repo and install it manually, right? Sorry for so many questions, I'm a bit new to this.
Okay, so I cloned the repo https://github.com/sunu/splash/tree/py3 and, in dockerfiles/splash-jupyter/, executed
$ sudo docker build -t "splash/py3" .
It started building, but I got this:
E: Unable to locate package libzmq3
I installed it with
$ sudo apt-get install libzmq3-dev
and ran the build again, but received the same error.
To install Splash, you first have to build the Dockerfile in the root directory of the cloned repo. The one you're building now is for the IPython (Jupyter) interface. Since this branch isn't merged yet, the base image it expects doesn't exist, so you have to manually point the splash-jupyter Dockerfile at the image you built.
Hope that makes sense. I'm on mobile right now. Will try to explain better later if you continue to have problems.
@sunu Thanks, I managed to build and run it, but it only shows "Initializing..." even when I try to open Google with it. This is the terminal output:
2015-08-27 19:42:02.269320 [-] "172.17.42.1" - - [27/Aug/2015:19:42:01 +0000] "GET / HTTP/1.1" 200 5392 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0"
2015-08-27 19:42:03.002353 [-] "172.17.42.1" - - [27/Aug/2015:19:42:02 +0000] "GET /favicon.ico HTTP/1.1" 404 153 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0"
2015-08-27 19:42:03.045668 [-] "172.17.42.1" - - [27/Aug/2015:19:42:02 +0000] "GET /favicon.ico HTTP/1.1" 404 153 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0"
2015-08-27 19:42:05.551677 [-] "172.17.42.1" - - [27/Aug/2015:19:42:04 +0000] "GET /info?wait=0.5&images=1&expand=1&url=http%3A%2F%2Fgoogle.com&lua_source=function+main%28splash%29%0D%0A++local+url+%3D+splash.args.url%0D%0A++assert%28splash%3Ago%28url%29%29%0D%0A++assert%28splash%3Await%280.5%29%29%0D%0A++return+%7B%0D%0A++++html+%3D+splash%3Ahtml%28%29%2C%0D%0A++++png+%3D+splash%3Apng%28%29%2C%0D%0A++++har+%3D+splash%3Ahar%28%29%2C%0D%0A++%7D%0D%0Aend HTTP/1.1" 200 10953 "http://localhost:8050/" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0"
2015-08-27 19:42:05.808849 [-] "172.17.42.1" - - [27/Aug/2015:19:42:04 +0000] "GET /_harviewer/css/harViewer.css HTTP/1.1" 404 145 "http://localhost:8050/info?wait=0.5&images=1&expand=1&url=http%3A%2F%2Fgoogle.com&lua_source=function+main%28splash%29%0D%0A++local+url+%3D+splash.args.url%0D%0A++assert%28splash%3Ago%28url%29%29%0D%0A++assert%28splash%3Await%280.5%29%29%0D%0A++return+%7B%0D%0A++++html+%3D+splash%3Ahtml%28%29%2C%0D%0A++++png+%3D+splash%3Apng%28%29%2C%0D%0A++++har+%3D+splash%3Ahar%28%29%2C%0D%0A++%7D%0D%0Aend" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0"
2015-08-27 19:42:05.810939 [-] "172.17.42.1" - - [27/Aug/2015:19:42:05 +0000] "GET /_harviewer/scripts/require.js HTTP/1.1" 404 145 "http://localhost:8050/info?wait=0.5&images=1&expand=1&url=http%3A%2F%2Fgoogle.com&lua_source=function+main%28splash%29%0D%0A++local+url+%3D+splash.args.url%0D%0A++assert%28splash%3Ago%28url%29%29%0D%0A++assert%28splash%3Await%280.5%29%29%0D%0A++return+%7B%0D%0A++++html+%3D+splash%3Ahtml%28%29%2C%0D%0A++++png+%3D+splash%3Apng%28%29%2C%0D%0A++++har+%3D+splash%3Ahar%28%29%2C%0D%0A++%7D%0D%0Aend" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0"
2015-08-27 19:42:06.044787 [-] "172.17.42.1" - - [27/Aug/2015:19:42:05 +0000] "GET /_harviewer/scripts/require.js HTTP/1.1" 404 145 "http://localhost:8050/info?wait=0.5&images=1&expand=1&url=http%3A%2F%2Fgoogle.com&lua_source=function+main%28splash%29%0D%0A++local+url+%3D+splash.args.url%0D%0A++assert%28splash%3Ago%28url%29%29%0D%0A++assert%28splash%3Await%280.5%29%29%0D%0A++return+%7B%0D%0A++++html+%3D+splash%3Ahtml%28%29%2C%0D%0A++++png+%3D+splash%3Apng%28%29%2C%0D%0A++++har+%3D+splash%3Ahar%28%29%2C%0D%0A++%7D%0D%0Aend" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0"
Hey @andverb ,
Looks like you don't have the git submodules cloned. Harviewer is a submodule in https://github.com/sunu/splash/tree/master/splash/vendor. So you should either clone the git repo recursively or run
git submodule update --init --recursive
and then build the Docker image.
Hope this helps.
@sunu Okay, I managed to build and run it, but it can't render the page I need. Same result as the base version: it loads only part of the webpage. Maybe it's errors in the page's JavaScript?
@andverb there is now a Qt5-based Docker image available; use
docker run -it -p 8050:8050 scrapinghub/splash:tmp-qt5
For me the following render script:
function main(splash)
  local url = splash.args.url
  assert(splash:go(url))
  assert(splash:wait(5))
  return {
    html = splash:html(),
    png = splash:png(),
    har = splash:har(),
  }
end
renders the following image when tmp-qt5 Splash is used:
@kmike thank you, good sir, everything works like a charm now
@andverb - great to hear that!
function main(splash)
  local url = splash.args.url
  assert(splash:go(url))
  assert(splash:wait(5))
  return {
    html = splash:html(),
    png = splash:png(),
    har = splash:har(),
  }
end
Hello, I am new to scrapy and splash. I'm wondering if this javascript code should be in the spider.py?
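For what it's worth, the script quoted above is Lua (Splash's scripting language), not JavaScript. One way to use it from Python is to keep it as a plain string and send it to Splash's execute endpoint as the lua_source argument, the same argument that shows up in the request log earlier in this thread. Below is a minimal sketch using only the standard library. It assumes Splash is listening on localhost:8050 as in the docker run command above, and build_execute_url is a hypothetical helper name, not part of Scrapy or Splash.

```python
from urllib.parse import urlencode

# The render script from this thread, kept as a plain Python string.
# It is Lua, not JavaScript; Splash runs it server-side.
LUA_SOURCE = """
function main(splash)
  local url = splash.args.url
  assert(splash:go(url))
  assert(splash:wait(5))
  return {
    html = splash:html(),
    png = splash:png(),
    har = splash:har(),
  }
end
"""

# Assumption: Splash is reachable here (e.g. started with -p 8050:8050).
SPLASH_BASE = "http://localhost:8050"


def build_execute_url(page_url, lua_source=LUA_SOURCE, base=SPLASH_BASE):
    """Hypothetical helper: build a GET URL for Splash's execute endpoint.

    lua_source carries the script; the other query arguments (here url)
    are what the script reads through splash.args.
    """
    query = urlencode({"url": page_url, "lua_source": lua_source})
    return f"{base}/execute?{query}"


if __name__ == "__main__":
    target = "http://profile.majorleaguegaming.com/crosswrecks/forums"
    request_url = build_execute_url(target)
    # Nothing is sent here, so the sketch runs without a Splash instance.
    # Fetch request_url with any HTTP client, or request it from a spider.
    print(request_url[:80] + "...")
```

In a Scrapy project the string would typically live in the spider module (or a file next to it), and the spider would request the Splash URL instead of the target page directly. If you use a Splash integration for Scrapy, pass the same string as its lua_source argument rather than building the URL by hand.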
I'm trying to scrape a JavaScript-heavy page using Scrapy+Splash, but Splash can't render it and shows just a few links and the top menu. There are template tags in the HTML it returns, so my guess is that not all of the JavaScript is being executed. This happens even in the web interface (I tried setting the wait timer to up to 30 seconds). I tried PhantomJS and Selenium; they both work fine, but they are slow compared to Splash. Here is an example page I'm trying to scrape: http://profile.majorleaguegaming.com/crosswrecks/forums
Any idea about what can cause this? I checked docs, tried changing a few options, but with zero effect. Thanks.