scrapinghub / splash

Lightweight, scriptable browser as a service with an HTTP API
BSD 3-Clause "New" or "Revised" License

"Accept-Encoding" is preventing content from being automatically decompressed #324

Open jredl-va opened 8 years ago

jredl-va commented 8 years ago

I've found a case where I need to specify the "Accept-Encoding" header in order to access the content I'm attempting to scrape (without the header, the site presents a bot-detection captcha).

Example of the Lua script I'm passing to the execute API:

function main(splash)
  local url = splash.args.url
  splash:set_custom_headers({
     ["Connection"] = "keep-alive",
     ["Accept"] = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",      
     ["Accept-Encoding"] = "gzip, deflate, sdch",
     ["Accept-Language"] = "en-US,en;q=0.8",
  })
  assert(splash:go(url))
  assert(splash:wait(3.0))
  return {
    html = splash:html(),
  }
end

It appears there is an underlying issue that prevents the content from being automatically decompressed: https://github.com/scrapinghub/splash/blob/master/splash/proxy_server.py#L90

Is there a workaround to force decompression when the Accept-Encoding header is present?

jredl-va commented 8 years ago

[Screenshot: a capture of the non-decompressed content as rendered by Splash.]

dvdbng commented 8 years ago

Sounds related to this WebKit bug: https://bugs.webkit.org/show_bug.cgi?id=63696

(BTW, the function you are referencing in the first message is not causing this; that module is for the "Splash as a proxy server" feature and isn't used in normal operation.)

jredl-va commented 8 years ago

Sorry, my initial comment wasn't clear: I intended to point out that issue, not the proxy code in Splash.

Is there a way to force decompression to occur?

What's happening in my case is that the initial response content is gzipped. Standard browsers decompress the content, and after some cookie negotiation a JavaScript redirect takes you to the actual page.
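
For illustration, a minimal sketch of checking whether that redirect happened (the splash:url() check is an addition on my part, not something from my original script): if Splash fails to decompress the body, the page's JavaScript never runs and the final URL matches the input URL.

function main(splash)
  splash:set_custom_headers({
    ["Accept-Encoding"] = "gzip, deflate, sdch",
  })
  assert(splash:go(splash.args.url))
  assert(splash:wait(3.0))
  -- if the gzipped body was never decompressed, the JavaScript never
  -- runs, the redirect never fires, and the URL stays the same
  return {
    final_url = splash:url(),
    html = splash:html(),
  }
end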

jredl-va commented 8 years ago

@Youwotma we compiled a custom version of WebKit / PhantomJS where we modified the setHeaderField line to read:

request.setHeaderField("Accept-Encoding", "gzip, deflate, sdch");

rather than attempting to override the Accept-Encoding header. This worked in our PhantomJS trial.

Have you guys considered compiling WebKit as part of Splash, or shipping the binary with Splash, rather than installing it into the Docker containers with apt install? Any ideas on how to accomplish this with Splash?

kmike commented 8 years ago

Have you guys considered compiling WebKit as part of Splash, or shipping the binary with Splash, rather than installing it into the Docker containers with apt install?

@jredl-va I think this should be a last resort; it'd be really good to avoid it.

The problem doesn't look easy; I don't have an immediate solution. Maybe the easiest workaround is to add the "sdch" token to the Accept-Encoding header in a proxy server connected to the Splash Docker image. Maybe Squid can do that job; it would also give you an on-disk HTTP cache shared between many Splash processes. I'm thinking of adding Squid to https://github.com/TeamHG-Memex/aquarium.
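
For the Splash side of such a setup, a minimal sketch, assuming Squid listens at squid:3128 (host and port are placeholders) and does the Accept-Encoding rewriting itself:

function main(splash)
  splash:on_request(function(request)
    -- placeholder host/port: point this at wherever Squid is listening;
    -- the proxy, not Splash, appends "sdch" to Accept-Encoding, so Qt's
    -- automatic decompression stays enabled
    request:set_proxy{host = "squid", port = 3128}
  end)
  assert(splash:go(splash.args.url))
  assert(splash:wait(3.0))
  return {html = splash:html()}
end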

jredl-va commented 8 years ago

@kmike interesting. We'll give proxying the requests a try and report back.

kmike commented 8 years ago

@jredl-va did you solve it?

cp2587 commented 6 years ago

I have the exact same issue (the initial request needs to be compressed, and some JS has to be run by Splash on the received response). Did you find a workaround, @jredl-va @kmike?

jredl-va commented 6 years ago

@cp2587 I ended up compiling a custom version of PhantomJS. Here is the diff:

Index: src/network/access/qhttpnetworkconnection.cpp
===================================================================
--- src/network/access/qhttpnetworkconnection.cpp   (revision 46ab84fdbe0d643414f2f7ecf17fe1c1d1457033)
+++ src/network/access/qhttpnetworkconnection.cpp   (revision )
@@ -289,14 +289,9 @@
     // encoding.
     value = request.headerField("accept-encoding");
     if (value.isEmpty()) {
-#ifndef QT_NO_COMPRESS
         request.setHeaderField("Accept-Encoding", "gzip, deflate");
-        request.d->autoDecompress = true;
-#else
-        // if zlib is not available set this to false always
-        request.d->autoDecompress = false;
-#endif
     }
+    request.d->autoDecompress = true;

     // some websites mandate an accept-language header and fail
     // if it is not sent. This is a problem with the website and

wflanagan commented 6 years ago

FYI, this is still an issue as of 21 May 2018. How I'm handling it is setting Accept-Encoding to "identity", but UGH is it slow. (Using the latest Docker version.)
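
For reference, a minimal sketch of that "identity" workaround as a Splash script (the 0.5s wait is an arbitrary choice):

function main(splash, args)
  splash:on_request(function(request)
    -- "identity" asks the server for an uncompressed body, so nothing
    -- needs to be inflated; slower transfers are the trade-off
    request:set_header("Accept-Encoding", "identity")
  end)
  assert(splash:go(args.url))
  assert(splash:wait(0.5))
  return {html = splash:html()}
end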

wflanagan commented 6 years ago

So, for the purposes of showing this: I'm running the Docker image tagged latest (i.e. 3.2).

Using the front-end web interface, if you go to the following URL with the script I'm pasting in, you'll get a bad (i.e. still gzip-compressed) page.

Note that I tried to 'handle' this by extracting the gzip part and decompressing it, but it's not a valid gzip string, so I'm not quite sure what to do with these.

So, this isn't "fixable" outside of Splash. This is not going through any proxies or anything else.

URL: http://sallysbakingaddiction.com/2018/05/21/angel-food-cupcakes/

function main(splash, args)
  splash:on_request(function(request)
    request:set_header("ACCEPT-ENCODING", "gzip, deflate, br")
  end)
  assert(splash:go(args.url))
  assert(splash:wait(0.5))
  return {html = splash:html(), info = splash:har()}
end

wflanagan commented 6 years ago

Why should this be a priority to fix?

1) There are websites where Splash with the Accept-Encoding = "identity" setting doesn't return anything; the browser just hangs. I'm not sure if this is because the web server doesn't have an internal cache and takes longer than 30 seconds (the default timeout) to render, or if it's an active crawl-avoidance technique.

2) Common browsers default to an Accept-Encoding that is some mix of br, gzip, and deflate (Safari: br, gzip, deflate; Chrome: gzip, deflate, br; Firefox: gzip, deflate, br). Not sending the same values increases the "visibility" of our proxy as not being a standard browser.

3) Not using compressed documents slows transfer times and the overall throughput of the system. This is particularly true if you're using a proxy, as there is a "double transfer" of the html from the proxy to Splash, and then from Splash to the client.

4) What IS being transferred to the end client is not decompressable/inflatable using (at least in the Ruby case) any of the standard libraries for inflating compressed HTML pages. I tried to fix this by taking what Splash delivers and handling it myself, but the resulting string is not compatible with any gzip or Brotli decompressor I can find in my language (Ruby, and yes, I tried more than one). My actual client uses Faraday as a wrapper for HTTP requests, and the gzip handling in Faraday has zero problems with these sites outside of Splash. (See the diagnostic sketch after this list.)
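
For what it's worth, a hedged diagnostic sketch for point 4: fetch the raw body with splash:http_get (bypassing the browser window) and check for the gzip magic bytes. Treat the field names here as my reading of the Splash Lua docs, an illustration rather than a tested fix:

local treat = require("treat")

function main(splash, args)
  local resp = splash:http_get(args.url)
  local body = treat.as_string(resp.body)
  return {
    -- a real gzip stream starts with the magic bytes 0x1f 0x8b;
    -- if this is false, the "compressed" output isn't gzip at all
    looks_like_gzip = body:byte(1) == 0x1f and body:byte(2) == 0x8b,
    content_encoding = resp.headers["Content-Encoding"],
  }
end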

Net is that I'm somewhat stuck. Help!

blablacio commented 5 years ago

I'm also encountering the same problem with the latest Splash Docker image. Without the Accept-Encoding header the crawler is blocked, and if I specify it, I get compressed output.

Is there any solution other than recompiling WebKit?

StasDeep commented 5 years ago

@kmike is this a bug in WebKit or in Splash? This is a serious problem, as it's easily detectable by anti-bot systems.

blablacio commented 5 years ago

@StasDeep This Dockerfile worked for me along with the patch above:

FROM scrapinghub/splash:3.3.1

RUN apt update && apt install -y g++ libssl-dev

WORKDIR /tmp

RUN wget http://master.qt.io/archive/qt/5.9/5.9.1/single/qt-everywhere-opensource-src-5.9.1.tar.xz
RUN tar xvf qt-everywhere-opensource-src-5.9.1.tar.xz

WORKDIR /tmp/qt-everywhere-opensource-src-5.9.1

ADD splash/qhttpnetworkconnection.patch .

RUN patch qtbase/src/network/access/qhttpnetworkconnection.cpp < qhttpnetworkconnection.patch
RUN ./configure -confirm-license -opensource -no-compile-examples
RUN make module-qtbase

RUN cp -rf qtbase/lib/libQt5Network.* /opt/qt59/5.9.1/gcc_64/lib

And the patch (splash/qhttpnetworkconnection.patch):

@@ -306,12 +306,13 @@
     if (value.isEmpty()) {
 #ifndef QT_NO_COMPRESS
         request.setHeaderField("Accept-Encoding", "gzip, deflate");
-        request.d->autoDecompress = true;
 #else
         // if zlib is not available set this to false always
         request.d->autoDecompress = false;
 #endif
     }
+
+    request.d->autoDecompress = true;

     // some websites mandate an accept-language header and fail
     // if it is not sent. This is a problem with the website and

blablacio commented 5 years ago

@StasDeep And it's actually a bug (or rather a feature) in Qt. See here.

StasDeep commented 5 years ago

@blablacio thank you for the Dockerfile. It seems like QT_NO_COMPRESS is defined, and because of that I cannot use the Accept-Encoding header; Splash just hangs in this case.

StasDeep commented 5 years ago

I made it work by adding the -zlib argument to the ./configure command in blablacio's Dockerfile.
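
i.e., assuming the rest of the Dockerfile is unchanged, the configure step becomes:

RUN ./configure -confirm-license -opensource -no-compile-examples -zlib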

mwbelt commented 4 years ago

This problem is still present in Splash 3.4.x, but simply swapping the base image version in that Dockerfile doesn't work.

Has anybody managed a workaround for this in the latest version of Splash? The compressed output I get back doesn't seem to be anything I can decompress, so any sites that send compressed output are unscrapable.

Any thoughts on how to make this work on 3.4, since 3.3 doesn't support the latest TLS, @kmike, @jredl-va, or @blablacio?

mwbelt commented 4 years ago

Any thoughts, @kmike , @jredl-va , @blablacio ?

The compressed output is keeping me from doing anything useful with this on a number of sites, and I'd imagine the same is true in-house with Scrapy. Minus this, Splash is absolutely perfect for a project I'm working on!