gocolly / colly

Elegant Scraper and Crawler Framework for Golang
https://go-colly.org/
Apache License 2.0

Interoperability issues with broken gzip responses #511

Open majelbstoat opened 4 years ago

majelbstoat commented 4 years ago

If a server is broken and returns an invalid or missing Content-Length, colly reports "unexpected EOF", despite having received data.

Repro:

  1. Use colly to visit https://regexr.com/39p0t.
  2. res.ContentLength is reported as -1.
  3. An EOF error is returned from httpBackend.Do, even though the body has all the content.

(This is an ErrUnexpectedEOF, not a regular EOF, which is why ioutil.ReadAll returns it.)

I would expect colly to return the Response in this case.

WGH- commented 4 years ago

This is not caused by a missing Content-Length. A missing Content-Length by itself is perfectly fine, since chunked transfer encoding and HTTP/2 framing report the end of the body properly.

This is caused by the broken gzip-compressed response returned by the server:

curl -s https://regexr.com/39p0t --header 'Accept-Encoding: gzip' > /tmp/foo.gz; file /tmp/foo.gz
/tmp/foo.gz: gzip compressed data, from Unix, original size modulo 2^32 4294901760 gzip compressed data, reserved method, ASCII, has CRC, extra field, has comment, encrypted, from FAT filesystem (MS-DOS, OS/2, NT), original size modulo 2^32 4294901760

The server returns a gzip file with 0xffff0000 original size in its header, which is obviously wrong. Go's gzip decompressor keeps track of the advertised size, and returns io.ErrUnexpectedEOF when data ends too early.

I assume most HTTP clients ignore the gzip header, and that's why it works fine in browsers.

majelbstoat commented 4 years ago

Oh, huh. Well, I'm running a patched version with this PR applied, and it fixes it, though perhaps for the wrong reason?

I checked and it wasn't going into the gzip reader block to read the response, but perhaps that's something different.

WGH- commented 4 years ago

Well, it fixes it because it suppresses the error, no surprise here.

majelbstoat commented 4 years ago

OK, so I confirmed it's still broken with a vanilla client, so fair enough.

client := &http.Client{}
request, err := http.NewRequest("GET", "https://regexr.com/39p0t", nil)
if err != nil {
    log.Fatal(err)
}
response, err := client.Do(request)
if err != nil {
    log.Fatal(err)
}
defer response.Body.Close()
b, err := ioutil.ReadAll(response.Body)
if err != nil {
    fmt.Printf("%s\n", err.Error()) // prints "unexpected EOF"
}

So, is there any kind of workaround? On balance, I'd rather have data when it's there, even if the server is doing the wrong thing 🤷🏼‍♂️

Adding

c.OnRequest(func(r *colly.Request) {
    r.Headers.Set("Accept-Encoding", "gzip")
    // ... and/or
    r.Headers.Set("Accept", "gzip")
})

doesn't help. Setting ParseHTTPErrorResponse doesn't work, probably because no response is returned to parse.

I don't imagine I'm the first person to have encountered this, but I couldn't see any issues or documentation to help handle it.

WGH- commented 4 years ago

I don't imagine I'm the first person to have encountered this, but I couldn't see any issues or documentation to help handle it.

I'd guess there aren't that many servers out there that return broken gzips, and people running mass crawls don't analyze their logs with much scrutiny. I checked my own logs and, indeed, I've also come across such sites, but didn't know about it until now.

WGH- commented 4 years ago

Sorry, I was somewhat wrong: the problem is not in the gzip footer. The compressed deflate stream itself is truncated.

Here's the stack trace when the problem occurs (I patched panic into compress/flate/inflate.go).

compress/flate.(*decompressor).moreBits(0xc0003bb300, 0xc000495d50, 0xc000495e20)
    /usr/lib/go/src/compress/flate/inflate.go:698 +0x93
compress/flate.(*decompressor).nextBlock(0xc0003bb300)
    /usr/lib/go/src/compress/flate/inflate.go:303 +0x36
compress/flate.(*decompressor).Read(0xc0003bb300, 0xc0001a6c00, 0x400, 0x400, 0x14, 0xbfb201a7047502fc, 0x39725978)
    /usr/lib/go/src/compress/flate/inflate.go:347 +0x77
compress/gzip.(*Reader).Read(0xc0000cc840, 0xc0001a6c00, 0x400, 0x400, 0x66b8a0, 0x1, 0xc0001a6c00)
    /usr/lib/go/src/compress/gzip/gunzip.go:251 +0x87
net/http.(*http2gzipReader).Read(0xc0004c6990, 0xc0001a6c00, 0x400, 0x400, 0x2, 0x0, 0x0)
    /usr/lib/go/src/net/http/h2_bundle.go:8917 +0x69

WGH- commented 4 years ago

If it were just bad footer, gzip.ErrChecksum would be returned, which is easy to distinguish.

WGH- commented 4 years ago

What still bothers me is that neither browsers nor curl (curl --compressed https://regexr.com/39p0t) have problems with the truncated stream (curl reports a successful exit code). Browsers might be the less interesting case, as IIRC they will always show the truncated content, but I need to understand why curl is OK with that before suggesting a concrete fix for colly.

majelbstoat commented 4 years ago

Yeah, I was wondering the same. Couldn't find anything interesting in curl's trace output, and I'm at the limits of my knowledge here, but I appreciate you taking the time to look at it.

WGH- commented 4 years ago

curl ignores the error here for some reason: https://github.com/curl/curl/blob/8df455479f8801bbebad8839fc96abbffa711603/lib/content_encoding.c#L222-L224

asciimoo commented 4 years ago

Interesting issue. I think we should handle it similarly to how the major browsers/HTTP clients do.

WGH- commented 2 years ago

I've just encountered another interesting case:

$ curl <redacted> --header 'Accept-Encoding: gzip' -s | gunzip >/dev/null
gzip: stdin: decompression OK, trailing garbage ignored

$ curl <redacted> --compressed
...
        </body>
curl: (23) Failed writing received data to disk/application
</html>

Go returns "gzip: invalid header" on this site.