igrigorik / em-http-request

Asynchronous HTTP Client (EventMachine + Ruby)

Fix gzip decompression in cases where first bytes arrive slowly #269

Closed by turley 8 years ago

turley commented 10 years ago

I've noticed intermittent errors with gzip-encoded responses when communicating with web services that support gzip.

After some investigation, I found that in certain cases (e.g., when the first few bytes arrive individually at the decoder, followed by the remainder of the body), gzip-encoded response bodies cause the GZip decoder to raise a DecoderError (really a Zlib::DataError).

For an example that illustrates the breaking behavior (when run against the current implementation), refer to the included "should decompress a vanilla gzip file from 3 individual bytes followed by remaining bytes" test.

By removing the custom GZipHeader implementation, I get consistently correct results regardless of how the data arrives. I'm not entirely sure why the custom gzip header extraction was ever needed since zlib appears capable of handling the header itself. If someone can provide a test case showing the need for GZipHeader, please let me know. Otherwise, it appears unnecessary (and problematic in this case).

igrigorik commented 10 years ago

Custom gzip implementation is to allow for streaming gzip decodes. Some background: https://github.com/igrigorik/em-http-request/search?q=streaming+gzip&ref=cmdform&type=Issues

A lot of people rely on the above functionality, so that's not something we can remove.

turley commented 10 years ago

If a lot of people rely on that functionality, would you mind providing an example or test that demonstrates that streaming functionality in action? And no, the "should extract the stream of a vanilla gzip" test doesn't actually test streaming decompression as far as I can tell.

In my experience, it doesn't actually emit decompressed output until the end of the stream anyway, but maybe I'm not using a good test example...

My point is, if it really is critical, there should probably be a working test for that (right?) and if there's no test, I question the necessity for this custom implementation.

igrigorik commented 10 years ago

Try connecting to Twitter firehose API.

I'd love to have a solid test for it; it's just that the Ruby plumbing / support for it is miserable -- which is why the custom implementation exists in the first place.

Also, it sounds like there is a bug that we need to fix for your case... We should look into that. But dropping streaming gzip support is not the way to solve it.

turley commented 10 years ago

From what I can tell, the Twitter firehose endpoint requires special permission to access (and I don't have that permission). Surely there must be a simpler, more testable example case?

At the end of the day, miserable or not, I would say that if a test isn't provided for the functionality, then that functionality probably shouldn't exist here. Just my opinion though.

memefrenzy commented 10 years ago

Your I/O logic is broken. You are on a non-blocking socket, so you can get partial reads. When that happens, the state loop happily throws away the bytes read so far (see the buffer = "" at the top of the loop) and re-reads some (maybe not even all) of the stream, which obviously breaks. Make your read() function loop until it has all of its input, or return false on EOF. Problem fixed.
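The fix being described, accumulating bytes across calls instead of resetting the buffer, can be sketched as follows. The class and method names here are illustrative, not em-http-request's actual internals:

```ruby
# Sketch of a partial-read-safe field reader: bytes are appended to a
# persistent buffer, and a field is only consumed once enough bytes
# have accumulated. Earlier bytes are never discarded between calls.
class HeaderReader
  def initialize
    @buffer = +''.b
  end

  # Append newly arrived data; return the next n-byte field if it is
  # complete, or nil while the stream is still short (caller retries
  # when more data arrives).
  def read(data, n)
    @buffer << data
    return nil if @buffer.bytesize < n
    @buffer.slice!(0, n)
  end
end
```

With this shape, a gzip header split across three one-byte packets parses the same as one that arrives whole, which is exactly the failing case in the test above.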