Shnatsel opened 3 years ago
If I understand things correctly, you should be able to get the compressed response using `allow_compression(false)` while manually inserting the relevant header via `header("Accept-Encoding", "gzip, deflate")`.
I tried http://landolts.com and I get a "corrupt deflate stream" error with both the `rust_backend` and `miniz-sys` features of the `flate2` crate. Using the `zlib` feature does not yield a decompression error but an unexpected EOF instead. In all cases, the actual body seems to be decompressed completely. I therefore wonder whether the content length reported by the server is correct...
Here is the body of the above request attached: body.gz
It came with the headers:
server: "nginx"
date: "Fri, 05 Feb 2021 20:28:22 GMT"
content-type: "text/html; charset=UTF-8"
content-length: "6068"
connection: "close"
x-powered-by: "PHP/7.0.0p1"
set-cookie: "PHPSESSID=ac27hsia4s1obmtvrk6jetrf40; path=/"
set-cookie: "mobile=false; path=/"
set-cookie: "user-agent=330cf4ec2a9149ebd093962feb701e34; path=/"
expires: "Mon, 26 Jul 1997 05:00:00 GMT"
cache-control: "no-store, no-cache, must-revalidate"
cache-control: "post-check=0, pre-check=0"
pragma: "no-cache"
last-modified: "Fri, 05 Feb 2021 20:28:22 GMT"
content-encoding: "gzip"
vary: "Accept-Encoding"
gzip does not seem to like it either:

```
> zcat body.gz
...
gzip: body.gz: unexpected end of file
```
but that also suggests that the unexpected EOF I got using the `zlib` feature is just its way of saying "corrupt deflate stream"...
From reading into cURL's source, my initial guess would be that its handling of expected-but-ignored trailer bytes in https://github.com/curl/curl/blob/ecb13416e316fc1c781f865d2bb7e74462ef793b/lib/content_encoding.c#L135 might make the difference...
And cURL also seems to ignore an error condition which I do not understand yet: https://github.com/curl/curl/blob/ecb13416e316fc1c781f865d2bb7e74462ef793b/lib/content_encoding.c#L221
I wonder if we could bypass this issue by simply sending `Accept-Encoding: gzip` instead of `Accept-Encoding: gzip, deflate`. In practice I think deflate is almost never used by websites, and chances are that gzip is supported wherever deflate is. I think reqwest only supports gzip as well.
Worst case, gzip is not supported and the content is sent uncompressed.
I've also seen 20 "invalid gzip header" errors in the top 1M. Here's the data: invalid-gzip-header.tar.gz
That does sound like the issues with "deflate" encoding that the article talks about.
> I wonder if we could bypass this issue by simply sending `Accept-Encoding: gzip` instead of `Accept-Encoding: gzip, deflate`.
What I do not yet understand is how this relates to my tests against http://landolts.com: the server is nginx, i.e. not a Microsoft implementation, and the headers indicate that the result is gzip-encoded. Do you think the header is incorrect and this is a deflate stream nonetheless?
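One way to settle that question would be to look at the first bytes of the saved body.gz: RFC 1952 mandates that a gzip stream starts with the magic bytes `0x1f 0x8b` followed by the compression method `0x08` (deflate), while a raw zlib stream typically starts with `0x78`. A minimal sketch (the helper function is hypothetical, not part of flate2 or attohttpc):

```rust
// Sanity check per RFC 1952: a real gzip stream begins with the magic
// bytes 0x1f 0x8b and the compression method byte 0x08 (deflate). If a
// server labels a zlib or raw deflate stream as gzip, this check fails.
// `looks_like_gzip` is a hypothetical helper for illustration only.
fn looks_like_gzip(body: &[u8]) -> bool {
    body.len() >= 3 && body[0] == 0x1f && body[1] == 0x8b && body[2] == 0x08
}

fn main() {
    // A well-formed gzip prefix passes:
    assert!(looks_like_gzip(&[0x1f, 0x8b, 0x08, 0x00]));
    // A zlib stream (commonly starting with 0x78 0x9c) does not:
    assert!(!looks_like_gzip(&[0x78, 0x9c]));
    println!("checks passed");
}
```

If body.gz passes this check, the header is at least nominally honest and the problem lies elsewhere in the stream.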
The error message comes from https://github.com/rust-lang/flate2-rs/blob/90d9e5ed866742ce8b3946d156830e300d1e5aab/src/zio.rs#L152, and this code is generic w.r.t. gzip or deflate headers, so I don't think it refers to the actual format in use.
I tried playing with the Accept-Encoding header that we send to landolts.com, and the error occurs if we have `gzip` in the accepted encodings, but not with `deflate` or `identity`. So it seems like their server configuration might be broken; the gzip they are sending is not really gzip.
> the gzip they are sending is not really gzip.
While I agree in principle, the observation that both cURL and Firefox are able to handle this suggests there are workarounds. Notably, even we and `flate2` basically decompress everything and only fail at EOF. Judging from the cURL code, there is quite a bit of variability in how gzip is implemented in the wild.
For what it's worth, I did a test with reqwest, and it seems like it also has this problem. It would be neat to get to the bottom of this and fix it across the ecosystem.
```rust
use reqwest::blocking::Client;

fn main() -> Result<(), reqwest::Error> {
    let client = Client::new();
    let req = client
        .get("http://landolts.com")
        .header("Accept-Encoding", "gzip")
        .build()?;
    println!("{:?}", req.headers());
    let resp = client.execute(req)?;
    println!("{}", resp.text()?);
    Ok(())
}
```
```
{"accept-encoding": "gzip"}
Error: reqwest::Error { kind: Decode, source: Custom { kind: UnexpectedEof, error: "unexpected end of file" } }
```
I think we might be able to find some information in this Stack Overflow answer by Mark Adler.
I have the same test code implemented for 4 clients (and growing) in https://github.com/Shnatsel/rust-http-clients-smoke-test; it might come in handy for comparing behavior between clients.
My current guess is that `flate2` expects the stream to end as described in https://github.com/curl/curl/blob/ecb13416e316fc1c781f865d2bb7e74462ef793b/lib/content_encoding.c#L344, i.e. with a CRC and a size field, whereas cURL tries to read the trailer but only errs if there is extra data, not if part of the trailer is missing: https://github.com/curl/curl/blob/ecb13416e316fc1c781f865d2bb7e74462ef793b/lib/content_encoding.c#L135
Admittedly, I am not very confident in my reading of the cURL code. But at least, missing CRC and size information would explain why the body is completely decompressed and only then an error is raised. It might also make sense to e.g. give `flate2` a flag that makes its processing more lenient w.r.t. this redundant information.
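To make the trailer discussion concrete: per RFC 1952, the deflate data is followed by exactly 8 bytes, the CRC-32 of the uncompressed data and then ISIZE (uncompressed length mod 2^32), both little-endian. A lenient decoder could simply skip verification when those bytes are missing, as cURL appears to. A sketch of what such leniency might look like (`parse_gzip_trailer` is a hypothetical helper, not flate2's API):

```rust
// Parse the 8-byte gzip trailer (RFC 1952): CRC-32 of the uncompressed
// data, then ISIZE, both little-endian. A truncated trailer, as served
// by broken servers, returns None instead of an error, so a caller can
// skip verification rather than fail after a full decompression.
fn parse_gzip_trailer(trailer: &[u8]) -> Option<(u32, u32)> {
    if trailer.len() < 8 {
        return None; // truncated trailer: tolerate, skip verification
    }
    let crc = u32::from_le_bytes(trailer[0..4].try_into().ok()?);
    let isize = u32::from_le_bytes(trailer[4..8].try_into().ok()?);
    Some((crc, isize))
}

fn main() {
    // Trailer for the 5-byte payload "hello" (CRC-32 = 0x3610A686):
    let trailer = [0x86, 0xA6, 0x10, 0x36, 0x05, 0x00, 0x00, 0x00];
    assert_eq!(parse_gzip_trailer(&trailer), Some((0x3610A686, 5)));
    // A truncated trailer is tolerated rather than treated as fatal:
    assert_eq!(parse_gzip_trailer(&trailer[..3]), None);
    println!("trailer handled");
}
```

This matches the observed symptom: the body decompresses fully, and only the final CRC/ISIZE check has nothing to read.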
Golang's http library has this issue as well. Looks like curl is one of the few places that figured it out.
Some websites, such as hajime.us, fail to load using attohttpc: `Io Error: corrupt deflate stream`. They load fine using Firefox and the curl command-line tool. Tested using this code. Test tool output from all affected websites: attohttpc-deflate-corrupt-stream.tar.gz
40 websites out of the top million from Feb 3 Tranco list are affected.
I suspect this is an issue with the underlying DEFLATE implementation, but assistance in isolating the failure (e.g. dumping the DEFLATE stream so I could report a bug against miniz_oxide) would be appreciated.
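For isolating the failure, one option is to strip the gzip header from the saved body and hand the raw DEFLATE payload to miniz_oxide directly. The header layout is fixed by RFC 1952: 10 fixed bytes plus optional FEXTRA/FNAME/FCOMMENT/FHCRC fields selected by the flags byte. A std-only sketch (`strip_gzip_header` is a hypothetical helper written for this purpose):

```rust
// Strip the gzip header (RFC 1952) to recover the raw DEFLATE payload,
// e.g. for reporting a bug against miniz_oxide. Handles the optional
// FEXTRA/FNAME/FCOMMENT/FHCRC fields; the trailer (last 8 bytes) is
// left attached, since it may be truncated in the wild.
fn strip_gzip_header(data: &[u8]) -> Option<&[u8]> {
    // Fixed part: magic (0x1f 0x8b), CM=8, FLG, MTIME(4), XFL, OS.
    if data.len() < 10 || data[0] != 0x1f || data[1] != 0x8b || data[2] != 0x08 {
        return None;
    }
    let flg = data[3];
    let mut pos = 10;
    if flg & 0x04 != 0 {
        // FEXTRA: 2-byte little-endian length, then that many bytes.
        let xlen = u16::from_le_bytes([*data.get(pos)?, *data.get(pos + 1)?]) as usize;
        pos += 2 + xlen;
    }
    if flg & 0x08 != 0 {
        // FNAME: zero-terminated original file name.
        pos += data.get(pos..)?.iter().position(|&b| b == 0)? + 1;
    }
    if flg & 0x10 != 0 {
        // FCOMMENT: zero-terminated comment.
        pos += data.get(pos..)?.iter().position(|&b| b == 0)? + 1;
    }
    if flg & 0x02 != 0 {
        // FHCRC: 2-byte header CRC.
        pos += 2;
    }
    data.get(pos..)
}

fn main() {
    // Minimal header (FLG=0) followed by a dummy payload 0xAA 0xBB:
    let gz = [0x1f, 0x8b, 0x08, 0x00, 0, 0, 0, 0, 0x00, 0xff, 0xAA, 0xBB];
    assert_eq!(strip_gzip_header(&gz), Some(&[0xAA, 0xBB][..]));
    // Not gzip at all:
    assert_eq!(strip_gzip_header(&[0x78, 0x9c]), None);
    println!("payload extracted");
}
```

The resulting slice can then be written to a file and fed to a standalone DEFLATE decoder to check whether the stream itself, rather than the framing, is at fault.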