Recommend MultiGzDecoder over GzDecoder in docs

jsha commented 2 years ago

Part of #178.

I removed the text that says "the specification, however, allows ..." and the comment about bioinformatics because they make it sound like MultiGzDecoder is a rare thing that you should only enable if you need it, but it's actually the correct choice for almost all cases since it implements what the RFC considers a "gzip file."

jsha commented 1 year ago

I see you've also made some tweaks to these docs in https://github.com/rust-lang/flate2-rs/pull/347. Thanks! I went ahead and updated this PR, and I still think it's worthwhile.

In this PR I reference the concept of the "gzip file" instead of multistream, which is what the RFC talks about as the core concept (with "members" being a sub-concept).

Also, I removed the list of cases where gzip files containing multiple members are commonly used, since that seems to suggest that gzip files containing multiple members are only used in particular endeavors, when really they can be used any time you have a .gz file or a Content-Encoding: gzip HTTP response.

Similarly, I removed the comment "Use [MultiGzDecoder] if your file has multiple streams" because in many cases the user of this library doesn't know in advance if they will be processing a file with multiple streams or not; but it is always correct to use a MultiGzDecoder because it will handle a single-stream file just fine.

Byron commented 1 year ago

I think it would be nice if @JohnTitor could have another look. Overall, I believe this PR would help to get a better understanding on the intended use of MultiGzDecoder, and I would definitely have benefitted from that while working with #348 . So I see a lot of value in getting it merged. Thank you.

joshtriplett commented 1 year ago

Whether the specification supports it or not, in many cases I think people will be surprised if they encounter a gzip stream with multiple files in it. I think that people typically think of gzip as a single-file format, and combine it with things like tar when storing multiple files.

So, I don't think we should go as far as saying that most people should use MultiGzDecoder. I think it's perfectly fine to adjust the documentation to say that it's allowed, and explain the distinction, but I think we should actually explain the distinction and the common usage rather than presenting MultiGzDecoder as the most common case.

joshtriplett commented 1 year ago

I added some suggestions implementing my comment https://github.com/rust-lang/flate2-rs/pull/324#issuecomment-1644854362 .

I am not attached to exact wording here, feel free to wordsmith. Just going for the general spirit of documenting the situation without pushing people towards MultiGzDecoder.

workingjubilee commented 1 year ago

Is there a disadvantage to using MultiGzDecoder when the stream only has a single file?

jsha commented 1 year ago

in many cases I think people will be surprised if they encounter a gzip stream with multiple files in it.

I think people will not be surprised, because their tools do not distinguish a gzip file containing multiple members from a gzip file containing a single member. For instance, here's a gzip file:

file.txt.gz

If you run it through gzip -dc or zcat you will get:

$ gzip -dc < file.txt.gz 
jsha is the new administrative assistant

However, if you run this Rust program (based on the GzDecoder example), you will get a very incorrect result ("jsha is the new admin") with no error:

use flate2::read::GzDecoder;
use std::io::prelude::*;
use std::{fs, io};

fn decode_reader(bytes: &[u8]) -> io::Result<String> {
    let mut gz = GzDecoder::new(&bytes[..]);
    let mut s = String::new();
    gz.read_to_string(&mut s)?;
    Ok(s)
}

fn main() {
    let contents = fs::read("file.txt.gz").unwrap();
    let decoded = decode_reader(&contents).unwrap();
    println!("{}", decoded);
}

$ cargo run
   Compiling zcat v0.1.0 (/home/jsha/zcat)
    Finished dev [unoptimized + debuginfo] target(s) in 0.57s
     Running `target/debug/zcat`
jsha is the new admin

As another concrete example: I came across this issue because I maintain an HTTP client, ureq. We implement automatic gzip decoding. When investigating a bug, I discovered that we were using GzDecoder instead of MultiGzDecoder. That's wrong; a response with Content-Encoding: gzip a gzip file and may very well have multiple members. But it's particularly scary because when that happens, people would get only part of the file, and no error indicating anything was wrong.

So I think we should push people towards MultiGzDecoder. It's doesn't have a "silently do the wrong thing" failure mode. If we were designing from scratch, I would argue that the library shouldn't have a single-stream decoder that wraps Read / BufRead, because of the likelihood of silently dropping data. Instead, to support use cases that want to know about stream boundaries, there should be a secondary API that yields an Iterator<Item = GzipMember>, where GzipMember implements read. And then if the user chooses to drop the Iterator after reading a single member that's fine, but it's clear that there may be multiple members to deal with.

Now, I will say this, even though it slightly contradicts my point: In writing up my arguments I also went to check what various HTTP clients do with Content-Encoding: gzip and a response that contains multiple members. To my great surprise, the ones I checked seem to fail after the first member. For instance:

$ curl https://jacob.hoffman-andrews.com/file.txt --compressed
curl: (23) Failed writing received data to disk/application
jsha is the new admin

And Chrome and Firefox both display "jsha is the new admin" without any indication of error. I would still argue that this is a bug, but it is evidently a common one. I know all three programs use zlib; perhaps the API design of zlib encourages this particular bug?

workingjubilee commented 1 year ago

I agree with jsha: the evidence that Rust programs which do not adhere to this advisory will drop data on the floor is evidence enough that we should in fact recommend MultiGzDecoder is used. While not memory unsoundness, Rust programmers often are... irked when they find out an interface that they were using sends half or more of their data to /dev/null.

jongiddy commented 1 year ago

There is a difference in behavior that should be noted. MultiGzDecoder behaves differently for a gzip member embedded in other data. If there is non-gzip data following the gzip data GzipDecoder succeeds, MultiGzDecoder fails.

As an example, for the write versions GzipDecoder returns the number of bytes consumed, leaving any remaining bytes for the caller to handle. MultiGzDecoder returns an Err since it expects more gzip data.

Correct behavior depends on knowledge of the data. Is it only gzip data and may have multiple members, or is it a stream of data elements that contains gzipped data? HTTP clients have a real problem here as some servers send multi-member gzip responses and some other servers send additional data after the gzip-encoded content. There is probably a server that does both.

That is why it is legitimate to have two types of decoder, and it is the caller who needs to choose which is most appropriate for their data.

Byron commented 1 year ago

After having read this I feel that instead of nudging towards a particular implementation, it would be preferable to teach about the concept of "members" on the front-page while providing some of the examples we have seen here to explain what this can mean for an application. From there, users should be able to make the choice for their particular application.

For this PR this could mean one more section in the lib.rs documentation, along with the explicit lack of preferences in the wording for the documentation of Gz and MultiGz variants of types.

If this sounds feasible, give this comment a thumbs up. Otherwise, you could thumbs down (just to have an easy count) along with suggestions on what to do instead. Thanks everyone!

joshtriplett commented 1 year ago

(FWIW, I withdraw some of the github suggestions I left, in favor of what's summarized in this comment.)

@jongiddy wrote:

MultiGzDecoder behaves differently for a gzip member embedded in other data. If there is non-gzip data following the gzip data GzipDecoder succeeds, MultiGzDecoder fails.

That's one of the reasons, yeah. We should document that as a reason people should consider when choosing what to use.

Also, as @jsha noted, single-member decoding seems to be what many browsers and web clients do, and it'll be surprising if you get different results in different tools. Especially if the subsequent members are silently appended when they may be intended to be different "files".

It'd be worth coordinating with browser vendors to find out whether they have any practical considerations here (e.g. some hard-won knowledge about what happens in the wild). Are there some Firefox or Curl developers in the relevant area that we could poke?

@jsha wrote:

If we were designing from scratch, I would argue that the library shouldn't have a single-stream decoder that wraps Read / BufRead, because of the likelihood of silently dropping data. Instead, to support use cases that want to know about stream boundaries, there should be a secondary API that yields an Iterator<Item = GzipMember>,

Yeah, I'd agree with this. We could still add that API, which seems much more reasonable than MultiGzDecoder since it doesn't silently traverse the boundaries between files. And that API seems more reasonable to steer people towards, because it doesn't make the same tradeoffs of not being able to continue reading. In particular, if we're careful, you could iterate over the first member and then read the underlying stream for non-gzip data (without any issue at the boundary). (The iterator type could have a "give me the rest of the data as a Read/BufRead instance" method that consumes the iterator.) And conversely, we could have a method on the iterator that consumes the iterator and gives you a single stream but errors if there's more data afterwards, so that when people aren't expecting appended data they can check that assumption.

Summarizing:

Could we check with Firefox and/or Curl developers to find out if there's practical knowledge here that we're lacking? Or, are there open bugs, where knowledge like "this is a bug but here's why we can't fix it" or "this is a bug and we should actually fix it but we haven't yet" might live? If they have specific hard-won considerations about data in the wild, I'd be happy to defer to those. But the fact that multiple pieces of software out there don't decode multiple streams by default makes me hesitant to diverge.
In the meantime, it seems entirely reasonable for the documentation to document, at length, the considerations that should go into picking between the options we have. Perhaps we could explain this in detail at the top level documentation of the crate, and then link that from all the various decoders to avoid duplication? This includes explaining the single-stream/multi-stream issue, the tradeoffs about being able to have non-gzip data after one gzip stream, the fact that the single-stream decoder will ignore a subsequent gzip stream, the fact that many pieces of software don't seem to decode multiple streams...
We might be able to have a better API that better encourages people to explicitly handle this consideration. I don't think we should block merging better documentation until we have a better API, but I do think that having a better API would make the issue visible in the type system so that people need to handle it.

workingjubilee commented 1 year ago

Hearing there are common use-cases that combine a gz-encoded stream with something else does seem worth thinking about. It feels like people should be introduced to the idea that if they are handling a simple "gz file" that they probably should just handle it with MultiGzDecoder but that if they are handling a more complex encoding they might need to do something more complex and that is where GzDecoder comes into play. Ironically, it is the "simpler" version that is used for the "complex" cases, which violates many intuitions.

jongiddy commented 1 year ago

Exactly, and it is also the case that multi-member gzip files are uncommon (outside of Bioinformatics) so GzDecoder works for the common use case most of the time, which makes it unexpected when it doesn't.

I wonder whether it is worth introducing new types GzipDecoder and GzipMemberDecoder that alias MultiGzDecoder and GzDecoder and then deprecate the Gz types? On one hand it fixes the simple vs complex problem, on the other it might just add more confusion.

workingjubilee commented 1 year ago

Maybe in the future? It seems like it'd just add more confusion right now. Improving docs and adding the hypothetical iterator seems like the "way to go" in terms of cleaning up the API, probably.

jongiddy commented 1 year ago

The iterator idea is a cool abstraction of the problem but not really needed by users of this crate. Almost no-one cares about the individual members, they just want all the uncompressed data from the gzip data whether or not it is multi-member and whether or not it is embedded in other data. The one exception of which I am aware is the BGZF format, where an index is used to seek to the correct gzip member and then uncompress just that one member. Neither benefits from the iterator abstraction.

workingjubilee commented 1 year ago

I don't disagree that most users will basically go .iter().collect::<String>() or whatever, but it was also feature requested in https://github.com/rust-lang/flate2-rs/issues/192 for "My data is HUGE! Please let me lazily iterate it!" use-cases.

the8472 commented 1 year ago

The iterator idea is a cool abstraction of the problem but not really needed by users of this crate. Almost no-one cares about the individual members, they just want all the uncompressed data from the gzip data whether or not it is multi-member and whether or not it is embedded in other data.

As a low-level crate it still makes sense to expose the basic building blocks for those who need it and then provide an abstraction for the most common use-cases.

workingjubilee commented 1 year ago

along with the explicit lack of preferences in the wording for the documentation of Gz and MultiGz variants of types.

If this sounds feasible, give this comment a thumbs up. Otherwise, you could thumbs down (just to have an easy count) along with suggestions on what to do instead. Thanks everyone!

Altogether, for this PR in particular, I'm in favor of this, modulo "we shouldn't be shy about saying that MultiGzDecoder is the simpler option for a 'basic' file". :+1:

Could we check with Firefox and/or Curl developers to find out if there's practical knowledge here that we're lacking? Or, are there open bugs, where knowledge like "this is a bug but here's why we can't fix it" or "this is a bug and we should actually fix it but we haven't yet" might live? If they have specific hard-won considerations about data in the wild, I'd be happy to defer to those. But the fact that multiple pieces of software out there don't decode multiple streams by default makes me hesitant to diverge.

Let's make this a separate issue, but I suspect this might be a case of "EXTREMELY pedantic adherence to a specific RFC's spec in a way that is unnecessarily underachieving".

workingjubilee commented 1 year ago

Also see:

https://github.com/rust-lang/flate2-rs/issues/339

This issue is an extremely common papercut.

Byron commented 1 year ago

I have added a new top-level section to explain the intricacies of the gzip file format. Further I did my best to integrate all of Josh's suggestions, while also not applying a new paragraph if it appeared to say the same thing as a paragraph above it.

I particularly invite for a thorough review of the new top-level section.

jsha commented 1 year ago

Okay, pushed a batch of changes. Sorry they wound up bigger than I intended; I was hoping to converge piecemeal, but wound up wanting to try and get all six places using the same language.

Could we check with Firefox and/or Curl developers to find out if there's practical knowledge here that we're lacking?

This is a good idea. I'm not likely to get it done soon, but I think it would be really informative and interesting.

To make sure I got it all right, copy-pasting from the generated docs for {Read,BufRead,Write}{Gz,MultiGz}Decoder:

bufread::GzDecoder

A decoder for the first member of a gzip file.

This structure exposes a BufRead interface, reading compressed data from the underlying reader, and emitting uncompressed data.

After reading the first member of a gzip file (which is often, but not always, the only member), this reader will return Ok(0) even if there are more bytes available in the underlying reader. If you want to be sure not to drop bytes on the floor, call into_inner() after Ok(0) to recover the underlying reader.

To handle gzip files that may have multiple members, see MultiGzDecoder or read more in the introduction.

read::GzDecoder

A decoder for the first member of a gzip file.

This structure exposes a Read interface that will consume compressed data from the underlying reader and emit uncompressed data.

After reading the first member of a gzip file (which is often, but not always, the only member), this reader will return Ok(0) even if there are more bytes available in the underlying reader. If you want to be sure not to drop bytes on the floor, call into_inner() after Ok(0) to recover the underlying reader.

To handle gzip files that may have multiple members, see MultiGzDecoder or read more in the introduction.

write::GzDecoder

A decoder for the first member of a gzip file.

This structure exposes a Write interface, receiving compressed data and writing uncompressed data to the underlying writer.

After decoding the first member of a gzip file, this writer will return XXX to all subsequent writes.

To handle gzip files that may have multiple members, see MultiGzDecoder or read more in the introduction.

bufread::MultiGzDecoder

A gzip streaming decoder that decodes a gzip file that may have multiple members.

This structure exposes a BufRead interface that will consume compressed data from the underlying reader and emit uncompressed data.

A gzip file consists of a series of members concatenated one after another. MultiGzDecoder decodes all members of a file and returns Ok(0) once the underlying reader does.

To handle members seperately, see GzDecoder or read more in the introduction.

read::MultiGzDecoder

A gzip streaming decoder that decodes a gzip file that may have multiple members.

This structure exposes a Read interface that will consume compressed data from the underlying reader and emit uncompressed data.

A gzip file consists of a series of members concatenated one after another. MultiGzDecoder decodes all members of a file and returns Ok(0) once the underlying reader does.

To handle members seperately, see GzDecoder or read more in the introduction.

write::MultiGzDecoder

A gzip streaming decoder that decodes a gzip file with multiple members.

This structure exposes a Write interface that will consume compressed data and write uncompressed data to the underlying writer.

A gzip file consists of a series of members concatenated one after another. MultiGzDecoder decodes all members of a file and writes them to the underlying writer one after another.

To handle members separately, see GzDecoder or read more in the introduction.

jongiddy commented 1 year ago

We can wordsmith endlessly, so these are just suggestions that would clarify things for me. The only one that definitely needs to be changed is the instance that currently says writer will return XXX.

Byron commented 1 year ago

Thanks a lot, @jongiddy , for pushing this PR over the finishing line.

I have applied all of your suggestions and followed up on all comments as well as I could. Given the massive size of this PR, working with it doesn't get easier and I would like to make the cut here by merging what we have, and invite those interested to double-check what's there to possibly submit follow-up PRs for fixes.

I for one learned a lot here and believe this PR represents a substantial improvement to the docs that will also help users to avoid these member-releated pitfalls (along with pitfalls related to GzDecoder reading a little bit ahead of the underlying stream so one wants to use the BufRead implementation to see these nonetheless).

rust-lang / flate2-rs

Recommend MultiGzDecoder over GzDecoder in docs #324