paolobarbolini / bzip2-rs

Pure Rust bzip2 decoder
Apache License 2.0
44 stars 6 forks source link

Multistream decoding support #5

Open paolobarbolini opened 3 years ago

paolobarbolini commented 3 years ago

Support decoding multistream bzip2 files. Originally posted by @ygyzys at https://github.com/alexcrichton/bzip2-rs/issues/75

paolobarbolini commented 3 years ago

@paolobarbolini Hello, thanks for reply. I want to add it to bzip2-rs, but it seems to be a breaking change. In my view, we should remove ReadState::Eof entirely, since Decoder can never know that the end of file happened (it can only know that the end of block happened). Instead, DecoderReader should check for the end of file in the original stream (which it doesn't do now, and actually it seems to me that it can get stuck in infinite loop of feeding zero bytes to Decoder on some corrupted files which contain the header but end abruptly).

In any case, I'm more interested in pure Rust crate, so I suggest to create an issue there and continue our discussion of this there if possible.

And btw, I found a place where it is handled in the reference implementation: https://gitlab.com/federicomenaquintero/bzip2/-/blob/master/bzip2.c#L455 If you change this line to unconditional break;, bzip2 starts to give identical result to these crates.

- https://github.com/alexcrichton/bzip2-rs/issues/75#issuecomment-826299753

Thanks for the detailed answer. I have a refactor that improves how Decoder interacts with Block. After landing that this feature should become easier to implement, so by that point we should be able to come-up with a good implementation for this.

ygyzys commented 3 years ago

OK, I'll probably do a quick patch of DecoderReader anyway for my little project, and then we'll see how to merge it with that refactoring.

After thinking a little more I'm also not sure whether we should remove Eof from Decoder interface entirely or leave it messaging about the end of block to the user. The problem is, I'm not sure what is a supposed use case for Decoder at all - is it needed in the public interface? Of course, the user may want to tamper with the buffer size of 1024, but this particular problem can be resolved by adding a setting to DecoderReader, akin to setvbuf in stdio.h library.