jam1garner / binrw

A Rust crate for helping parse and rebuild binary data using ✨macro magic✨.
https://binrw.rs
MIT License
620 stars 36 forks source link

Add SeekFrom::Start, SeekFrom::Current, and nested support to take_seek #292

Open ismell opened 1 month ago

ismell commented 1 month ago

This change adds support for the missing SeekFrom operations. It refactors the internals a bit. We now keep a range of start..end values that are accessible to the TakeSeek. I dropped manually computed stream position and instead call the inner's stream_position(). This does mean that limit needs to take in a &mut. I was also going to change the return type to Result<u64>, but that would force all callers to handle the Result, which I felt like it was a bigger API change.

I fixed the test added in 38b35989ffa27e8bb18cd91f69c6bcc66d2ca03e because StartFrom::Start(0) used to reset the cursor to the start of the underlying buffer.

Fixes jam1garner/binrw#291.

csnover commented 1 month ago

Hi, thanks for your work on this!

If I understand these changes, this patch looks to make TakeSeek into a stream slice/view API, which certainly makes sense as a thing that should exist, but probably not as TakeSeek.

In particular, TakeSeek exists only because std::io::Take does not implement Seek, but binrw needs that trait. As such, it is intended as a drop-in replacement without any other change. Adjusting the signature of limit indicates it’s being changed to do something it’s not intended to do, and changing the origin of the stream is also something that Take doesn’t do, so this probably shouldn’t too.

In any case, the solution for this one problem with the patch is pretty straightforward since Take is just a stream view with origin 0 and a limited size, that TakeSeek would just become a wrapper for this code, and then this code just needs to use a different name to make it clear that it’s creating a slice starting at the current position of the stream.

My experience with sub-streams where the origin matters has always been in the context of reading something like an offset table, so I would probably recommend adding a second convenience function that takes a SeekFrom in addition to the size limit so users don’t have to seek the stream before they get the sub-stream.

Let me know your thoughts. Thanks!

ismell commented 1 month ago

In any case, the solution for this one problem with the patch is pretty straightforward since Take is just a stream view with origin 0 and a limited size, that TakeSeek would just become a wrapper for this code, and then this code just needs to use a different name to make it clear that it’s creating a slice starting at the current position of the stream.

Take on its own has a very clear API. You can read from X..X+len, then read() always returns 0. Seek also has a very clear API where you can position the cursor from an offset. When you combine the two though it is a slice/view API. If the seek method doesn't properly take into account the bounds, then the API gets very confusing.

Take the following example:

#[test]
fn test_stream_len() {
    let mut buf = [0; 8];

    let mut data = Cursor::new("\x00\x01\x02\x03\x04\x05\x06\x07\x08");
    data.seek(SeekFrom::Start(1)).unwrap();

    let mut section = data.take_seek(4);
    assert_eq!(section.stream_len().unwrap(), 4); // <--- Performs X = Current(0), End(0), Start(X).
    assert_eq!(inner_section.read(&mut buf).unwrap(), 4);
    assert_eq!(&buf, b"\x01\x02\x03\x04\x00\x00\x00\x00");
}

Before the fix in 38b35989ffa27e8bb18cd91f69c6bcc66d2ca03e, the test would fail because the returned stream_len() would be 9. The test still fails (without this PR) because the Start(0) call isn't handled in seek(), so the read() call starts reading from the beginning of data.

I think if you want to combine Take + Seek, then we need to handle the different SeekFrom parameters so that the API can't be misused.


As such, it is intended as a drop-in replacement without any other change. Adjusting the signature of limit indicates it’s being changed to do something it’s not intended to do, and changing the origin of the stream is also something that Take doesn’t do, so this probably shouldn’t too.

Unfortunately it stems from https://github.com/rust-lang/rust/issues/59359. stream_position should have probably been defined as fn stream_position(&self) -> Result<u64> and used the ftello(3) syscall instead of lseek(Current(0)).

My experience with sub-streams where the origin matters has always been in the context of reading something like an offset table

In my case, I'm reading a GPT partition table, and I want to parse the contents in each partition. My partition parsers need to seek within the partition using Start and End offsets.

, so I would probably recommend adding a second convenience function that takes a SeekFrom in addition to the size limit so users don’t have to seek the stream before they get the sub-stream.

I think that's a good idea.

Maybe fn take_seek_from(mut self, pos: SeekFrom, limit: u64) -> TakeSeek<Self>?


Thanks for the review!

csnover commented 1 month ago

Hi,

Take on its own has a very clear API. You can read from X..X+len, then read() always returns 0.

I’m trying not to be pedantically irritating here, so apologies in advance if it comes across that way, it is not my intent. A user can call set_limit to reset the end at any time. If TakeSeek were implemented in the way proposed in this patch, that would mean that set_limit would also need to change the origin to whatever is the current position of the stream, because the Rust documentation says calling set_limit “is the same as constructing a new Take instance”. This seems undesirable for hopefully obvious reasons.

After writing this, I realise that the fix that I committed is wrong, since the end of TakeSeek might be beyond the actual end of the parent stream, and so SeekFrom::End needs to use the minimum of the Take end and the actual stream end. :face_exhaling:

Seek also has a very clear API where you can position the cursor from an offset. When you combine the two though it is a slice/view API. If the seek method doesn't properly take into account the bounds, then the API gets very confusing.

In the same way that offsets recorded within a file may be thought of as absolute or relative, I’m not sure there is any definitively correct non-confusing choice. It would also be confusing to see some error about failing to parse at byte 64, when binrw is actually trying to parse byte 60064 of the original stream, just because the user was trying to limit how many bytes to read.

data.seek(SeekFrom::Start(1)).unwrap();
let mut section = data.take_seek(4);
assert_eq!(section.stream_len().unwrap(), 4); // <--- Performs X = Current(0), End(0), Start(X).

At the risk of going on ad nauseam, the design intent of TakeSeek is just to make it possible for a truncating read to occur so someone can e.g. use binrw::helpers::until_eof to read a limited list of objects from a stream. In this example, stream_len should return 5, because the only thing TakeSeek does is truncate the stream.

Given the problems with conflicts in the std::io::Take limit function definition, set_limit semantics, and potential confusion caused by adjusting the stream origin, I would prefer not to have TakeSeek do this, and instead give sub-stream functionality a different name such that the semantics can be fully defined by binrw and add whatever additional traits or helper methods would be useful for such functionality. The documentation for TakeSeek can certainly also be improved to clarify that it is a truncating reader, not a thing that creates sub-streams with their own local origins, to reduce confusion in this regard.

Hopefully this makes sense, but please let me know if there is some other thing I am not thinking about.

Maybe fn take_seek_from(mut self, pos: SeekFrom, limit: u64) -> TakeSeek<Self>?

Sure, something with a _from suffix seems reasonable to me for this.

Thanks,