Closed · micolous closed this 8 months ago
Hi! I don’t see an issue with binrw itself in this report so I’m not sure I understand what is being requested here. Could you clarify what outcome you’re looking for? Was this meant to be opened as a Q&A discussion instead of a ticket? Let me know. Thanks!
This might be a question pointing at a documentation gap, or it could be a limitation of the existing API. I'm not sure whether it's fixable without some buffering or an API change, or whether this is just a limitation a developer needs to keep in mind.
The cookbook recipes for validating checksums on read and calculating checksums on write both use the `map_stream` option. Those take a struct as a parameter which implements `std::io`'s (or an internal equivalent on `no_std`) `Read + Seek` and/or `Write + Seek` traits.
My implementation of a checksum (based on that cookbook) in my first comment follows that same pattern: whenever there's a read, call `Digest::update` on the read data, and whenever there's a write, call `Digest::update` on the written data. The `Digest::update` pattern on partial data could be used by any hashing algorithm which relies on sequential data access.
The `reader_var` test case also follows this pattern, though it only implements `Read + Seek` and uses a checksum algorithm where the order of bytes doesn't affect the result.
That pattern makes an assumption, either:

- all reads and writes are strictly sequential, or
- the checksum algorithm is insensitive to byte order (like the `SUM(bytes) as u8` one in the `reader_var` test).

I've got a highly-contrived example here. In it, I've put some seeks and backtracking in to simulate something like what `FilePtr` or `calc()` fields may do. This violates the assumption that reads and writes are sequential, so the checksums generated on the read and write paths are both wrong (and differently wrong!)
One other way around it in my highly-contrived example would be to treat the `Data::payload` as a `Vec<u8>`, load it entirely into memory, and parse it as a `Payload` in a second pass.
Thank you for your thoughts. I think there is a misunderstanding about the documentation that might be causing some confusion here. https://docs.rs/binrw is a reference guide, not a cookbook. Examples aren’t recipes; they are demonstrations to help authors understand how a feature might be used. They aren’t intended to cover every possible situation.
The implementation of the checksumming stream in the documentation is deliberately elided because the goal is to show how `map_stream` could be used, not how to write an implementation of `std::io::Read + std::io::Seek`. If the data you are parsing isn't compatible with using `map_stream` to checksum bytes at the same time they're being read, that's fine; it just means you need to do something different that matches the original implementation more closely.
Depending on what is covered by the MAC (i.e. whether whatever you're seeking to during parsing is supposed to be included), you could just track the last hashed byte position in your stream implementation and only add bytes to the hash when the position being read matches.
You could also use wrapper types like:
```rust
#[derive(BinRead)]
#[br(stream = s, map_stream = HashStream::new)]
struct Hashed<T> where T: BinRead {
    inner: T,
    #[br(calc(s.hash()))]
    computed_hash: [u8; 32],
}

#[derive(BinRead)]
struct StoredHash<T> where T: BinRead {
    stored_hash: [u8; 32],
    #[br(assert(stored_hash == value.computed_hash))]
    value: Hashed<T>,
}
```
(Use `temp`, `map`, and `parse_with` as desired to reduce the number of wrappers.)
And then avoid non-sequential reads by storing the offsets and lazy-loading, or by using some of the other helpers from the `file_ptr` module.
Or some combination of these approaches.
However, if you are trying to do a verify-then-parse, you will never be able to avoid two passes since you can’t start parsing until you have verified the message. In this case, all of this is moot.
In any case, I don't know of anything binrw can really do to make this easier, since it doesn't seem like you're describing something that can be solved more easily than it already is in the generic case. Even if there were a reasonable way of mapping a stream only for some fields and then consuming the stream to retrieve a value from it (I can't think of a way to do this that wouldn't be shitty, since whatever grammar is used needs to be compatible with normal Rust struct grammar), that still won't get you what you want when it comes to sequentially hashing bytes that aren't being read sequentially.
Let me know if this makes sense or you have any other questions. I’ll convert this to a discussion since it isn’t reporting a specific defect in binrw but is more asking a question about how to parse a particular data format which is what discussions are best for. Thanks!
I'm implementing a reader and writer for an existing file format which is laid out something like this:
Based on the `map_stream` examples (which have a hash at the end of the file), I've ended up with this:

This works, but there are edge cases with this approach:

- At `read()` time, it assumes that reads are always sequential from byte 32 onwards, and that the stream is never `seek()`ed to a position after byte 32.
- At `write()` time, it also assumes that writes are always sequential from byte 32 onwards, and that the stream is never `seek()`ed to a position after byte 32.

I think the `reader_var` test case also makes the assumption of sequential reads, though it's got the checksum field at the end.

Tangential to this, it'd be nice to be able to use this to verify a MAC, and do that verification before deserialising other data structures. However, this would require two read passes on a file, or buffering the entire file (which could be large) in memory.