jam1garner / binrw

A Rust crate for helping parse and rebuild binary data using ✨macro magic✨.
https://binrw.rs
MIT License

Using map_stream for checksums assumes that reads and writes are sequential #235

Closed · micolous closed this issue 8 months ago

micolous commented 8 months ago

I'm implementing a reader and writer for an existing file format which is laid out something like this:

struct Data {
  checksum: [u8; 32], // SHA-256 of `payload`, stored at the start of the file
  payload: Vec<u8>, // actually a different structure, but I've made it `Vec<u8>` here for simplicity
}

Based on the map_stream examples (which have a hash at the end of the file), I've ended up with this:

use binrw::{binrw, helpers::until_eof};
use sha2::{Digest, Sha256};
use std::io::{Read, Seek, SeekFrom, Write};

#[binrw]
#[derive(Debug, Default, Clone)]
#[brw(big, stream = r, map_stream = Checksum::new)]
pub struct Data {
    #[brw(pad_before = 32)]
    #[br(parse_with = until_eof)]
    pub payload: Vec<u8>,

    #[brw(seek_before = SeekFrom::Start(0))]
    #[br(temp, assert(checksum == r.check(), "bad checksum: {:x?} != {:x?}", checksum, r.check()))]
    #[bw(calc(r.check()))]
    pub checksum: [u8; 32],
}

struct Checksum<T> {
    inner: T,
    sha256: Sha256,
    p: u64,
}

impl<T> Checksum<T> {
    fn new(inner: T) -> Self {
        Self {
            inner,
            sha256: Sha256::new(),
            p: 0,
        }
    }

    fn check(&self) -> [u8; 32] {
        self.sha256.clone().finalize().into()
    }
}

impl<T: Read> Read for Checksum<T> {
    fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
        let size = self.inner.read(buf)?;
        if self.p < 32 {
            // Don't hash the first 32 bytes of the file.
            // Position of byte 32 in this buffer
            let x = (32 - self.p) as usize;

            if x < size {
                // Byte 32 is in this buffer
                self.sha256.update(&buf[x..size]); // clamp to bytes actually read
            }
        } else {
            // We're past the first 32 bytes of the file.
            self.sha256.update(&buf[0..size]);
        }
        self.p += size as u64;
        Ok(size)
    }
}

impl<T: Seek> Seek for Checksum<T> {
    fn seek(&mut self, pos: SeekFrom) -> std::io::Result<u64> {
        self.p = self.inner.seek(pos)?;
        Ok(self.p)
    }
}

impl<T: Write> Write for Checksum<T> {
    fn write(&mut self, buf: &[u8]) -> std::io::Result<usize> {
        let size = self.inner.write(buf)?;
        if self.p < 32 {
            // Don't hash the first 32 bytes of the file
            // Position of byte 32 in this buffer
            let x = (32 - self.p) as usize;

            if x < size {
                // Byte 32 is in this buffer
                self.sha256.update(&buf[x..size]);
            }
        } else {
            // We're past the first 32 bytes of the file.
            self.sha256.update(&buf[..size]);
        }
        self.p += size as u64;
        Ok(size)
    }

    fn flush(&mut self) -> std::io::Result<()> {
        self.inner.flush()
    }
}

This works, but there are edge cases with this approach: it assumes that reads and writes are strictly sequential.

I think the reader_var test case also makes the assumption of sequential reads, though it has the checksum field at the end.

Tangential to this, it'd be nice to be able to use this to verify a MAC, and to do that verification before deserialising other data structures. However, this would require two read passes over the file, or buffering the entire file (which could be large) in memory.

csnover commented 8 months ago

Hi! I don’t see an issue with binrw itself in this report so I’m not sure I understand what is being requested here. Could you clarify what outcome you’re looking for? Was this meant to be opened as a Q&A discussion instead of a ticket? Let me know. Thanks!

micolous commented 8 months ago

This might be a question pointing at a documentation gap, or it could be a limitation of the existing API. I'm not sure that it's fixable without some buffering or an API change, or whether this is just a limitation a developer needs to keep in mind.

The cookbook examples for validating checksums on read and calculating checksums on write both use the map_stream option. They take a type which wraps the stream and implements std::io's Read + Seek and/or Write + Seek traits (or binrw's internal equivalents on no_std).

My implementation of a checksum (based on that cookbook) in my first comment follows that same pattern: whenever there's a read, call Digest::update on that read data, and whenever there's a write, call Digest::update on the written data. The Digest::update pattern on partial data could be used by any hashing algorithm which relies on sequential data access.

The reader_var test case also follows this pattern, though it only implements Read + Seek and uses a checksum algorithm where the order of bytes doesn't affect the result.

That pattern assumes that reads and writes are sequential: each byte is seen by the stream wrapper exactly once, in order.

I've got a highly-contrived example here.

In it, I've added some seeks and backtracking to simulate something like what FilePtr or calc() fields may do. This violates the assumption that reads and writes are sequential, so the checksums generated on the read and write paths are both wrong (and differently wrong!).

One other way around it in my highly-contrived example would be to treat Data::payload as a Vec<u8>, load it entirely into memory, and parse it as a Payload in a second pass.
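For example, a minimal sketch of that second-pass approach (Payload and read_verified here are stand-ins I've made up for illustration, using the sha2 crate):

use binrw::{BinRead, helpers::until_eof, io::Cursor};
use sha2::{Digest, Sha256};

#[derive(BinRead)]
#[br(big)]
struct Raw {
    checksum: [u8; 32],
    #[br(parse_with = until_eof)]
    payload: Vec<u8>,
}

// Stand-in for the real payload structure.
#[derive(BinRead)]
#[br(big)]
struct Payload {
    version: u32,
}

fn read_verified(bytes: &[u8]) -> binrw::BinResult<Payload> {
    // Pass 1: one sequential read of the whole file, then verify.
    let raw = Raw::read(&mut Cursor::new(bytes))?;
    let digest: [u8; 32] = Sha256::digest(&raw.payload).into();
    assert_eq!(digest, raw.checksum, "bad checksum");
    // Pass 2: parse the verified bytes; seeks here can't corrupt the hash.
    Payload::read(&mut Cursor::new(&raw.payload))
}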

csnover commented 8 months ago

Thank you for your thoughts. I think there is a misunderstanding about the documentation that might be causing some confusion here. https://docs.rs/binrw is a reference guide, not a cookbook. Examples aren’t recipes; they are demonstrations to help authors understand how a feature might be used. They aren’t intended to cover every possible situation.

The implementation of the checksumming stream in the documentation is deliberately elided because the goal is to show how map_stream could be used, not how to write an implementation of std::io::Read + std::io::Seek. If the data you are parsing isn’t compatible with using map_stream to checksum bytes at the same time they’re being read, that’s fine; it just means you need to do something different that matches the original implementation more closely.

Depending on what is covered by the MAC (i.e. whether whatever you’re seeking to during parsing is supposed to be included), you could just track the last hashed byte position in your stream implementation and only add bytes to the hash when the position being read matches, as in the sketch below.
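Here’s a rough sketch of that idea (HashStream is an illustrative name, not binrw API; SHA-256 via the sha2 crate). Reads that land anywhere other than the next unhashed offset simply aren’t hashed, so seek-backs and re-reads don’t corrupt the digest:

use sha2::{Digest, Sha256};
use std::io::{Read, Result, Seek, SeekFrom};

struct HashStream<T> {
    inner: T,
    sha256: Sha256,
    pos: u64,       // current stream position
    hashed_to: u64, // every byte before this offset has been hashed once
}

impl<T> HashStream<T> {
    fn new(inner: T) -> Self {
        Self { inner, sha256: Sha256::new(), pos: 0, hashed_to: 0 }
    }

    fn hash(&self) -> [u8; 32] {
        self.sha256.clone().finalize().into()
    }
}

impl<T: Read> Read for HashStream<T> {
    fn read(&mut self, buf: &mut [u8]) -> Result<usize> {
        let size = self.inner.read(buf)?;
        // Only hash reads that start exactly where hashing left off;
        // reads that merely overlap the hashed region are skipped here
        // for simplicity.
        if self.pos == self.hashed_to {
            self.sha256.update(&buf[..size]);
            self.hashed_to += size as u64;
        }
        self.pos += size as u64;
        Ok(size)
    }
}

impl<T: Seek> Seek for HashStream<T> {
    fn seek(&mut self, pos: SeekFrom) -> Result<u64> {
        self.pos = self.inner.seek(pos)?;
        Ok(self.pos)
    }
}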

You could also use wrapper types like:

#[derive(BinRead)]
#[br(stream = s, map_stream = HashStream::new)]
struct Hashed<T> where T: BinRead {
  inner: T,
  #[br(calc(s.hash()))]
  computed_hash: [u8; 32]
}

#[derive(BinRead)]
struct StoredHash<T> where T: BinRead {
  stored_hash: [u8; 32],
  #[br(assert(stored_hash == value.computed_hash))]
  value: Hashed<T>,
}

(Use temp, map, and parse_with as desired to reduce the number of wrappers.)
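For illustration, usage might look like this (assuming the generic bounds work out and the elided HashStream exposes a hash() method; Payload is a stand-in for your own type):

use binrw::{BinRead, io::Cursor};

fn parse(bytes: &[u8]) -> binrw::BinResult<Payload> {
  // stored_hash is read first; Hashed<Payload> then digests exactly the
  // bytes consumed while parsing the payload, and the assert compares them.
  let file = StoredHash::<Payload>::read_be(&mut Cursor::new(bytes))?;
  Ok(file.value.inner)
}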

And then avoid non-sequential reads by storing the offsets and lazy-loading or using some of the other helpers from the file_ptr module.
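For example, something like this hypothetical layout keeps the stream sequential during the hashing pass and resolves the pointer afterwards:

use binrw::BinRead;

#[derive(BinRead)]
#[br(big)]
struct Entry {
    // Keep the raw offset instead of following it mid-parse.
    name_offset: u32,
    name_len: u32,
}

// Resolve lazily, after hashing/verification has finished.
fn entry_name<'a>(bytes: &'a [u8], e: &Entry) -> &'a [u8] {
    &bytes[e.name_offset as usize..(e.name_offset + e.name_len) as usize]
}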

Or some combination of these approaches.

However, if you are trying to do a verify-then-parse, you will never be able to avoid two passes since you can’t start parsing until you have verified the message. In this case, all of this is moot.

In any case, I don’t know of anything binrw can really do to make this easier, since it doesn’t seem like you’re describing something that can be solved more easily than it already is in the generic case. Even if there were a reasonable way of mapping a stream only for some fields and then consuming the stream to retrieve a value from it (I can’t think of a way to do this that wouldn’t be shitty, since whatever grammar is used needs to be compatible with a normal Rust struct grammar), that still won’t get you what you want when it comes to sequentially hashing bytes that aren’t being read sequentially.

Let me know if this makes sense or you have any other questions. I’ll convert this to a discussion since it isn’t reporting a specific defect in binrw but is more asking a question about how to parse a particular data format which is what discussions are best for. Thanks!