jam1garner / binrw

A Rust crate for helping parse and rebuild binary data using ✨macro magic✨.
https://binrw.rs
MIT License

Ability to calculate serialized size when writing - Feature request #269

Open apps4uco opened 1 month ago

apps4uco commented 1 month ago

Hi, firstly, thanks for the crate; it is the most ergonomic solution I have found for reading and writing binary data.

In my use case I need to write data in variable-size structures, and later in the file format there are other structures with pointers/offsets back to that earlier data.

One option, at least for streams that implement Seek, is to query the stream for the current offset using stream_position(); however, that seems like it might not be very efficient, and the result also has to be stored somewhere.
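For example, a small helper along these lines (hypothetical, not binrw API) captures the offset, but the caller still has to keep it around to fill in pointers later:

use std::io::{Seek, Write};

// Hypothetical helper for illustration only: record the offset of a block
// immediately before writing it, using the stream as the source of truth.
fn write_block<W: Write + Seek>(writer: &mut W, block: &[u8]) -> std::io::Result<u64> {
    let offset = writer.stream_position()?; // where this block starts
    writer.write_all(block)?;
    Ok(offset) // the caller has to store this to patch pointers later
}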

Currently I calculate the positions of the data by manually implementing the following trait and keeping a running total of the bytes written (or about to be written).

/// Number of bytes `self` will occupy once serialized.
pub trait SerializedSize {
    fn serialized_size(&self) -> usize;
}

However, it is error-prone, as one has to manually sum up the sizes of the data structures, taking into account optional fields, variable-length data structures (i.e. Vec), and magic number sizes.
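For a hypothetical struct with a magic number, an optional field, and a variable-length payload, the manual implementation looks something like this, and every term has to be kept in sync with the actual layout by hand:

// Hypothetical struct, for illustration only.
struct Record {
    flags: u16,
    checksum: Option<u32>,
    payload: Vec<u8>,
}

impl SerializedSize for Record {
    fn serialized_size(&self) -> usize {
        4                                    // 4-byte magic number
            + 2                              // flags: u16
            + self.checksum.map_or(0, |_| 4) // optional u32
            + self.payload.len()             // variable-length payload
    }
}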

It would be really useful if the macros could implement such functionality automatically. If this is already possible and I missed it in the documentation, maybe it could be put in a more prominent place.

I realize it may be complicated to calculate the size with nested structures and enums, so possibly one could use a marker attribute to indicate when this functionality is required, and error out at compile time if the data structure is too complex.

Thanks in advance

csnover commented 3 weeks ago

Hi, thanks for your report!

The longstanding general issue for handling offset writing is #4, along with a request for easier position handling in the macros at #161. This question has also come up before, e.g. in #264. I have updated that first ticket's title to hopefully be clearer, since I see a lot of reports about it this week :-)

All streams used with binrw have to implement Seek anyway, and the stream position is already queried frequently for error handling, so I wouldn't worry about the efficiency of asking the stream for the position unless you are measuring it and it is a problem. I am not sure it would ever actually be faster to calculate sizes separately than to use a stream wrapper that updates a single u64, since such a wrapper is doing almost no work (but I have not measured anything and have been wrong before).
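To illustrate, a minimal sketch of such a wrapper (not something binrw ships, just the idea):

use std::io::{Seek, SeekFrom, Write};

// Forwards all I/O to an inner stream while mirroring the position in a
// plain u64, so asking "where am I?" never has to touch the OS.
struct TrackingWriter<W> {
    inner: W,
    pos: u64,
}

impl<W> TrackingWriter<W> {
    fn new(inner: W) -> Self {
        Self { inner, pos: 0 }
    }

    fn position(&self) -> u64 {
        self.pos
    }
}

impl<W: Write> Write for TrackingWriter<W> {
    fn write(&mut self, buf: &[u8]) -> std::io::Result<usize> {
        let written = self.inner.write(buf)?;
        self.pos += written as u64;
        Ok(written)
    }

    fn flush(&mut self) -> std::io::Result<()> {
        self.inner.flush()
    }
}

impl<W: Seek> Seek for TrackingWriter<W> {
    fn seek(&mut self, pos: SeekFrom) -> std::io::Result<u64> {
        self.pos = self.inner.seek(pos)?;
        Ok(self.pos)
    }
}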

The best case I can argue for such a feature would be that it could enable single-pass writes in some cases where it would otherwise be necessary to emit placeholder values and then seek back to fill them in later, but even then there are additional considerations.

To avoid bloating codegen times this should probably be an opt-in feature, and even then I am unsure the benefits outweigh the limitations. As you note, anything that does something non-trivial (e.g. write_with) would not be able to calculate the size in advance, and then you are back to having to query the stream again anyway.

I'm a big fan of generalisable solutions to problems until specialisation is needed, and it is not 100% clear to me right now that doing more work in binrw in order to avoid performance issues with stream position tracking is warranted. Let me know if you have some data on this, though! Otherwise, certainly the issue of needing to write offsets is known and tracked in #4.
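For context, the placeholder-and-seek-back pattern I mean looks roughly like this generic sketch:

use std::io::{Seek, SeekFrom, Write};

// Write a dummy offset, remember where it lives, write the real data,
// then seek back and overwrite the dummy with the actual position.
fn write_with_backpatch<W: Write + Seek>(w: &mut W, payload: &[u8]) -> std::io::Result<()> {
    let patch_at = w.stream_position()?;            // where the offset field lives
    w.write_all(&0u32.to_le_bytes())?;              // placeholder offset
    let data_at = w.stream_position()?;             // where the data starts
    w.write_all(payload)?;
    let end = w.stream_position()?;
    w.seek(SeekFrom::Start(patch_at))?;             // go back...
    w.write_all(&(data_at as u32).to_le_bytes())?;  // ...fill in the real offset
    w.seek(SeekFrom::Start(end))?;                  // restore the write position
    Ok(())
}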

Thanks,

apps4uco commented 3 weeks ago

I understand your reasoning, and I'm not sure how I missed the other issues.

Just to clarify my intended use case: I wanted to decouple generating the structs from writing them; however, there are offsets in some structs that reference other structs, and they get written to different parts of the file, i.e. it's not a case of sequential writes.

One of the workarounds I considered was to append to vectors in memory and then write them in the correct places in the output file; however, this means I have both the data structures and the serialized output in memory. The second one, which is what I opted to implement, is to write a separate file for each stream and then concatenate the files. At least then I only have the data structures in memory. However, I don't consider the solution elegant, as I write to disk and then read back from disk.
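For reference, the in-memory variant of the first workaround would look something like this (SectionA/SectionB are placeholders for the real format's structs):

use std::io::Cursor;
use binrw::BinWrite;

// Placeholder types for illustration; the real format's structs would
// derive BinWrite the same way.
#[derive(BinWrite)]
struct SectionA {
    payload: Vec<u8>,
}

#[derive(BinWrite)]
struct SectionB {
    offset_to_a: u32, // where the matching SectionA lands in the file
}

fn build_file(a: &SectionA, b: &SectionB) -> binrw::BinResult<Vec<u8>> {
    // Serialize each logical stream into its own in-memory buffer...
    let mut buf_a = Cursor::new(Vec::new());
    a.write_le(&mut buf_a)?;
    let mut buf_b = Cursor::new(Vec::new());
    b.write_le(&mut buf_b)?;

    // ...then concatenate the buffers in file order, with no disk round-trip.
    let mut out = buf_b.into_inner();
    out.extend_from_slice(&buf_a.into_inner());
    Ok(out)
}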

The solution I wanted to implement was to create an iterator over data struct A, use scan to build up the offsets, then zip those with another iterator over struct B (which has the offsets to struct A), and finally chain them together to write the final file.
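Roughly like this, reusing the placeholder SectionA/SectionB from above and the SerializedSize trait from my first comment:

// Assume SectionA gains an impl of the SerializedSize trait from above.
impl SerializedSize for SectionA {
    fn serialized_size(&self) -> usize {
        self.payload.len()
    }
}

// scan keeps a running offset: each A yields the offset it will start at,
// and those offsets flow into the Bs that point back to them.
fn build_pointers(sections_a: &[SectionA], data_start: u64) -> Vec<SectionB> {
    sections_a
        .iter()
        .scan(data_start, |pos, a| {
            let offset = *pos;
            *pos += a.serialized_size() as u64;
            Some(offset)
        })
        .map(|offset| SectionB { offset_to_a: offset as u32 })
        .collect()
}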

Feel free to close this issue and track it in one of the others you referenced.

Thanks once again for the excellent crate.