roccodev / bdat-rs

A library to read and write BDAT tables from all Xenoblade games
Apache License 2.0

declarative parsing and writing #9

ScanMountGoat opened this issue 8 months ago

ScanMountGoat commented 8 months ago

I've been experimenting with making the reading and writing code for modern bdat files more self-documenting using binrw. I've linked a gist with code using binrw 0.13.3 that can read and rebuild all bdat files in Xenoblade 3 1:1 with the originals. This code doesn't include converting to and from the ModernTable type, since it looks like you're reworking that on a separate branch. It should also be doable to store enum values instead of bytes for each row and still rebuild the data correctly. https://gist.github.com/ScanMountGoat/a406b36a8d103eed035663ec42c3b9f5
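
For reference, here's a minimal sketch of the binrw style. The header fields below are illustrative only; the gist has the actual layout.

use binrw::{binrw, BinRead, BinWrite};
use std::io::Cursor;

#[binrw]
#[brw(magic = b"BDAT")]
#[derive(Debug, PartialEq)]
struct Bdat {
    table_count: u32,
    file_size: u32,
    // Reading eagerly into owned Vecs means the struct can write
    // itself back out without referencing the source buffer.
    #[br(count = table_count)]
    table_offsets: Vec<u32>,
}

fn roundtrip(bytes: &[u8]) -> binrw::BinResult<Vec<u8>> {
    let bdat = Bdat::read_le(&mut Cursor::new(bytes))?;
    let mut out = Cursor::new(Vec::new());
    bdat.write_le(&mut out)?;
    Ok(out.into_inner())
}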

The implementation is a lot shorter, but the main advantage in my opinion is making the code easier to reason about. The basic process for reading would be to parse the bdat data into an owned Bdat; writing this back to the file without any changes should produce identical bytes. Converting the Bdat to a friendlier representation like ModernTable can be done with from_ and to_ conversion functions. Splitting the read/write logic from the conversion logic also means the conversions can be tested by comparing Bdat structs instead of diffing binary files.
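
A round-trip test under that split could look something like this, where from_bdat and to_bdat are placeholder names for the conversions:

#[test]
fn modern_table_roundtrip() {
    let bytes = std::fs::read("test.bdat").unwrap();
    let bdat = Bdat::read_le(&mut Cursor::new(&bytes)).unwrap();
    // Compare parsed structs rather than raw bytes, so a failure points
    // at the conversion logic instead of the binary layout.
    let table = ModernTable::from_bdat(&bdat);
    assert_eq!(bdat, table.to_bdat());
}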

If we want the types to be an implementation detail, we could just have functions that take a reader or bytes and return Vec<ModernTable>, Vec<LegacyTable>, or an enum with both. This would avoid the current complexity on the user's side of dealing with opaque readers or slices that will probably get converted to tables immediately anyway. The bdat files are tiny, so the performance cost of parsing and converting all tables is low.

ScanMountGoat commented 8 months ago

This doesn't account for endianness and the naming could be better, but this is the general idea of what I think user code could look like.

// Parse either format, then branch on whichever was found.
let bdat = bdat::from_bytes(bytes)?;
match &bdat {
    bdat::Bdat::Legacy(tables) => (),
    bdat::Bdat::Modern(tables) => (),
}
bdat.write(writer)?;

let legacy_tables = bdat::legacy::from_bytes(bytes)?;
legacy_tables.write(writer)?;

let modern_tables = bdat::modern::from_bytes(bytes)?;
modern_tables.write(writer)?;

roccodev commented 8 months ago

I'll have to experiment with it, especially for legacy formats. Thanks!

roccodev commented 7 months ago

If I understand correctly, it seems that this implementation could replace the current io::legacy and io::modern modules. After 0.5 is released, I'll check whether this is feasible for legacy formats.

Right now, legacy formats are the bulk of the library's complexity (both parsing and type-wise), which is why I wanted to separate the two formats as much as possible in 0.5.

ScanMountGoat commented 7 months ago

I would personally define separate types for the legacy and modern bdats. It looks like the fields and types may be slightly different based on the docs. The modern binrw code could at least be a starting point for both versions. Deriving the reading and writing logic and using owned instead of borrowed data should at least reduce a lot of the code complexity in io::legacy and io::modern.
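
For illustration, something along these lines, with the field lists made up for the example:

// Hypothetical fields; the point is that neither format's type
// has to model the other's quirks.
#[binrw]
struct LegacyTable {
    column_count: u16,
    // ... legacy-only fields (scrambled text section, etc.)
}

#[binrw]
struct ModernTable {
    column_count: u32,
    // ... modern-only fields (row hash table, etc.)
}

enum Bdat {
    Legacy(Vec<LegacyTable>),
    Modern(Vec<ModernTable>),
}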

roccodev commented 7 months ago

Yeah, that was my intention. I'm a bit puzzled as to why you would use owned data, though. Currently, lifetimes in the public API can be elided in function signatures, so they only complicate the API in struct fields. They give users the flexibility to either keep the borrow or clone the data to extend its lifetime, which is noticeable when transcoding large message BDATs to JSON or CSV. Admittedly, it's not as beneficial for legacy formats, as you need to clone the data anyway if you don't own a mutable source (because the text is originally scrambled).
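
Concretely, this is the kind of flexibility I mean, with a simplified Cell type:

use std::borrow::Cow;

// Text can point straight into the source buffer.
struct Cell<'b> {
    text: Cow<'b, str>,
}

impl Cell<'_> {
    // Callers that need the data to outlive the buffer pay for the
    // copy only when they ask for it.
    fn into_owned(self) -> Cell<'static> {
        Cell {
            text: Cow::Owned(self.text.into_owned()),
        }
    }
}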

ScanMountGoat commented 7 months ago

How expensive is cloning the entire bdat for the largest file?

roccodev commented 7 months ago

Not much; from very rudimentary tests it seems to be a couple hundred ms at most for a single-threaded extract of the entire XC3 2.2.0 dump. I admit I might have given the performance factor too much focus in my initial analysis.

Still, accepting borrowed data is consistent with the data model of other libraries (most notably serde). Obviously I'm not aiming for total zero-copy, as endianness may vary and values are not properly aligned, but being able to avoid copies for strings is quite nice. In the future, I might also import the hash table from the BDAT source directly for modern tables (up to endianness) if mutable access isn't required (with clone-on-write semantics), so that's another thing that could be borrowed from the data.
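
Roughly like this, with Cow providing the clone-on-write behaviour (the field and method names are made up):

use std::borrow::Cow;

struct ModernTable<'b> {
    // Borrowed straight from the BDAT source until a mutation happens.
    hash_table: Cow<'b, [u8]>,
}

impl ModernTable<'_> {
    fn hash_table_mut(&mut self) -> &mut Vec<u8> {
        // to_mut() copies the borrowed bytes only on first mutable access.
        self.hash_table.to_mut()
    }
}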

I might also experiment with a custom Serializer/Deserializer for BDAT, so you can model BDAT rows as structs using serde. Some types (i/u64, f64, etc.) may be unsupported and some might alias, so I need to see how best to integrate that with the BDAT type system.
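
User code could then look something like this (bdat::serde::from_table is a hypothetical entry point):

use serde::Deserialize;

#[derive(Deserialize)]
struct Enemy {
    name: String,
    hp: u32,
    // i/u64 or f64 columns may need special handling or be rejected.
}

// let enemies: Vec<Enemy> = bdat::serde::from_table(&table)?;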