Magic parsing performance

johanholmerin commented 3 months ago

I'm using binrw to parse a format with an enum with around 300 variants(all using magic directives) and with files that can contain up to 40 000 items. The performance when parsing files weren't what I expected, even if I add return_unexpected_error, although that did help. I did some profiling and saw that there's a lot of allocations happening as part of the magic parsing due to the boxed value in the generated error. As an experiment I tried replacing the BadMagic error returned by the magic function with NoVariantMatch(which only contains the position) and saw a +70% decrease in parse time. Considering that the value isn't even used when return_unexpected_error is enabled this seems like a good opportunity for optimization. It may be possible to improve for return_all_errors as well since the possible data types for magic are quite small.

I've created a repo with a simple reproduction. If you replace BadMagic with NoVariantMatch in binrw you should see the difference in performance.

I'm would also be interested if there's any performance to be gained by not re-reading the magic value for each variant. In my case I'm parsing from in-memory, but there's a lot of reading and seeking being done that's unnecessary. It would require a larger change so I've not done any testing but might be worth looking into.

csnover commented 2 months ago

Sorry about the extended delay in this reply, and thank you for profiling and providing a good test case!

While it is possible for magic to be a byte string of arbitrary length, this is unusual, so it does seem reasonable to me at a minimum to use some small inline allocation strategy for this error type instead of always boxing the result.

There is an optimisation for magic that exists in codegen, but only for unit enum. Without thinking too hard, it makes sense to me to do something similar here.

A workaround for now to improve performance could be to use pre_assert instead of magic for each variant, and pass the magic as an argument instead. e.g.:

#[binread]
#[br(import(magic: u16))]
enum Record {
  #[br(pre_assert(magic == 0))]
  Record000,
  #[br(pre_assert(magic == 1))]
  Record001,
  /* … */
}

#[binread]
struct MagicRecord {
  magic: u16,
  #[br(args(magic))]
  record: Record,
}

This is less pleasant, but is at least something you can do right now to avoid both the allocations and the re-reads of the magic.

johanholmerin commented 2 months ago

No worries. I've been able to solve most of the performance problem I had, since as opposed to the test case the variants in my data aren't evenly distributed so I could put the most common ones first(as noted in the docs). There's still some unnecessary overhead but much less than before.

However, having thought some more about this the root of the problem is a mismatch in how I would want/thought magic would work. Currently the parsing will loop over each variant, read the magic value, and if it doesn't match(or if there's some other error parsing the variant), then it will move on to the next variant, until all have failed. Only then will the enum parsing fail. What I would want is instead that the magic is read once, the correct variant is looked up(or an error is returned if there's no match), and then is parsed. I don't know how feasible that would be to implement but would be a big improvement for me. It would also solve a more immediate problem I have which is very long compile times and AFAICT is caused by the large amount of code( and closures) generated by the magic directive.

Apart from this issue binrw has been great really great to work with, thanks for working on it

csnover commented 2 months ago

Sure, I understand the confusion. In binrw, it isn’t necessary for magic to always be the same type (or even to exist at all, for fallbacks). Doing what you expect of reading the magic value once and then matching a variant would be a very desirable optimisation for the case where there are multiple variants that use the same type of magic. This optimisation exists for reading unit enum already (I actually forgot that it was only implemented for those, so was initially a little surprised about this ticket).

binrw codegen certainly should be capable of supporting more optimisations like these without too much trouble, it is simply a matter of having someone motivated to complete the work.

johanholmerin commented 2 months ago

I made an initial attempt at optimizing data enums. There's some errors that would need to be fixed and it needs to cleaned up. It also doesn't support return_all_errors at the moment(not sure how to do that). I did however test it on my project and it works and fixes the performance problem I had completely. Unfortunately the long compile times remain.

jam1garner / binrw

Magic parsing performance #253