pascalkuthe / imara-diff

Reliably performant diffing
Apache License 2.0

Opting out of token interning #9

Closed osiewicz closed 10 months ago

osiewicz commented 10 months ago

Hey, thanks for making this crate. :) Would it be possible to call diff without having to intern the input first? In my use case (character-wise diffing) interning doesn't seem necessary, as chars should be as cheap to compare as Tokens (which are just u32s under the hood), and interning has its own non-trivial cost.

Thanks!

pascalkuthe commented 10 months ago

You can use diff_with_tokens to provide arbitrary tokens with your own interning (or no interning at all). Tokens are just u32 newtype wrappers (with a pub inner field), so you can easily provide your own tokens (for example, by just converting chars to u32). However, you will need to allocate vectors, since diffing fundamentally needs random access.

Note that the histogram algorithm fundamentally requires interning (even for char diffs, since it needs a low-cardinality input set), but for char diffs I would expect Myers to produce more appropriate results anyway. If you only use the Myers diff, then the value of num_tokens doesn't matter (it's not used), but the correct value would probably be u32::MAX (or the maximum Unicode code point plus 1).

If you can guarantee that your input is only ASCII (or some other reasonable Unicode subset), then you could also use histogram and pass 128 for num_tokens.
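
A rough sketch of the char-to-token conversion described above, in case it helps. The `Token` struct here is a local stand-in that mirrors the crate's u32 newtype so the snippet compiles on its own; in real code you would use the crate's type instead. The helper names `char_tokens` and `num_tokens_for` are made up for illustration and are not part of imara-diff.

```rust
/// Local stand-in for the crate's `Token` (a `u32` newtype with a pub inner field).
/// Assumption: swap this for the crate's own type in real code.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Token(pub u32);

/// Hypothetical helper: turn each char into a token using its Unicode scalar
/// value, with no interner involved. The Vec gives the diff random access.
pub fn char_tokens(text: &str) -> Vec<Token> {
    text.chars().map(|c| Token(c as u32)).collect()
}

/// Hypothetical helper: pick a `num_tokens` value for the two inputs.
/// 128 if both are pure ASCII (small enough for histogram), otherwise the
/// maximum Unicode code point plus 1 (only sensible with Myers, which
/// ignores the value anyway).
pub fn num_tokens_for(before: &str, after: &str) -> u32 {
    if before.is_ascii() && after.is_ascii() {
        128
    } else {
        char::MAX as u32 + 1
    }
}

// The resulting vectors and count would then be handed to diff_with_tokens
// together with the chosen algorithm and a sink; check the crate docs for
// the exact argument order, which is not spelled out in this thread.
```

The two token vectors have to be built up front because the diff indexes into them; the conversion itself is a single pass over each string.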