niklasf / rust-huffman-compress

A Rust library for Huffman compression given a propability distribution over arbitrary symbols
https://docs.rs/huffman-compress
Apache License 2.0
24 stars 8 forks source link

Allow initialization without creating HashMap #1

Closed mernen closed 6 years ago

mernen commented 6 years ago

This patch replaces the codebook() input from HashMap to anything that implements a similar iterator. Existing invocations using HashMap remain fully compatible.

I did this for two reasons:

  1. It allows initialization using static data without having to build a HashMap (see included test)
  2. The default HashMap state builder (RandomState) is extremely dangerous if you have symbols with equal weights, as they may be visited in a different order on each execution of your program, resulting in different codes for those symbols. My persisted data was getting corrupted, and I had to switch to fnv to get deterministic results across runs.
niklasf commented 6 years ago

Thanks for the patch.

And wow, good catch about RandomState. What do you think about requiring K: Ord and using that to break ties, so it doesn't depend on the implementation details of the hash function? (Marginally related: Then we could also use BTreeMap to remove the K: Hash bound entirely ... probably I should benchmark that.)

One new corner case is now the iterator yielding duplicate keys. Last occurence wins?

mernen commented 6 years ago

Ord seems to be the best option to ensure deterministic behavior. And then I agree at this point it might make sense to switch to BTreeMap internally altogether, but yeah, it’s better to benchmark before making such changes.

As for the corner case, last occurrence does seem like a good strategy — it’s the behavior you’ll typically see when duplicate keys are found in maps (e.g. literals in scripting languages).