ronomon / deduplication

Fast multi-threaded content-dependent chunking deduplication for Buffers in C++ with a reference implementation in JavaScript. Ships with extensive tests, a fuzz test, and a benchmark.

Observations and Recommendations using this chunker. #8

Open dbaarda opened 3 years ago

dbaarda commented 3 years ago

This is not so much a bug as general observations and recommendations about using this chunker. I'm putting this here since this chunker is widely copied by chunker implementations in other languages, and people copying it probably also want to know this. I'm not sure where else this kind of info should go. Apologies if this is annoying. This comes out of some testing and analysis I did of chunker algorithms here:

https://github.com/dbaarda/rollsum-chunking/blob/master/RESULTS.rst

Observations related to this FastCDC implementation are:

  1. The average argument is not really the average block length. The actual average is very close to minimum + average/2 (for reasons explained below).
  2. This implementation changes the FastCDC "transition point" to max(0, average - 1.5*minimum). This means that the transition point is less than minimum unless minimum < 0.4*average.
  3. FastCDC gets most of its speed advantages from "cut-point-skipping" past minimum, and recommends minimum = 0.5*average or even minimum=average. This means for the FastCDC recommended minimum settings that give good speed, the first mask is never used, and only the second mask is used.
  4. When only the second mask is used, FastCDC's "normalized chunking" feature is not used. Since this implementation uses "NC1" with only one bit of normalization, it behaves like a normal exponential chunker with average set to half the value. An exponential chunker's actual average block length (excluding truncation from maximum) is minimum + average, hence this chunker's actual average is minimum + average/2, provided maximum is set large enough to minimize truncations; a quick simulation of this is sketched after this list.
  5. Not using "normalized chunking" is probably a good thing anyway, since my testing suggests it is worse than simple exponential chunking when you compare using the same actual average and minimum chunk sizes.
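
To make observation 4 concrete, here is a small Node.js sketch. It does not use the module's API: the hash is modelled as an independent per-byte coin flip with the second mask's cut probability, so the numbers are statistical rather than byte-exact. It estimates the mean chunk length and compares it to minimum + average/2:

```javascript
// Geometric random variable: number of bytes scanned past `minimum` before
// the mask first matches, assuming each byte matches independently with
// probability p.
function geometric(p) {
  return Math.floor(Math.log(1 - Math.random()) / Math.log(1 - p));
}

function simulateMeanChunkLength(minimum, average, maximum, chunks) {
  // With NC1 and minimum at or past the transition point, only the second
  // (easier) mask is used, so the per-byte cut probability is about 2 / average.
  const p = 2 / average;
  let total = 0;
  for (let i = 0; i < chunks; i++) {
    total += Math.min(maximum, minimum + geometric(p));
  }
  return total / chunks;
}

const minimum = 32768;  // 32 KiB
const average = 65536;  // 64 KiB: the `average` argument, not the real mean
const maximum = 262144; // 256 KiB: large enough that truncation is rare

// Observation 2: the transition point is below minimum here, so the first
// (harder) mask never applies and the chunker is purely exponential.
console.log('transition point:', Math.max(0, average - 1.5 * minimum));

console.log('simulated mean:', Math.round(simulateMeanChunkLength(minimum, average, maximum, 100000)));
console.log('minimum + average/2:', minimum + average / 2);
```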

With all of that in mind, based on my testing, the best minimum/average/maximum settings for this chunker for optimal deduplication with good speed are minimum=average/2 and maximum>2.5*average, which gives an actual average chunk length of average. For faster chunking with slightly reduced deduplication, use minimum=average and maximum>3*average, which gives an actual average chunk length of 1.5*average. Unfortunately this chunker throws an error if minimum >= average, so it doesn't let you do that; however, you can get away with setting minimum = average - 1.
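
For reference, here is a hypothetical helper (not part of the module, and ignoring any additional bounds the module may enforce on these arguments) that turns a desired actual average chunk length into minimum/average/maximum arguments following the recommendations above:

```javascript
// Hypothetical helper (not part of the module) that derives minimum/average/
// maximum arguments from a desired actual average chunk length, following the
// recommendations above. It ignores any extra bounds the module may enforce.
function chunkerSettings(actualAverage, faster) {
  if (faster) {
    // Faster chunking, slightly reduced deduplication: the actual average is
    // roughly 1.5 * average, and minimum must stay below average because the
    // module rejects minimum >= average.
    const average = Math.round(actualAverage / 1.5);
    return { minimum: average - 1, average: average, maximum: 3 * average + 1 };
  }
  // Best deduplication with good speed: minimum = average / 2 gives an actual
  // average chunk length of `average`.
  const average = actualAverage;
  return { minimum: Math.floor(average / 2), average: average, maximum: Math.ceil(2.5 * average) + 1 };
}

console.log(chunkerSettings(65536, false)); // optimal deduplication settings
console.log(chunkerSettings(65536, true));  // faster settings
```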