ronomon / deduplication

Fast multi-threaded content-dependent chunking deduplication for Buffers in C++ with a reference implementation in Javascript. Ships with extensive tests, a fuzz test and a benchmark.
MIT License
71 stars 9 forks source link

Shift-Right in Gear rollsum messes with resynchronisation after changes. #7

Closed dbaarda closed 3 years ago

dbaarda commented 3 years ago

This implementation of FastCDC has changed the Gear rollsum to use a right-shift instead of a left-shift. This is done so that the least-significant bits of the hash include data from all 32 previous bytes in the rolling window, so that "zero-padding" of the hash judgement mask is not required.

However, this is problematic because it means bytes don't fully "expire" from the hash, so it no longer only includes the last 32 bytes, and bytes from the very start of the block can modify the hash. This is because the addition in the Gear rollsum also propagates some bit changes upwards, so the shift-right might never fully expire a byte out of the hash. In some ways this improves the entropy/distribution of the hash, but it also means a change early in a chunk can change the break point at the end, messing with resynchronization.

One way to address this would be to replace the addition with an xor, so that bit changes don't propagate upwards. This would make it like a simplified buzzhash. However, this would further weaken the hash, and Gear is already very weak.

The number of bits in the Gear rollsum directly limits how many bytes are included in the sliding window. Reducing the number of bits from 32 to 31 directly reduces the sliding window size. The Gear rollsum is already a not particularly strong rollsum, so this could matter. In languages that don't have uint32 types this might be unavoidable, but copying this in languages without that limitation is unnecessarily limiting the rollsum strength.

This implementation has been widely copied by other implementations, so this problem affects at least the following implementations as well;

For the record, FastCDC's zero-padding solution when using a left-shift is also IMHO sub-optimal. We know that the most significant bits include the most information about the past 32 bytes, so the best solution is simply use the N most significant bits for a 2^N expected size. This can be done with a mask to check the upper bits are zero as in the FastCDC paper like this;

  MASK = (-1 << (32-N))

  if (!(hash & MASK))
    <it's a chunk boundary>

But a shift-right is probably just as fast and an even simpler way to do the same thing;

  SHIFT = 32 - N

  if !(hash >> SHIFT):
     <it's a chunk boundary>

However, just as fast but even more flexible is to check if the hash value is smaller than a threshold. This allows arbitrary expected size values, not just powers-of-2. It uses all the bits in the hash and treats the most significant bits as most important. This matches nicely with the characteristics of the Gear rollsum;

  THRESHOLD = 2^32 / tgt_size

  if (hash < THRESHOLD):
     <it's a chunk boundary>

Note the above suggestions assume uint32 support. They will need obvious modifications if using a 31 bit hash.

dbaarda commented 3 years ago

After thinking about this a bit more, older values will expire from the hash in a probabilistic way when using a right-shift. This means the effective window size is variable, but the probabilities of it getting large are actually small. There could still be degenerate cases that make it very large, but on average, assuming random behavior, it should be OK.

Using a right-shift, the hash becomes the most significant 32 bits of a very large sum of values, where the oldest byte is adding a 32 bit number to the sum. Assuming random numbers, there is a 50% chance that this propagates a bit change to to the 33rd bit, 25% change to the 34th bit, etc. So the chance that it propagates a bit change into the most significant 32 bits used for the hash is 1/2^(len-32), where len is the number of bytes included in the sum. So there is a less than 0.4% chance of the effective sliding window being larger than 40 bytes.

So I think using a right-shift is probably OK, and the effective increase in the sliding window size probably helps more by strengthening the hash more than the problems caused by the low chance of it getting too big. It's probably worth testing.