ronomon / deduplication

Fast multi-threaded content-dependent chunking deduplication for Buffers in C++ with a reference implementation in JavaScript. Ships with extensive tests, a fuzz test, and a benchmark.

distinguish between forced and natural cut points #1

Closed: NodeGuy closed this issue 7 years ago.

NodeGuy commented 7 years ago

Fantastic open source contribution, thank you!

I'm confused about the purpose of flags. Perhaps I'm missing something obvious, but would you mind explaining why the last chunk in a non-last source buffer isn't hashed and reported?

https://github.com/ronomon/deduplication/blob/47e077cdb39437a4672935b2564881490f7bd95c/binding.js#L192

NodeGuy commented 7 years ago

Never mind, I figured it out.

NodeGuy commented 7 years ago

I'm back again. I see a potential rare bug.

https://github.com/ronomon/deduplication/blob/47e077cdb39437a4672935b2564881490f7bd95c/binding.js#L115-L119

If hash & mask2 is 0 right at the end of a non-last source buffer, then the cut point won't be reported. It wouldn't be the end of the world, but it's incorrect behavior.

jorangreef commented 7 years ago

Thanks @NodeGuy, that's a great find!

The purpose of flags is to differentiate between EOF and non-EOF source buffers so that we can support streaming. If we only have a few bytes remaining in a non-EOF source buffer, then we want to avoid emitting an artificial cut-point simply because we hit the end of the source buffer.
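To make the streaming case concrete, here is a rough sketch (with hypothetical names, not the actual binding.js code) of the calling pattern the flag enables:

```javascript
// Rough conceptual sketch only: findCutPoint() and the names below are
// hypothetical, not the module's API.
function chunkSourceBuffer(source, isLastBuffer, findCutPoint) {
  const chunks = [];
  let offset = 0;
  while (offset < source.length) {
    // findCutPoint() returns a natural (content-defined) cut point, or
    // source.length if no natural cut point was found before the end.
    const end = findCutPoint(source, offset);
    if (end === source.length && !isLastBuffer) {
      // Don't emit an artificial cut point just because we ran out of data in
      // a non-EOF buffer: stop and wait for more bytes instead. Note that a
      // natural cut point which happens to land exactly on source.length is
      // indistinguishable here, which is the subtlety discussed below.
      break;
    }
    chunks.push(source.slice(offset, end));
    offset = end;
  }
  // Bytes from offset onward are carried over into the next source buffer.
  return { chunks, carried: source.slice(offset) };
}
```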

As you discovered, the problem is that we don't differentiate between forced cut-points and natural cut-points which by chance happen to fall on the maximum boundary. We would then not emit a real natural cut-point, and only emit it on the next call to deduplicate(). This would cause slightly more work than necessary, and could even lead to an infinite loop if the following assertion were ever removed and if sourceSize were ever equal to maximum:

https://github.com/ronomon/deduplication/blob/49042376ceab58054c3ea5266f5f6dc5477142f2/binding.js#L163-L165

My first attempt to fix this was to break when minimum or fewer bytes are available in the non-EOF source buffer. However, this would then emit chunks prematurely when the end of the non-EOF source buffer is reached.

I think my second attempt handles all the edge cases correctly now.

What do you think?

jorangreef commented 7 years ago

My third attempt cleans it up by moving the break-point decision into cut() where there is more context available.

NodeGuy commented 7 years ago

Sorry I didn't respond earlier, life got in the way.

Right off the bat I see a problem: it looks like if this line is executed, we will be in an infinite loop

https://github.com/ronomon/deduplication/blob/d3dbb9512efbc27d801a3d636ce6a235be5fd3a0/binding.js#L102

because of this

https://github.com/ronomon/deduplication/blob/d3dbb9512efbc27d801a3d636ce6a235be5fd3a0/binding.js#L194

As I dug into the code further I consulted the FastCDC paper and found myself wanting the implementation to match the paper more closely, to ease debugging and to provide greater assurance of correctness.

Would you be willing to drop the JavaScript reference implementation so that we can use 64 bit integers?

jorangreef commented 7 years ago

Right off the bat I see a problem

No, it's not an infinite loop. Those two lines work together. If the source buffer is not the last buffer in the stream, then we need to stop processing the source buffer as soon as we know that we don't have enough data to find a decent cut point (i.e. the remaining bytes are less than the minimum invariant). There's no case where a chunk returned by cut() can be 0 in size, so we use 0 to signal that we should read more into the source buffer rather than return a cut point. We could have used -1 as a return value to indicate the same condition (see the previous commit), but that complicated the C++ implementation and required casting etc. It's safer sticking with a single unsigned integer and making the break-point decision in cut(), which has more context.
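To make the contract concrete, a minimal sketch of the calling side (hypothetical names; the real deduplicate() loop differs in the details) would look something like this:

```javascript
// Minimal sketch of the calling side (hypothetical names, not the real
// deduplicate() internals): a 0 returned by cut() means "read more data into
// the source buffer", never "emit an empty chunk".
function drain(source, sourceSize, isEOF, cut, emit) {
  let offset = 0;
  while (offset < sourceSize) {
    const chunkSize = cut(source, offset, sourceSize - offset, isEOF);
    if (chunkSize === 0) {
      // Fewer than the minimum bytes remain in a non-EOF buffer: stop here and
      // let the caller refill the source buffer before calling cut() again.
      break;
    }
    emit(source.slice(offset, offset + chunkSize));
    offset += chunkSize;
  }
  return offset; // bytes consumed; the remainder carries over to the next call
}
```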

As I dug into the code further I consulted the FastCDC paper and found myself wanting the implementation to match the paper more closely, to ease debugging and to provide greater assurance of correctness.

The implementation matches the paper except where indicated and I think those variations are for the better. They're not really algorithmic changes so much as model parameter changes in any event, and the FastCDC paper actually provides a range of parameters to choose from (e.g. NC1 vs NC2; we pick NC1 for a better ratio given an average chunk size of 64 KB). cut() also provides far more in the way of boundary checks and assertions, which are not covered in the FastCDC paper; of course, the paper is not expected to get into implementation details. FastCDC also does not provide any reference implementation for streaming, which is probably the only difference between cut() and the pseudocode in the paper.

There isn't enough pseudocode in the paper for implementing it exactly to give me much assurance of correctness. That's what the fuzz test and benchmark are for.

Would you be willing to drop the JavaScript reference implementation so that we can use 64 bit integers?

It's great to have a reference implementation in JavaScript, especially for browser usage (to speed up and resume user uploads). The JavaScript reference implementation is also critical for having two independent implementations, which we can fuzz, compare byte-for-byte, and assert against each other. I would encourage you to go through the fuzz test (run node test.js from within the module directory) to see how we test both implementations against each other. We test things like whether the source buffer remained untouched, etc. I think the coverage is almost extreme, but if there is anything it's not already testing, please let me know. We have been using the module in production for several weeks now and it's been working really well.

NodeGuy commented 7 years ago

https://github.com/ronomon/deduplication/blob/d3dbb9512efbc27d801a3d636ce6a235be5fd3a0/binding.js#L194

You're right; my tired mind rewrote break as continue last night for some reason, and I saw an infinite loop that doesn't exist.

The implementation matches the paper except where indicated and I think those variations are for the better.

OK

It's great to have a reference implementation in JavaScript, especially for browser usage (to speed up and resume user uploads).

That's cool!

jasonwatkinspdx commented 3 years ago

The implementation matches the paper except where indicated and I think those variations are for the better. They're not really algorithmic changes so much as model parameter changes in any event ...

Hi, I just came across this while researching the differences between the paper and the Rust implementation that is a port of this one. I believe the change you made to the shift operator is more substantial than the quote above suggests.

A basic property of this category of algorithms is that the fingerprint at a given byte of input depends only on the window of previous bytes. In Gear/FastCDC this is ensured by using a left shift and addition in the inner loop. This means it's not possible for byte N of the input to have any impact on the least significant bit of the fingerprint for byte N + 1 onward. Once sufficient shifts have been done, any potential residue of byte N is gone, i.e. at byte N + 32 or N + 64. On the other hand, using a right shift means that a residue from the very first calculation persists all the way to the last byte.

Did you do an empirical evaluation before and after these changes, particularly for the cases of insertions or deletions? I suspect you've impacted the deduplication ratios with this change in such cases. I wouldn't take it for granted that the behavior in the paper's evaluations still holds with this change. Am I missing something that mitigates this concern?

Additionally, I'm a bit baffled by the need to avoid the modulus operation. It's a modulo by a power of two, which is a simple mask operation and should have negligible performance impact.
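For reference, the identity being pointed to here, taking a 31-bit fingerprint as an example, is just:

```javascript
// Reducing modulo a power of two is the same as masking with (2^k - 1):
const x = 0xdeadbeef;
console.log(x % 0x80000000 === (x & 0x7fffffff)); // true (k = 31)
```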

jorangreef commented 3 years ago

A basic property of this category of algorithms is that the fingerprint at a given byte of input depends only on the window of previous bytes.

Sure, and I might be wrong, but as far as I understand left and right shifts... the sliding window is a fixed N bits, you add entropy in, and you shift a bit out. Whether the bit is shifted out the right side of the window, or out the left side of the window, it's shifted out. The remaining difference between a left and a right shift then comes down to this: a left shift will increase the size of your integer, requiring a mod to keep your sliding window fixed at N bits, whereas if your sliding window is already N bits, then a right shift simply discards the right-most bit, avoiding a mod entirely (even a cheap power-of-two modulo).

I think this is how the theory has it, but if you're not sure, you could always take this implementation, replace the right shift with a left shift plus modulo, and compare.
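One quick way to run that comparison is a script along these lines (the random table and the 31-bit mask are placeholders, not the module's actual constants; the mask in the right-shift variant is only there to keep this toy bounded):

```javascript
// Compare the influence of a single early byte under the two inner-loop
// variants. The table and the 31-bit mask are placeholders, not the module's
// actual constants.
const TABLE = new Uint32Array(256);
for (let i = 0; i < 256; i++) TABLE[i] = Math.floor(Math.random() * 0x80000000);

function rollLeft(bytes) {
  // Left shift plus mask: the mask keeps the hash within a fixed 31 bits.
  const out = [];
  let hash = 0;
  for (const byte of bytes) {
    hash = ((hash << 1) + TABLE[byte]) & 0x7fffffff;
    out.push(hash);
  }
  return out;
}

function rollRight(bytes) {
  // Logical right shift: the lowest bit is discarded each step.
  const out = [];
  let hash = 0;
  for (const byte of bytes) {
    hash = ((hash >>> 1) + TABLE[byte]) & 0x7fffffff;
    out.push(hash);
  }
  return out;
}

// Index of the last position at which the two hash sequences still differ.
function lastDifference(a, b) {
  for (let i = a.length - 1; i >= 0; i--) if (a[i] !== b[i]) return i;
  return -1;
}

const bytesA = Array.from({ length: 256 }, () => Math.floor(Math.random() * 256));
const bytesB = bytesA.slice();
bytesB[0] ^= 0xff; // the two inputs differ only in their very first byte

console.log('left shift, last differing index: ', lastDifference(rollLeft(bytesA), rollLeft(bytesB)));
console.log('right shift, last differing index:', lastDifference(rollRight(bytesA), rollRight(bytesB)));
```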

Did you do an empirical evaluation before and after these changes, particularly for the cases of insertions or deletions?

Yes. The benchmark measures ratio.

Additionally I'm a bit baffled at needing to avoid the modulus operation. It's modulo a power of two which is a simple mask operation, which should have negligible performance impact.

The implementation is designed not only for fast C performance, but also for fast JavaScript performance. With JavaScript, every operation you don't need to do makes a difference, even a cheap power-of-two mask. The other thing to remember is that (as far as I understand, and I might be totally outdated here) JavaScript engines use integer boxing, so a 31-bit integer is considerably more efficient than a 32-bit integer, because the 32nd bit is reserved as a flag to indicate a larger boxed integer. With the left shift, I think it's possible that the integer would need to be re-boxed as it crosses the 31-bit threshold, and the mod would bring it back to 31 bits, but this re-boxing would probably not be optimized away. Who knows... I think my measurements at the time showed a difference, but it would be awesome if you want to tinker and see if you can beat the JavaScript version!

jasonwatkinspdx commented 3 years ago

Whether the bit is shifted out the right side of the window, or out the left side of the window, it's shifted out.

This is a misunderstanding. They aren't symmetric, because carry bits only propagate upward. This means the left-shifted prior value cannot affect the least significant bit, as explained above. With a right shift, the residue stays within the calculation indefinitely (in fact the math is very similar to an IIR filter). Look at equation 2 in the paper for Rabin, and note that the calculation of a fingerprint at a given byte only depends on previous bytes within the window alpha. Then look at figure 2 and note that the Gear hash has the same property by virtue of this left shift.

Perhaps your changes work for your purposes, but this is not the algorithm in the paper. The change is material in a way that invalidates all the assumptions the paper makes.

I looked at the tests, but I didn't see one that specifically tests the boundary-shift problem, which is the motivation for using this algorithm in the first place.

Doing all the mathematics modulo the fingerprint size is not a bug. It's a deliberate feature of this algorithm. In JS the mask is very likely free vs memory/disk bandwidth. Likewise, BigInt has been a thing for a while now, no?

jorangreef commented 3 years ago

Are you suggesting that a bitwise right shift does not shift bits to the right?

jorangreef commented 3 years ago

I think you're also thinking of an arithmetic right shift, whereas the code is doing a logical right shift (we're using an unsigned type, not a signed type; right-shifting a negative signed value would technically be implementation-defined, even though in practice compilers do an arithmetic right shift for signed types):

https://en.wikipedia.org/wiki/Bitwise_operations_in_C#Right_shift_%3E%3E

The Wikipedia article shows the binary bit representations and how the bits are moved to the right by the shift, and shifted out.

jorangreef commented 3 years ago

In JS the mask is very likely free vs memory/disk bandwidth.

There's a huge CPU cost to chunk cut-point detection (and crypto), which these days is in the same order of magnitude as memory bandwidth (6 GB/s per core) and SSD sequential write bandwidth (500 MB/s). Rabin fingerprinting is seriously expensive, which is why FastCDC is such a breakthrough, and part of the reason for that is fewer operations. Again, I measured real performance improvements for our additional optimizations on top of FastCDC (especially in the browser with JS, because of boxing/unboxing), but you're welcome to run your own benchmarks.

There's no theoretical need for the left shift and extra modulo: the window is a bit window whether you shift left or right, provided the shifts are logical, and left vs right is not essential to the FastCDC paper, provided your mask padding takes account of the direction you're using. But if you feel more comfortable adding it back in to replace the right shift, then go for it.

jorangreef commented 3 years ago

Likewise, BigInt has been a thing for a while now, no?

Yes, but have you ever benchmarked JavaScript's BigInts?

jorangreef commented 3 years ago

With a right shift, the residue stays within the calculation indefinitely (in fact the math is very similar to an IIR filter).

I don't think this is correct. A logical right shift simply shifts the bits in the fixed-size bit window to the right and out. Don't take my word for it: write a few print statements to print the binary bit representation of a random 32-bit unsigned integer and then see the effect of a left or right logical shift (in JavaScript you would use the >>> 1 operator for a logical right shift by one bit, and you can use (integer).toString(2) to print the binary bit representation). The Wikipedia article I linked to also explains this and shows how the binary bit representations are affected.
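For example, a few lines along those lines in Node.js:

```javascript
// Print the bit patterns before and after a one-bit shift: >>> is a logical
// right shift (fills with 0), while >> is an arithmetic right shift (copies
// the top bit). The pattern below is just an arbitrary 32-bit example.
const bits = (n) => (n >>> 0).toString(2).padStart(32, '0');
const x = 0b10110011100011110000111110000011; // top bit set
console.log(bits(x));       // 10110011100011110000111110000011
console.log(bits(x >>> 1)); // 01011001110001111000011111000001
console.log(bits(x >> 1));  // 11011001110001111000011111000001
```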

Since we reverse the direction of the sliding-window shift, we also reverse the padding in the FastCDC masks. There are several variants of FastCDC, even in the paper, and our goal here is to have a fast implementation in JavaScript that runs in browsers, as well as in native C. All of these optimizations are already documented in the README: https://github.com/ronomon/deduplication#content-dependent-chunking