Open SaveTheRbtz opened 2 years ago
Hm... never mind -- seems like I've misused the API: using NewWithBoundaries
does not automatically adjust the splitmask
to be an average of the two (which is not even possible when there is no power of two between them). Calling SetAverageBits
would likely fix my problem of skewed chunk sizes.
Actually, it is not that simple. On random input it does indeed work fine, but if I use skewed input then the output chunk sizes are heavily skewed towards minSize
(regardless of the polynomial).
If I change (digest&c.splitmask) == 0
to (digest&c.splitmask) == 1
then the chunk size distribution becomes much better.
The problem also goes away if we match on the exact number of zeroes instead of greater-or-equal, e.g.:
bits.TrailingZeros64(digest) == c.averageBits
Looking closer at the values of digest
at split points, values of 0, 10000, 1000000, 100000000, 10000000000, and 1000000000000 dominate the splits when the input is a tar
file (regardless of the polynomial). This is likely due to the .tar
format having runs of zeros that are longer than the rolling hash's windowSize
(bumping it to >=512 also fixes the problem):
00000000: 7061 785f 676c 6f62 616c 5f68 6561 6465 pax_global_heade
00000010: 7200 0000 0000 0000 0000 0000 0000 0000 r...............
00000020: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000030: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000040: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000050: 0000 0000 0000 0000 0000 0000 0000 0000 ................
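The zero-run effect is easy to reproduce with a toy polynomial rolling hash (this is not the GF(2) Rabin fingerprint that restic/chunker actually uses, and the multiplier constant is arbitrary, but it shares the relevant property: feeding zero bytes into a zero state leaves the digest at zero):

```go
package main

import "fmt"

func main() {
	// Toy polynomial rolling hash: digest = digest*prime + byte.
	// Once the window contains only zeros, the digest collapses to zero
	// and stays there, so a (digest&mask == 0) split test fires at every
	// position past MinSize.
	const prime = 0x3DA3358B4DC173
	var digest uint64
	zeros := make([]byte, 64) // a window of zero bytes, as in tar padding
	for _, b := range zeros {
		digest = digest*prime + uint64(b)
	}
	fmt.Println(digest) // prints 0
}
```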
[1] I'm using the first 10 MiB of an uncompressed Linux kernel tar as input:
$ git archive --format tar -o ~/tmp/linux-v5.15.tar v5.15
Hm, interesting observation! I don't have much time right now, but I remember that somewhere in the original paper an attribute of the digest algorithm was described: if you feed it blocks of zeroes, the digest stays at zero. That leads to blocks of zeroes always yielding the smallest block size possible. For our use case (the restic backup program), with rather large min block sizes, that's good, because it deduplicates the "zero block" and just stores it once. Changing the comparison to 1 changes the behavior significantly; whether that's good or bad depends on your use case, I guess.
Is it possible to change it to either:
bits.TrailingZeros64(digest) == c.averageBits
or a gentler variant:
(digest != 0 && (digest&c.splitmask) == 0)
This should not invalidate all existing chunks, only the ones that end with a long run of zeros and hence were erroneously aggregated. IIUC restic
should not be too affected, given that its default min/max sizes are enormous: "zero-only" segments would lose deduplication only if their size falls between min and max, since ones smaller than min
are already skipped, and ones bigger than max
would be forcefully cut.
I'm fine with the proposed changes to this library, but they must stay backwards compatible: without opting in, the behavior must not change (e.g. yield different chunks).
What do you think about the following (API-compatible) change:
func New(rd io.Reader, pol Pol, opts ...Option) *Chunker
With options defined as follows:
// Option configures a Chunker.
type Option func(*Chunker)

// WithBuffer supplies a caller-owned buffer, avoiding an internal allocation.
func WithBuffer(buf []byte) Option {
	return func(c *Chunker) {
		c.buf = buf
	}
}

// WithBoundaries sets the minimum and maximum chunk sizes.
func WithBoundaries(min, max uint) Option {
	return func(c *Chunker) {
		c.MinSize = min
		c.MaxSize = max
	}
}

// WithAverageBits sets how many low bits of the digest must be zero at a
// split point, which determines the average chunk size.
func WithAverageBits(averageBits int) Option {
	return func(c *Chunker) {
		c.splitmask = (1 << uint64(averageBits)) - 1
	}
}
This would allow defining more options in the future (including one for skipping 0 digests) without the need for more NewWith*
constructors.
As a side benefit some of the methods can be simplified, e.g.:
NewWithBoundaries:
func NewWithBoundaries(rd io.Reader, pol Pol, min, max uint) *Chunker {
	return New(rd, pol, WithBoundaries(min, max))
}
Reset:
func (c *Chunker) Reset(rd io.Reader, pol Pol, opts ...Option) {
	opts = append(opts, WithBuffer(c.buf))
	*c = *New(rd, pol, opts...)
}
ResetWithBoundaries:
func (c *Chunker) ResetWithBoundaries(rd io.Reader, pol Pol, min, max uint) {
	c.Reset(rd, pol, WithBoundaries(min, max))
}
A draft: https://github.com/restic/chunker/pull/37 (all existing tests pass w/o modification)
Yeah, I'm familiar with this options pattern, and I like it :)
Thanks for working on this!
I've tried to use restic's chunker instead of FastCDC and noticed that compression ratios dropped substantially. Looking deeper into the issue, I found that most of the chunks produced were right at the lower bound of
MinSize
(which is set pretty low for my use case: 16384). I've narrowed the issue down to https://github.com/restic/chunker/blob/4b72965764af3fe1ba7d7487cbb5a3b6787197c5/chunker.go#L311
Changing the code to use a different constant (e.g. == 1) fixes the problem. So the distribution of the
digest
values is likely to blame here. The math in the chunker is above my ability to review, but I would assume that for the chunker to work reasonably well, digest
should be uniformly distributed.