restic / chunker

Implementation of Content Defined Chunking (CDC) in Go
BSD 2-Clause "Simplified" License

Chunk size distribution problem. #36

Open SaveTheRbtz opened 2 years ago

SaveTheRbtz commented 2 years ago

I've tried to use restic's chunker instead of FastCDC and noticed that compression ratios dropped substantially. Looking deeper into the issue, I found that most of the chunks produced were right at the lower bound of MinSize (which is set pretty low for my use case: 16384).

I've narrowed down the issue to https://github.com/restic/chunker/blob/4b72965764af3fe1ba7d7487cbb5a3b6787197c5/chunker.go#L311

Changing the code to use a different constant (e.g. == 1) fixes the problem, so the distribution of the digest values is likely to blame here. The math in the chunker is above my ability to review, but I would assume that for the chunker to work reasonably well the digest should be uniformly distributed.
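
For context, here is a minimal illustration of the split predicate in question; the constants are made up for the example, not the library's internals:

package main

import "fmt"

// Illustrative values; the library derives the real mask from the
// configured average chunk size.
const averageBits = 20
const splitmask uint64 = 1<<averageBits - 1

func main() {
    // The chunker cuts a chunk when the low averageBits bits of the
    // rolling digest are all zero, i.e. on *at least* averageBits
    // trailing zeros. A digest of 0 always satisfies this.
    digest := uint64(0)
    fmt.Println(digest&splitmask == 0) // true: all-zero digest always splits
    fmt.Println(digest&splitmask == 1) // false: comparing against 1 excludes it
}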

SaveTheRbtz commented 2 years ago

Hm... never mind -- seems like I've misused the API: using NewWithBoundaries does not automatically adjust the splitmask to the average of the two bounds (which is not even possible when there is no power of two between them). Calling SetAverageBits would likely fix my problem of skewed chunk sizes.
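
For anyone else who trips over this, the corrected usage looks roughly like the following; the boundaries and average bits are just my settings:

package main

import (
    "fmt"
    "io"
    "os"

    "github.com/restic/chunker"
)

func main() {
    pol, err := chunker.RandomPolynomial()
    if err != nil {
        panic(err)
    }

    // NewWithBoundaries only sets MinSize/MaxSize; the splitmask still
    // targets the package default average, so set it separately.
    c := chunker.NewWithBoundaries(os.Stdin, pol, 16384, 1<<20)
    c.SetAverageBits(17) // 2^17 bytes = 128 KiB target average

    buf := make([]byte, 1<<20)
    for {
        chunk, err := c.Next(buf)
        if err == io.EOF {
            break
        }
        if err != nil {
            panic(err)
        }
        fmt.Println(chunk.Length)
    }
}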

SaveTheRbtz commented 2 years ago

Actually, it is not that simple. On random input it does indeed work fine, but if I use a skewed input then the output chunk sizes are heavily skewed towards MinSize (regardless of the polynomial).

If I change (digest&c.splitmask) == 0 to (digest&c.splitmask) == 1, then the chunk size distribution becomes way better.

The problem also goes away if we match on an exact number of trailing zeroes instead of greater-or-equal, e.g.:

bits.TrailingZeros64(digest) == c.averageBits

Looking closer at the values of digest on split, it seems like the values 0, 10000, 1000000, 100000000, 10000000000, and 1000000000000 dominate the splits when the input is a tar file[1] (regardless of the polynomial). This is likely due to the .tar format having runs of zeros that are longer than the rolling hash's windowSize (bumping it to >=512 also fixes the problem):

00000000: 7061 785f 676c 6f62 616c 5f68 6561 6465  pax_global_heade
00000010: 7200 0000 0000 0000 0000 0000 0000 0000  r...............
00000020: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000030: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000040: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000050: 0000 0000 0000 0000 0000 0000 0000 0000  ................

[1] I'm using the first 10 MiB of an uncompressed Linux kernel tar as input:

$ git archive --format tar -o ~/tmp/linux-v5.15.tar v5.15
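
To sanity-check the zero-run theory, here is a quick sketch that finds the longest run of zero bytes in that input (path assumed from the command above):

package main

import (
    "bufio"
    "fmt"
    "io"
    "os"
)

func main() {
    f, err := os.Open("linux-v5.15.tar")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    var run, longest int
    r := bufio.NewReader(f)
    for {
        b, err := r.ReadByte()
        if err == io.EOF {
            break
        }
        if err != nil {
            panic(err)
        }
        if b == 0 {
            run++
            if run > longest {
                longest = run
            }
        } else {
            run = 0
        }
    }
    // Any run longer than the rolling hash window (64 bytes by default)
    // drives the digest to zero, making every byte in the rest of the
    // run a split candidate.
    fmt.Println("longest zero run:", longest, "bytes")
}
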
fd0 commented 2 years ago

Hm, interesting observation! I don't have much time right now, but I remember that somewhere in the original paper an attribute of the digest algorithm was described: if you feed it blocks of zeroes, the digest stays at zero. That leads to blocks of zeroes always yielding the smallest block size possible. For our use case (the restic backup program), with its rather large min block sizes, that's good, because it deduplicates the "zero block" and just stores it once. Changing the comparison to 1 changes the behavior significantly; whether that's good or bad depends on your use case, I guess.

SaveTheRbtz commented 2 years ago

Is it possible to change it to either:

bits.TrailingZeros64(digest) == c.averageBits

or a more gentle:

(digest != 0 && (digest&c.splitmask) == 0)

This should not invalidate all existing chunks, only the ones that end with a long run of zeros and hence were erroneously aggregated. IIUC restic should not be too affected, given that its default min/max sizes are enormous: "zero-only" segments fail to deduplicate only when they fall between min and max, since ones smaller than min are already skipped and ones bigger than max are forcefully cut.
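
To make the compatibility difference concrete, an illustrative comparison of the three predicates (constants made up, not the library's internals):

package main

import (
    "fmt"
    "math/bits"
)

const averageBits = 20
const splitmask uint64 = 1<<averageBits - 1

// Current behavior: split on at least averageBits trailing zeros.
func splitCurrent(d uint64) bool { return d&splitmask == 0 }

// Exact-match variant: also moves boundaries for digests with more
// than averageBits trailing zeros, not just for the all-zero digest.
func splitExact(d uint64) bool { return bits.TrailingZeros64(d) == averageBits }

// Gentle variant: agrees with the current behavior everywhere except
// the all-zero digest, so only zero-run boundaries move.
func splitGentle(d uint64) bool { return d != 0 && d&splitmask == 0 }

func main() {
    for _, d := range []uint64{0, 1 << averageBits, 1 << (averageBits + 4)} {
        fmt.Printf("digest=%#x current=%v exact=%v gentle=%v\n",
            d, splitCurrent(d), splitExact(d), splitGentle(d))
    }
}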

fd0 commented 2 years ago

I'm fine with the proposed changes to this library, but the changes must stay backwards compatible. We need to make any changes so that, without opting in, the behavior does not change (e.g. yield different chunks).

SaveTheRbtz commented 2 years ago

What do you think about the following (API-compatible) change:

func New(rd io.Reader, pol Pol, opts ...Option) *Chunker

With the options defined as follows:

type Option func(*Chunker)

func WithBuffer(buf []byte) Option {
    return func(c *Chunker) {
        c.buf = buf
    }
}

func WithBoundaries(min, max uint) Option {
    return func(c *Chunker) {
        c.MinSize = min
        c.MaxSize = max
    }
}

func WithAverageBits(averageBits int) Option {
    return func(c *Chunker) {
        c.splitmask = (1 << uint64(averageBits)) - 1
    }
}

This would allow defining more options in the future (including one for skipping zero digests) without the need for more NewWith* constructors.
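
Call sites would map one-to-one onto the current API; a sketch (the With* option names exist only in the draft):

package main

import (
    "os"

    "github.com/restic/chunker"
)

func main() {
    pol, err := chunker.RandomPolynomial()
    if err != nil {
        panic(err)
    }

    // Released API: two calls.
    a := chunker.NewWithBoundaries(os.Stdin, pol, 16384, 1<<20)
    a.SetAverageBits(17)

    // Proposed API (compiles only against the draft): one constructor,
    // with room for future options such as skipping zero digests.
    b := chunker.New(os.Stdin, pol,
        chunker.WithBoundaries(16384, 1<<20),
        chunker.WithAverageBits(17),
    )
    _, _ = a, b
}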

As a side benefit, some of the methods can be simplified, e.g. NewWithBoundaries:

func NewWithBoundaries(rd io.Reader, pol Pol, min, max uint) *Chunker {
    return New(rd, pol, WithBoundaries(min, max))
}

Reset:

func (c *Chunker) Reset(rd io.Reader, pol Pol, opts ...Option) {
    opts = append(opts, WithBuffer(c.buf))
    *c = *New(rd, pol, opts...)
}

ResetWithBoundaries:

func (c *Chunker) ResetWithBoundaries(rd io.Reader, pol Pol, min, max uint) {
    c.Reset(rd, pol, WithBoundaries(min, max))
}

A draft: https://github.com/restic/chunker/pull/37 (all existing tests pass w/o modification)

fd0 commented 2 years ago

Yeah, I'm familiar with this options pattern, and I like it :)

Thanks for working on this!