Re: not using b, I had another look at go4.org/rollsum, and it looks like there is a slight discrepancy between what the spec for rrs1 (and the rsync docs) says and what that package does. Specifically, the spec says s(k, l) = a(k, l) + 2^16 * b(k, l), but go4.org/rollsum seems to be computing it as 2^16 * a(k, l) + b(k, l), which seems more reasonable in terms of the consequences for the splitting algorithm. We should have another look at what bup does here and bring the spec in line with those implementations.
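To make the two compositions concrete, here's a tiny sketch (digestSpec and digestRollsum are made-up names; a is the plain windowed sum and b the weighted sum, both already reduced mod 2^16):

package main

import "fmt"

// digestSpec follows the rsync doc / current spec text: s = a + 2^16 * b.
func digestSpec(a, b uint16) uint32 {
    return uint32(a) | uint32(b)<<16
}

// digestRollsum is what go4.org/rollsum (and bup) appear to do: s = 2^16 * a + b.
func digestRollsum(a, b uint16) uint32 {
    return uint32(a)<<16 | uint32(b)
}

func main() {
    fmt.Printf("%08x vs %08x\n", digestSpec(0x1111, 0x2222), digestRollsum(0x1111, 0x2222))
}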
Can confirm bup does the same thing: https://github.com/bup/bup/blob/master/lib/bup/bupsplit.c#L80
...I also had a look at the source code for librsync, which appears to still match the docs, except that they're also using C = 31 now, whereas the rsync doc uses C = 0. So the only difference between librsync and bup/perkeep is that the high and low u16s of the digest are flipped.
I pushed a change that accounts for the differences between librsync and the other systems: rrs1 is now what perkeep and bup do, while rrs0 is what librsync does. rrs1 is more suitable for hash splitting than rrs0, but we should still do some research for a recommended hash.
I did a quick test to see what the mean block size given by rrs1 was in practice, using this program:
package main

import (
    "bufio"
    "fmt"
    "math/rand"

    "go4.org/rollsum"
)

func main() {
    r := bufio.NewReader(rand.New(rand.NewSource(31)))
    nBytes := int(1e8)
    nChunks := 0
    rs := rollsum.New()
    for i := 0; i < nBytes; i++ {
        ch, err := r.ReadByte()
        if err != nil {
            panic(err)
        }
        rs.Roll(ch)
        if rs.OnSplit() {
            nChunks++
        }
    }
    fmt.Println("nBytes = ", nBytes)
    fmt.Println("nChunks = ", nChunks)
    fmt.Println("Mean chunk size = ", nBytes/nChunks)
}
The output:
nBytes = 100000000
nChunks = 12072
Mean chunk size = 8283
(The seed was an arbitrary choice; first number that popped into my head).
...So, despite the statistical quirks @bobg has observed, it seems like at least on random data the hash still gives roughly the size that you would expect (~8KiB) from checking the bottom 13 bits.
Note that rrs was intentionally making a perf vs. reliability trade-off when it was designed, and given that the basic block size seems to be right, I'm once again skeptical that it's bad enough to outweigh the implementation burden of adding extra hash functions, but I'm still open to discussion. For the time being I'm going to focus on other parts of the spec.
Ok, I did a bit more testing and found that the block size does get wonky above 15 bits; 15 bits gets you ~32KiB chunks as expected, but 16 gives you ~53KiB (vs an expected 64KiB), 17 bits gives you ~55KiB, and after that it just stops splitting -- I tried feeding it up to a gigabyte of data and at 18 bits it just ends up as one big chunk.
Looking at the details of the hash, it makes sense that the quality of the upper 16 bits is much much worse than the lower, since the upper bits are just the sum of everything in the window (plus a constant), whereas the lower bits are the result of further processing. And it's not really surprising that the 16th bit suffers a little too, as it's probably not getting the benefit of overflows that the other bits get.
So at the very least we should document that rrs1 isn't really suitable for block sizes over 32KiB, but unless problems can be shown with the low bits I think it's fine as a default.
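A variant of the earlier program that sweeps the threshold looks roughly like this; it assumes go4.org/rollsum exposes an OnSplitWithBits helper for checking a configurable number of bits, and isn't necessarily the exact code behind the numbers above:

package main

import (
    "bufio"
    "fmt"
    "math/rand"

    "go4.org/rollsum"
)

func main() {
    nBytes := int(1e8)
    for bits := uint32(13); bits <= 18; bits++ {
        r := bufio.NewReader(rand.New(rand.NewSource(31)))
        rs := rollsum.New()
        nChunks := 0
        for i := 0; i < nBytes; i++ {
            ch, err := r.ReadByte()
            if err != nil {
                panic(err)
            }
            rs.Roll(ch)
            // Check `bits` low bits instead of the default 13.
            if rs.OnSplitWithBits(bits) {
                nChunks++
            }
        }
        if nChunks == 0 {
            fmt.Printf("bits=%d: no splits in %d bytes\n", bits, nBytes)
            continue
        }
        fmt.Printf("bits=%d: mean chunk size = %d\n", bits, nBytes/nChunks)
    }
}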
We should just remove rrs0 from the spec entirely: since it's got the bad bits in the lower position it's basically useless, and it would simplify the presentation to not parametrize rrs over the difference between the two.
Thinking about this again, actually: while the low bits are good enough for the splitting, it gets dicey when you factor in the leveling algorithm; with only 2-3 extra bits before the hash starts misbehaving on top of a baseline 13-bit threshold, the leveling algorithm is going to basically fall over for files bigger than 64KiB. So we probably should pick an alternate hash function if we want leveling to work correctly.
See also the 'weaknesses' section here: https://en.wikipedia.org/wiki/Adler-32 (note that rrs1 is closely related to Adler-32).
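To spell out the arithmetic behind the leveling concern (a rough sketch, assuming step = 1 and a 13-bit base threshold):

\[
\#\text{chunks}(F) \approx \frac{F}{2^{13}}, \qquad
\text{expected max level} \approx \log_2\!\left(\frac{F}{2^{13}}\right), \qquad
\text{bits needed at the top level} \approx 13 + \log_2\!\left(\frac{F}{2^{13}}\right).
\]

With only about 15-16 trustworthy bits, that budget is used up once F/2^13 reaches roughly 2^2 to 2^3, i.e. once F is in the 32-64KiB range.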
with only 2-3 extra bits before the hash starts misbehaving
Just so; plus, for my purposes at least, having done some measurements I'm starting to think that 14 or 15 threshold bits might be better than 13 in terms of the tradeoff between duplicate suppression and tree overhead.
Quoting Bob Glickstein (2020-08-30 23:03:48)
Just so; plus, for my purposes at least, having done some measurements I'm starting to think that 14 or 15 threshold bits might be better than 13 in terms of the tradeoff between duplicate suppression and tree overhead.
Can you share your measurements? I'd be curious as to exactly what we're looking at.
FWIW, perkeep sets a minimum block size of 64KiB, so blocks are much larger than the 8KiB that the base threshold of 13 would suggest, so there are ways of dealing with the tree overhead other than increasing the threshold.
Re: tree overhead, I wonder if it would make sense to try to increase the branching factor somehow (we could have min/max sizes for nodes and/or use more than one bit per level)? As currently described in @cole-miller's pr, the algorithm yields probabilistically-binary trees, but for persistent storage you usually want a large branching factor to reduce disk activity.
As currently described in @cole-miller's pr, the algorithm yields probabilistically-binary trees
I'm not sure this is true? With the definition(s) I've been using (based on SPLIT), nodes can have arbitrarily many children, and probabilistically you'd expect the number to average 2^S, where S is the "step" (= the increment in the number of bits we examine between adjacent levels).
I wrote that before you put the step language in there, so I was assuming step = 1. :P but yes, with the current draft the average branching factor would depend on the step.
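For concreteness: with step S = 1 the expected branching factor is 2^1 = 2, while a step of, say, S = 5 would give 2^5 = 32 children per node on average, which is more in line with the disk-friendly fan-out mentioned above.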
On the subject of testing in general, I've been wondering what sort of input is most appropriate as a proving ground for hash functions. The bytes in a random file on your hard drive are probably not IID uniformly distributed; on the other hand, well-compressed data should pretty much resemble random bytes. One model I considered, without any serious justification, is a concatenation of alternating runs of 0s and 1s, the run lengths being independent and geometrically distributed with some chosen parameter(s). A representative corpus of files that might end up in a blob store would probably make more sense.
My only other thought was that it would be worth looking at the empirical distribution of chunk sizes produced by a candidate hash function (on whatever kind of input), since this is what we ultimately care about. We want the mean/median/mode of this distribution to approximate a particular size, but also for the variance not to be too large, and especially for not too much mass to lie in the tail (representing really big chunks, which make our lives harder).
All of this sounds sensible to me. We could tweak @bobg's evaluation tool to take its input from a file and experiment with different kinds of data, and add the tests for actual sizes, as you suggest. PRs welcome.
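As a starting point, here's a rough sketch of the run-length input model described above; it works at the byte level (0x00/0xFF runs) rather than the bit level, and the parameters are arbitrary placeholders:

package main

import (
    "fmt"
    "math/rand"
)

// geometricRun draws a run length from a geometric distribution with
// success probability p (mean 1/p).
func geometricRun(rng *rand.Rand, p float64) int {
    n := 1
    for rng.Float64() >= p {
        n++
    }
    return n
}

// runLengthInput produces nBytes of alternating 0x00/0xFF runs with
// geometrically distributed run lengths -- a crude stand-in for
// low-entropy input. p0 and p1 control the mean run lengths.
func runLengthInput(rng *rand.Rand, nBytes int, p0, p1 float64) []byte {
    out := make([]byte, 0, nBytes)
    val, p := byte(0x00), p0
    for len(out) < nBytes {
        run := geometricRun(rng, p)
        for i := 0; i < run && len(out) < nBytes; i++ {
            out = append(out, val)
        }
        // Alternate between runs of 0x00 and 0xFF (and their parameters).
        if val == 0x00 {
            val, p = 0xFF, p1
        } else {
            val, p = 0x00, p0
        }
    }
    return out
}

func main() {
    rng := rand.New(rand.NewSource(31))
    fmt.Printf("% x\n", runLengthInput(rng, 32, 0.25, 0.5))
}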
Quoting Cole Miller (2020-09-01 22:39:14)
On the subject of testing in general, I've been wondering what sort of input is most appropriate as a proving ground for hash functions. The bytes in a random file on your hard drive are probably not IID uniformly distributed; on the other hand, well-compressed data should pretty much resemble random bytes. One model I considered, without any serious justification, is a concatenation of alternating runs of 0s and 1s, the run lengths being independent and geometrically distributed with some chosen parameter(s). A representative corpus of files that might end up in a blob store would probably make more sense.
My only other thought was that it would be worth looking at the empirical distribution of chunk sizes produced by a candidate hash function (on whatever kind of input), since this is what we ultimately care about. We want the mean/median/mode of this distribution to approximate a particular size, but also for the variance not to be too large, and especially for not too much mass to lie in the tail (representing really big chunks, which make our lives harder).
Something I wasn't thinking of in the previous comment: the maximum and minimum chunk sizes, S_min and S_max, are of course a way of controlling the size distribution. In particular, S_max takes care of the "tail" problem, independent of the hash function. So while it might still be worth trying to control the variance in chunk size, we already have some mitigation in place.
So, let's discuss some alternatives. Below is the output of @bobg's evaluation tool on my laptop. @bobg, I see that in your implementation you've switched over to bozo32. From the output below that seems reasonable, as it's both one of the fastest (beaten only by rrs1/rollsum) and fairly strong statistically (no unbalanced bits, no bit-pair correlations).
Buzhash (both 32 and 64) have better (closer to 50%) variance in the low bits, though none of the hashes are that close to the ideal of 50% for any of the bits, let alone most or all of them.
I could be convinced to recommend either bozo32 or buzhash. @cole-miller, @bobg, any opinions/extra insight?
Using RNG seed 1600376130
rollsum:
Elapsed time to roll/digest a megabyte of random data: 33.083602ms
Bits departing from 50% likelihood of being zero:
Bit 15 is zero 48.3% of the time
Bit 26 is zero 46.9% of the time
Bit 27 is zero 56.0% of the time
Bit 28 is zero 99.9% of the time
Bit 29 is zero 0.1% of the time
Bit 30 is zero 100.0% of the time
Bit 31 is zero 100.0% of the time
Bit-pair correlations departing from 50% likelihood:
Bit 12 == bit 15 51.2% of the time
Bit 13 == bit 15 52.4% of the time
Bit 14 == bit 15 56.1% of the time
Bit 14 == bit 25 54.2% of the time
Bit 14 == bit 27 48.4% of the time
Bit 15 == bit 25 54.3% of the time
Bit 15 == bit 26 73.3% of the time
Bit 15 == bit 27 30.4% of the time
Bit 15 == bit 28 48.3% of the time
Bit 15 == bit 29 51.7% of the time
Bit 15 == bit 30 48.3% of the time
Bit 15 == bit 31 48.3% of the time
Bit 21 == bit 27 48.9% of the time
Bit 22 == bit 26 51.1% of the time
Bit 22 == bit 27 47.9% of the time
Bit 23 == bit 26 52.3% of the time
Bit 23 == bit 27 45.7% of the time
Bit 24 == bit 26 54.8% of the time
Bit 24 == bit 27 41.4% of the time
Bit 25 == bit 26 61.5% of the time
Bit 25 == bit 27 31.8% of the time
Bit 26 == bit 27 9.0% of the time
Bit 26 == bit 28 46.9% of the time
Bit 26 == bit 29 53.1% of the time
Bit 26 == bit 30 46.9% of the time
Bit 26 == bit 31 46.9% of the time
Bit 27 == bit 28 56.1% of the time
Bit 27 == bit 29 43.9% of the time
Bit 27 == bit 30 56.0% of the time
Bit 27 == bit 31 56.0% of the time
Bit 28 == bit 29 0.0% of the time
Bit 28 == bit 30 99.9% of the time
Bit 28 == bit 31 99.9% of the time
Bit 29 == bit 30 0.1% of the time
Bit 29 == bit 31 0.1% of the time
Bit 30 == bit 31 100.0% of the time
On 1-bit input change, digest bits departing from 50% likelihood of change:
Bit 0 varied 1.6% of the time
Bit 1 varied 3.1% of the time
Bit 2 varied 4.8% of the time
Bit 3 varied 6.3% of the time
Bit 4 varied 7.6% of the time
Bit 5 varied 9.1% of the time
Bit 6 varied 11.1% of the time
Bit 7 varied 11.5% of the time
Bit 8 varied 10.0% of the time
Bit 9 varied 9.1% of the time
Bit 10 varied 6.4% of the time
Bit 11 varied 6.3% of the time
Bit 12 varied 6.2% of the time
Bit 13 varied 1.8% of the time
Bit 14 varied 1.1% of the time
Bit 15 varied 0.0% of the time
Bit 16 varied 3.1% of the time
Bit 17 varied 4.9% of the time
Bit 18 varied 4.7% of the time
Bit 19 varied 6.2% of the time
Bit 20 varied 7.6% of the time
Bit 21 varied 9.3% of the time
Bit 22 varied 4.7% of the time
Bit 23 varied 5.1% of the time
Bit 24 varied 3.5% of the time
Bit 25 varied 3.5% of the time
Bit 26 varied 0.0% of the time
Bit 27 varied 0.0% of the time
Bit 28 varied 0.0% of the time
Bit 29 varied 0.0% of the time
Bit 30 varied 0.0% of the time
Bit 31 varied 0.0% of the time
adler32:
Elapsed time to roll/digest a megabyte of random data: 39.965998ms
Bits departing from 50% likelihood of being zero:
Bit 11 is zero 48.2% of the time
Bit 12 is zero 48.2% of the time
Bit 13 is zero 51.8% of the time
Bit 14 is zero 100.0% of the time
Bit 15 is zero 100.0% of the time
Bit 31 is zero 52.0% of the time
Bit-pair correlations departing from 50% likelihood:
Bit 5 == bit 11 51.1% of the time
Bit 5 == bit 12 51.1% of the time
Bit 5 == bit 13 48.9% of the time
Bit 6 == bit 10 51.1% of the time
Bit 6 == bit 11 52.1% of the time
Bit 6 == bit 12 52.1% of the time
Bit 6 == bit 13 47.9% of the time
Bit 7 == bit 10 52.3% of the time
Bit 7 == bit 11 54.2% of the time
Bit 7 == bit 12 54.2% of the time
Bit 7 == bit 13 45.8% of the time
Bit 8 == bit 10 54.9% of the time
Bit 8 == bit 11 58.6% of the time
Bit 8 == bit 12 58.7% of the time
Bit 8 == bit 13 41.3% of the time
Bit 9 == bit 10 61.9% of the time
Bit 9 == bit 11 68.4% of the time
Bit 9 == bit 12 68.5% of the time
Bit 9 == bit 13 31.5% of the time
Bit 9 == bit 30 52.9% of the time
Bit 9 == bit 31 54.0% of the time
Bit 10 == bit 11 91.4% of the time
Bit 10 == bit 12 91.5% of the time
Bit 10 == bit 13 8.5% of the time
Bit 10 == bit 31 71.5% of the time
Bit 11 == bit 12 99.9% of the time
Bit 11 == bit 13 0.1% of the time
Bit 11 == bit 14 48.2% of the time
Bit 11 == bit 15 48.2% of the time
Bit 11 == bit 30 51.1% of the time
Bit 11 == bit 31 68.3% of the time
Bit 12 == bit 13 0.0% of the time
Bit 12 == bit 14 48.2% of the time
Bit 12 == bit 15 48.2% of the time
Bit 12 == bit 30 51.1% of the time
Bit 12 == bit 31 68.4% of the time
Bit 13 == bit 14 51.8% of the time
Bit 13 == bit 15 51.8% of the time
Bit 13 == bit 30 48.9% of the time
Bit 13 == bit 31 31.6% of the time
Bit 14 == bit 15 100.0% of the time
Bit 14 == bit 31 52.0% of the time
Bit 15 == bit 31 52.0% of the time
Bit 28 == bit 31 51.1% of the time
Bit 29 == bit 31 52.5% of the time
Bit 30 == bit 31 55.9% of the time
On 1-bit input change, digest bits departing from 50% likelihood of change:
Bit 0 varied 3.1% of the time
Bit 1 varied 4.5% of the time
Bit 2 varied 6.1% of the time
Bit 3 varied 7.5% of the time
Bit 4 varied 9.0% of the time
Bit 5 varied 10.6% of the time
Bit 6 varied 4.7% of the time
Bit 7 varied 5.8% of the time
Bit 8 varied 4.3% of the time
Bit 9 varied 4.3% of the time
Bit 10 varied 0.0% of the time
Bit 11 varied 0.0% of the time
Bit 12 varied 0.0% of the time
Bit 13 varied 0.0% of the time
Bit 14 varied 0.0% of the time
Bit 15 varied 0.0% of the time
Bit 16 varied 1.9% of the time
Bit 17 varied 3.5% of the time
Bit 18 varied 4.6% of the time
Bit 19 varied 6.2% of the time
Bit 20 varied 8.1% of the time
Bit 21 varied 9.7% of the time
Bit 22 varied 11.3% of the time
Bit 23 varied 12.8% of the time
Bit 24 varied 9.2% of the time
Bit 25 varied 8.8% of the time
Bit 26 varied 6.7% of the time
Bit 27 varied 6.7% of the time
Bit 28 varied 3.2% of the time
Bit 29 varied 2.6% of the time
Bit 30 varied 0.3% of the time
Bit 31 varied 0.3% of the time
bozo32:
Elapsed time to roll/digest a megabyte of random data: 34.135245ms
Bits departing from 50% likelihood of being zero:
Bit-pair correlations departing from 50% likelihood:
On 1-bit input change, digest bits departing from 50% likelihood of change:
Bit 0 varied 3.1% of the time
Bit 1 varied 4.9% of the time
Bit 2 varied 4.7% of the time
Bit 3 varied 6.2% of the time
Bit 4 varied 9.2% of the time
Bit 5 varied 7.8% of the time
Bit 6 varied 9.2% of the time
Bit 7 varied 10.6% of the time
Bit 8 varied 10.3% of the time
Bit 9 varied 12.1% of the time
Bit 10 varied 10.9% of the time
Bit 11 varied 12.7% of the time
Bit 12 varied 13.0% of the time
Bit 13 varied 11.9% of the time
Bit 14 varied 11.9% of the time
Bit 15 varied 11.6% of the time
Bit 16 varied 13.1% of the time
Bit 17 varied 11.2% of the time
Bit 18 varied 12.0% of the time
Bit 19 varied 11.3% of the time
Bit 20 varied 12.4% of the time
Bit 21 varied 11.8% of the time
Bit 22 varied 12.7% of the time
Bit 23 varied 12.1% of the time
Bit 24 varied 12.0% of the time
Bit 25 varied 11.7% of the time
Bit 26 varied 11.7% of the time
Bit 27 varied 11.8% of the time
Bit 28 varied 12.5% of the time
Bit 29 varied 12.3% of the time
Bit 30 varied 12.7% of the time
Bit 31 varied 12.0% of the time
buzhash32:
Elapsed time to roll/digest a megabyte of random data: 35.896563ms
Bits departing from 50% likelihood of being zero:
Bit-pair correlations departing from 50% likelihood:
On 1-bit input change, digest bits departing from 50% likelihood of change:
Bit 0 varied 11.9% of the time
Bit 1 varied 12.3% of the time
Bit 2 varied 12.7% of the time
Bit 3 varied 12.0% of the time
Bit 4 varied 13.4% of the time
Bit 5 varied 12.3% of the time
Bit 6 varied 13.3% of the time
Bit 7 varied 12.3% of the time
Bit 8 varied 13.4% of the time
Bit 9 varied 13.2% of the time
Bit 10 varied 13.1% of the time
Bit 11 varied 12.5% of the time
Bit 12 varied 12.4% of the time
Bit 13 varied 12.3% of the time
Bit 14 varied 12.5% of the time
Bit 15 varied 11.7% of the time
Bit 16 varied 12.7% of the time
Bit 17 varied 12.8% of the time
Bit 18 varied 12.9% of the time
Bit 19 varied 12.1% of the time
Bit 20 varied 12.7% of the time
Bit 21 varied 13.2% of the time
Bit 22 varied 11.8% of the time
Bit 23 varied 12.1% of the time
Bit 24 varied 12.4% of the time
Bit 25 varied 12.3% of the time
Bit 26 varied 11.9% of the time
Bit 27 varied 12.9% of the time
Bit 28 varied 11.3% of the time
Bit 29 varied 12.0% of the time
Bit 30 varied 11.5% of the time
Bit 31 varied 13.2% of the time
buzhash64:
Elapsed time to roll/digest a megabyte of random data: 37.685256ms
Bits departing from 50% likelihood of being zero:
Bit-pair correlations departing from 50% likelihood:
On 1-bit input change, digest bits departing from 50% likelihood of change:
Bit 0 varied 12.0% of the time
Bit 1 varied 12.0% of the time
Bit 2 varied 11.9% of the time
Bit 3 varied 11.9% of the time
Bit 4 varied 12.3% of the time
Bit 5 varied 11.8% of the time
Bit 6 varied 12.6% of the time
Bit 7 varied 13.1% of the time
Bit 8 varied 12.6% of the time
Bit 9 varied 13.1% of the time
Bit 10 varied 12.9% of the time
Bit 11 varied 12.2% of the time
Bit 12 varied 13.0% of the time
Bit 13 varied 11.6% of the time
Bit 14 varied 13.1% of the time
Bit 15 varied 12.0% of the time
Bit 16 varied 12.5% of the time
Bit 17 varied 12.8% of the time
Bit 18 varied 12.2% of the time
Bit 19 varied 11.9% of the time
Bit 20 varied 11.2% of the time
Bit 21 varied 12.1% of the time
Bit 22 varied 11.5% of the time
Bit 23 varied 13.1% of the time
Bit 24 varied 11.9% of the time
Bit 25 varied 12.1% of the time
Bit 26 varied 11.7% of the time
Bit 27 varied 12.3% of the time
Bit 28 varied 11.4% of the time
Bit 29 varied 12.4% of the time
Bit 30 varied 11.2% of the time
Bit 31 varied 12.7% of the time
any opinions/extra insight?
Not about the choice of algorithm to recommend. However, before settling on one there may be additional literature to survey first. I ran across these links today but haven't had time to process them yet:
Thanks for the links. I skimmed the first paper and read the beginning of the second (I hadn't previously known about restic, but it doesn't sound like they're doing anything novel there that we didn't already know about).
With the first paper, I don't buy their argument that low-entropy strings are not a problem in practice; they basically assume away the problem by suggesting that input data is random, but the pathological case that they're worried about is actually really common. E.g. Java class files start with 0xCAFEBABE, which has a large 0xFE byte in it. This kind of thing is going to be common with files identified by "magic number", since programmers are likely to bias towards using letters (large numbers). The Sandstorm package format always contains 0xEF in the first 4 bytes, and I'm sure there are many other such formats. Furthermore, their main advantage is throughput, which I don't think is enough of a concern for our use cases to be worth the risk.
The second paper sounds interesting, but it looks like their algorithm relies on querying the data store to determine if a chunk is already present whereas the existing approach doesn't require this. This strikes me as a major downside; network latency could really hurt for some use cases, and much of the point of standardizing this is to make the splits stable across systems; it would be a shame to use an algorithm where they're not necessarily even stable within the same system.
I added some tests to the evaluation tool, which now reports statistics about chunk sizes.
Per my critique of the first paper, we really should test this on non-random data ourselves.
Thinking about this again, I'm kind of leaning away from bozo32, just because using it would mean having to reverse-engineer a declarative description from the library's code; per bozo32.go:
// Package rollinghash/bozo32 is a wrong implementation of the rabinkarp
// checksum. In practice, it works very well and exhibits all the
// properties wanted from a rolling checksum, so after realising that this
// code did not implement the rabinkarp checksum as described in the
// original paper, it was renamed from rabinkarp32 to bozo32 and kept
// in this package.
I don't really want to stare at it and figure out exactly what the bug is, so that we can describe it in an algebraic way.
Any objections to standardizing on some version of buzhash32? If I don't hear in a few days I'll start banging out a description pinning down some parameters; it may or may not be exactly what the Go package uses; I'll have to look and see what makes sense.
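For reference, the core of a buzhash32-style (cyclic polynomial) update is just a table lookup plus a rotate. Here's a minimal sketch; the table contents, seed, window size, and split condition are all placeholders that the spec would need to pin down, and this is not necessarily what the Go rollinghash package's buzhash32 does:

package main

import (
    "fmt"
    "math/bits"
    "math/rand"
)

const windowSize = 64

type buzhash32 struct {
    table  [256]uint32 // per-byte random words; would be fixed constants in a spec
    window [windowSize]byte
    pos    int
    sum    uint32
}

func newBuzhash32(seed int64) *buzhash32 {
    b := &buzhash32{}
    rng := rand.New(rand.NewSource(seed))
    for i := range b.table {
        b.table[i] = rng.Uint32()
    }
    // Prime the state as if the (zero-filled) initial window had been rolled in.
    for i := 0; i < windowSize; i++ {
        b.sum = bits.RotateLeft32(b.sum, 1) ^ b.table[0]
    }
    return b
}

// Roll shifts the oldest byte out of the window and ch in:
// H' = rotl(H, 1) ^ rotl(T[out], windowSize mod 32) ^ T[in].
func (b *buzhash32) Roll(ch byte) {
    out := b.window[b.pos]
    b.window[b.pos] = ch
    b.pos = (b.pos + 1) % windowSize
    b.sum = bits.RotateLeft32(b.sum, 1) ^
        bits.RotateLeft32(b.table[out], windowSize%32) ^
        b.table[ch]
}

// OnSplit reports whether the low thresholdBits of the digest are all ones.
func (b *buzhash32) OnSplit(thresholdBits uint) bool {
    mask := uint32(1)<<thresholdBits - 1
    return b.sum&mask == mask
}

func main() {
    h := newBuzhash32(1)
    for _, ch := range []byte("hello, rolling world") {
        h.Roll(ch)
    }
    fmt.Printf("digest: %08x split(13): %v\n", h.sum, h.OnSplit(13))
}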
Any objections to standardizing on some version of buzhash32?
None - in fact I was going to make the same suggestion. :+1:
No objections from me either.
On the perkeep mailing list thread where this project got started, @bobg noted some observed statistical weaknesses in the rrs1 hash in the spec and argued for defining and recommending a stronger hash. We should agree on what that hash should be.
rrs really is quite primitive. I was staring at it a bit just now, and for the purposes of perkeep & bup it more or less simplifies to literally just adding up the bytes you're hashing (with an offset of 31), i.e. sum of the bytes in the window + (window size * C), since b(k, l) does not affect the low 16 bits of the hash, and these systems only look at the bottom 13 bits (though it may affect the leveling). I based the spec on the rsync document, and I remember looking at go4.org/rollsum and thinking it looked like it was doing the same thing, but I should do another pass and sanity check that they really are the same algorithm. If so, the package is doing useless work computing b and s at all.
I think we should still have it in the spec, since it's useful to be able to set parameters to match an existing system, but perhaps we should pin down a more reasonable hash to recommend as a default.
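Spelling out that simplification from the rsync-style definition (assuming the per-byte offset C is folded into a, as the spec does):

\[
a(k,l) \;=\; \Bigl(\sum_{i=k}^{l} (x_i + C)\Bigr) \bmod 2^{16}
       \;=\; \Bigl(\sum_{i=k}^{l} x_i \;+\; (l-k+1)\,C\Bigr) \bmod 2^{16},
\]

i.e. the sum of the bytes in the window plus the window size times the constant.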
I think my minimum requirement here is to be able to say with a straight face that if your splitting condition is "bottom N bits are zero" you'll actually get a mean size of 2^N bytes, but @bobg said he saw some correlation between pairs of bits, so it's likely that the actual block size for rrs1 is smaller than that.
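For reference, the 2^N expectation comes from modeling the N inspected digest bits as independent and uniform at every offset:

\[
P(\text{split at a given offset}) = 2^{-N}
\quad\Longrightarrow\quad
\text{gaps between splits are roughly geometric with mean } 2^{N} \text{ bytes},
\]

so N = 13 should give a mean chunk size of about 8192 bytes (the earlier rrs1 measurement of ~8283 on random data is in that ballpark); bias or correlation in the inspected bits is what would pull the real mean away from 2^N.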