Closed: bchapuis closed this issue 3 years ago.
Today, some operations are performed inline (without function calls) for performance reasons. What is the impact of function calls in this situation?
When I implemented the chunker last year, it made a significant difference (I started out implementing it this way), but we can try again with a recent version of Go. I don't think I have the benchmarks or the code anywhere any more, but we can easily change things, run the benchmarks, and use benchcmp to compare the results.
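For reference, here is a minimal sketch (with hypothetical names, not the actual test code of this repository) of how such a throughput benchmark is typically structured with the standard testing package; b.SetBytes is what produces the MB/s column in the output below:

```go
package chunker_test

import (
	"math/rand"
	"testing"
)

// chunkAll is a stand-in for the code under test: in the real benchmark it
// would run the chunker over data (optionally hashing each chunk with
// SHA-256 or MD5) and return the number of chunks found.
func chunkAll(data []byte) int {
	return len(data) / (1 << 20) // placeholder: pretend chunks are 1 MiB
}

// BenchmarkChunkerSketch shows where the MB/s column comes from:
// b.SetBytes declares how many bytes one iteration processes, and the
// testing package derives the throughput from that.
func BenchmarkChunkerSketch(b *testing.B) {
	data := make([]byte, 32*1024*1024)
	rand.Read(data)

	b.SetBytes(int64(len(data)))
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if chunkAll(data) == 0 {
			b.Fatal("no chunks found")
		}
	}
}
```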
Here are the results of the benchmark for the master branch:
PASS
BenchmarkChunkerWithSHA256 10 122541194 ns/op 85.57 MB/s
--- BENCH: BenchmarkChunkerWithSHA256
chunker_test.go:274: 6 chunks, average chunk size: 1747626 bytes
chunker_test.go:274: 6 chunks, average chunk size: 1747626 bytes
BenchmarkChunkerWithMD5 20 82170451 ns/op 127.61 MB/s
--- BENCH: BenchmarkChunkerWithMD5
chunker_test.go:274: 6 chunks, average chunk size: 1747626 bytes
chunker_test.go:274: 6 chunks, average chunk size: 1747626 bytes
BenchmarkChunker 20 66481903 ns/op 157.72 MB/s
--- BENCH: BenchmarkChunker
chunker_test.go:274: 6 chunks, average chunk size: 1747626 bytes
chunker_test.go:274: 6 chunks, average chunk size: 1747626 bytes
BenchmarkNewChunker 20000 77459 ns/op
BenchmarkPolDivMod 20000000 63.6 ns/op
BenchmarkPolDiv 20000000 64.2 ns/op
BenchmarkPolMod 20000000 65.7 ns/op
BenchmarkPolDeg 30000000 41.6 ns/op
BenchmarkRandomPolynomial 5 452257005 ns/op
BenchmarkPolIrreducible 5 262396181 ns/op
ok github.com/restic/chunker 20.659s
I just tried to separate the chunker from the rolling hash function in a local branch and here are the results of the benchmark:
PASS
BenchmarkChunkerWithSHA256 10 141810242 ns/op 73.94 MB/s
--- BENCH: BenchmarkChunkerWithSHA256
chunker_test.go:274: 6 chunks, average chunk size: 1747626 bytes
chunker_test.go:274: 6 chunks, average chunk size: 1747626 bytes
chunker_test.go:274: 6 chunks, average chunk size: 1747626 bytes
BenchmarkChunkerWithMD5 10 103197901 ns/op 101.61 MB/s
--- BENCH: BenchmarkChunkerWithMD5
chunker_test.go:274: 6 chunks, average chunk size: 1747626 bytes
chunker_test.go:274: 6 chunks, average chunk size: 1747626 bytes
BenchmarkChunker 20 81411794 ns/op 128.80 MB/s
--- BENCH: BenchmarkChunker
chunker_test.go:274: 6 chunks, average chunk size: 1747626 bytes
chunker_test.go:274: 6 chunks, average chunk size: 1747626 bytes
BenchmarkNewChunker 20000 78330 ns/op
BenchmarkPolDivMod 30000000 58.8 ns/op
BenchmarkPolDiv 20000000 64.8 ns/op
BenchmarkPolMod 20000000 64.1 ns/op
BenchmarkPolDeg 50000000 40.5 ns/op
BenchmarkRandomPolynomial 3 507418834 ns/op
BenchmarkPolIrreducible 5 266704150 ns/op
ok github.com/restic/chunker 20.340s
So the separation seems to add roughly 15-25% overhead on the chunking benchmarks.
Having an interface would still leave open the possibility of several chunker implementations, including a highly optimized one.
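To make the trade-off concrete, here is a minimal sketch of what the separated rolling hash could look like; the names are hypothetical and not the actual restic/chunker API. The overhead measured above comes from the fact that the chunker's scan loop would call Roll and Sum64 once per input byte through dynamic dispatch, which also prevents inlining of the per-byte update:

```go
// RollingHash is a hypothetical interface for a rolling hash that has been
// separated out of the chunker. Any implementation (Rabin, buzhash, rolling
// Adler, ...) could be plugged into the chunker's scan loop.
type RollingHash interface {
	// Roll feeds the next input byte into the sliding window.
	Roll(b byte)
	// Sum64 returns the current fingerprint, which the chunker checks
	// against its boundary condition (e.g. sum&splitmask == 0).
	Sum64() uint64
	// Reset clears the window state before a new chunk is started.
	Reset()
}
```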
That's consistent with my memory. Btw, here is the output of benchcmp on the two benchmark runs:
benchmark old ns/op new ns/op delta
BenchmarkChunkerWithSHA256 122541194 141810242 +15.72%
BenchmarkChunkerWithMD5 82170451 103197901 +25.59%
BenchmarkChunker 66481903 81411794 +22.46%
BenchmarkNewChunker 77459 78330 +1.12%
BenchmarkPolDivMod 63.6 58.8 -7.55%
BenchmarkPolDiv 64.2 64.8 +0.93%
BenchmarkPolMod 65.7 64.1 -2.44%
BenchmarkPolDeg 41.6 40.5 -2.64%
BenchmarkRandomPolynomial 452257005 507418834 +12.20%
BenchmarkPolIrreducible 262396181 266704150 +1.64%
benchmark old MB/s new MB/s speedup
BenchmarkChunkerWithSHA256 85.57 73.94 0.86x
BenchmarkChunkerWithMD5 127.61 101.61 0.80x
BenchmarkChunker 157.72 128.80 0.82x
@bchapuis I have been maintaining a library of rolling hashes since early 2015. I recently realized that my implementation of the Rabin-Karp algorithm is flawed (which is somehow what brought me here, looking for good code), but you are welcome to have a look and contribute!
I have been porting this code to https://github.com/chmduquesne/rollinghash. It (mostly) works!
[Edit] It fully works now (passes all tests).
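For anyone comparing implementations, the property a correct Rabin-Karp rolling hash has to satisfy is that rolling the window forward by one byte yields exactly the same value as rehashing the new window from scratch. Here is a small, self-contained illustration of that check (the multiplier and window size are arbitrary; this is not code from either library):

```go
package main

import "fmt"

// base is an arbitrary multiplier chosen for illustration; real
// implementations use a random irreducible polynomial or a large prime.
const base = 257

// hash computes the Rabin-Karp hash of a window from scratch.
func hash(window []byte) uint64 {
	var h uint64
	for _, b := range window {
		h = h*base + uint64(b)
	}
	return h
}

// roll removes the outgoing byte and appends the incoming one in O(1):
// h_new = (h_old - out*base^(n-1)) * base + in, with n the window size.
// All arithmetic wraps mod 2^64, so the identity still holds.
func roll(h uint64, out, in byte, pow uint64) uint64 {
	return (h-uint64(out)*pow)*base + uint64(in)
}

func main() {
	data := []byte("abcdefgh")
	n := 4 // window size

	// pow = base^(n-1), the weight of the outgoing byte.
	pow := uint64(1)
	for i := 0; i < n-1; i++ {
		pow *= base
	}

	h := hash(data[:n])
	for i := n; i < len(data); i++ {
		h = roll(h, data[i-n], data[i], pow)
		// The rolled value must equal the hash computed from scratch.
		fmt.Println(h == hash(data[i-n+1 : i+1]))
	}
}
```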
I'm closing this issue; @chmduquesne's library provides the rolling hash function on its own. Thanks!
At the moment the rolling hash function is integrated into the chunker. I think it would make sense to separate the rolling hash function from the chunker and to declare interfaces for both. This would allow implementing and comparing different kinds of chunking algorithms (fixed size, thresholds, etc.) as well as different rolling hash functions (Rabin-Karp, buzhash, rolling Adler, etc.). A rough sketch of what the chunker side could look like follows.
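A minimal, purely illustrative sketch of such a chunker interface together with the simplest possible (fixed-size) implementation; none of these names exist in the current code, and a content-defined chunker built on a separated rolling hash would implement the same interface:

```go
package chunking

import "io"

// Chunk describes one piece of the input stream.
type Chunk struct {
	Start  uint64
	Length uint64
	Data   []byte
}

// Chunker splits a stream into chunks; content-defined and fixed-size
// chunkers would both implement this interface.
type Chunker interface {
	// Next returns the next chunk, or io.EOF when the stream is exhausted.
	Next() (Chunk, error)
}

// fixedChunker is the simplest possible implementation: cut every size bytes.
type fixedChunker struct {
	r     io.Reader
	size  int
	start uint64
}

func NewFixedChunker(r io.Reader, size int) Chunker {
	return &fixedChunker{r: r, size: size}
}

func (c *fixedChunker) Next() (Chunk, error) {
	buf := make([]byte, c.size)
	n, err := io.ReadFull(c.r, buf)
	if n == 0 {
		return Chunk{}, io.EOF
	}
	chunk := Chunk{Start: c.start, Length: uint64(n), Data: buf[:n]}
	c.start += uint64(n)
	if err == io.ErrUnexpectedEOF {
		err = nil // last, shorter chunk
	}
	return chunk, err
}
```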