Avoid allocating table on stack.

klauspost commented 8 years ago

The current implementation allocates a 64-128KB table on the stack on every call to Encode.

While this is reasonably fast, with a slight modification we can reuse this table across calls. This is a particular gain for the framing encoder, but even for the stateless Encode function this gives a slight speedup.

Furthermore the table is now int32, which saves 50% memory on 64 bit systems. This change does not itself affect the speed, but only the memory usage.

To be able to reuse buffers across calls to Encode we store the used tables in a sync.Pool.

The table size is now fixed. We can experiment with the size later.

>go test -bench=ZFlat >new.txt && benchcmp old.txt new.txt
benchmark               old ns/op     new ns/op     delta
Benchmark_ZFlat0-8      347674        334371        -3.83%
Benchmark_ZFlat1-8      4742511       4677414       -1.37%
Benchmark_ZFlat2-8      1098529       1043970       -4.97%
Benchmark_ZFlat3-8      1101338       1052056       -4.47%
Benchmark_ZFlat4-8      789205        769194        -2.54%
Benchmark_ZFlat5-8      1330515       1349258       +1.41%
Benchmark_ZFlat6-8      1145645       1139680       -0.52%
Benchmark_ZFlat7-8      1051315       988243        -6.00%
Benchmark_ZFlat8-8      3153910       3096453       -1.82%
Benchmark_ZFlat9-8      3997054       4006660       +0.24%
Benchmark_ZFlat10-8     348186        339939        -2.37%
Benchmark_ZFlat11-8     935487        925993        -1.01%

benchmark               old MB/s     new MB/s     speedup
Benchmark_ZFlat0-8      294.53       306.25       1.04x
Benchmark_ZFlat1-8      148.04       150.10       1.01x
Benchmark_ZFlat2-8      112.05       117.91       1.05x
Benchmark_ZFlat3-8      111.77       117.00       1.05x
Benchmark_ZFlat4-8      129.75       133.13       1.03x
Benchmark_ZFlat5-8      307.85       303.57       0.99x
Benchmark_ZFlat6-8      132.75       133.45       1.01x
Benchmark_ZFlat7-8      119.07       126.67       1.06x
Benchmark_ZFlat8-8      135.31       137.82       1.02x
Benchmark_ZFlat9-8      120.55       120.26       1.00x
Benchmark_ZFlat10-8     340.59       348.85       1.02x
Benchmark_ZFlat11-8     197.03       199.05       1.01x

nigeltao commented 8 years ago

I still have some significant concerns about this PR in its entirety. One point is that it does at least three things: it changes the hash table value type from int to int32, it uses a sync.Pool to avoid mallocs and subsequent GC, and uses an 'encoder.cur' value to avoid having to reset the hash table to zeroes. Sure each change is a small amount of code, but they are separable changes.

As I said on https://go-review.googlesource.com/#/c/19336/:

""" If it turns out that things A, B and D are all relatively simple code changes and gets you +9%, and C is very complicated and gets you +1%, I would recommend rejecting C even if that +1% was statistically significant. Raw MB/s throughput is not the only criteria in which the Go standard library code is judged. As this CL is packaged now, though, we either have to accept all four or reject all four.

Accordingly, my recommendation would be to break this CL up into a smaller ones... As always, smaller yet still self-contained CLs are easier to review, and easier to manage (e.g. roll back if necessary). """

Another point is that, as I said above a few days ago, I'm still not 100% convinced that the e.cur mechanism is correct when it overflows. You check for t < 0, and then for t >= s (via the offset variable), but I think that it is still possible to have 0 <= t && t < s yet the hash table entry being from a previous sync.Pool use. Sure, it would be very, very rare for that to happen, but also, very, very hard to reproduce bugs are the hardest ones to debug.

Related to that concern is code like "*p = int32(s + sadd)" where s (an int, being in the range 0 <= s && s < len(src)) doesn't necessarily fit into an int32 and you can silently overflow.

Sure, what you've done may actually end up being 100% safe, but I'd need some convincing, and I'd need that proof to be written as a comment, and it'd have to be something longer than "If t >= s this will be negative, when converted to a uint this will always be > maxOffset."

As I also said above a few days ago, I'd like to see additional tests, especially around the wrap point. As it is, there are no new tests in this PR.

In a previous PR, #19, you have already expressed some frustration that you are wasting your time with my code review of your pull requests.

You have also mentioned that your goals aren't necessarily the same as my goals, e.g. byte-for-byte compatibility with the C++ encoding algorithm.

Accordingly, I suspect that it's not worth us both spending time iterating on this PR until I'm satisfied with it. I'm happy to continue the conversation if I'm wrong and you do want to spend more time on it, but otherwise, I'll incorporate some of your suggestions as separate commits, such as cc71ae7.

nigeltao commented 8 years ago

I'm probably flogging a dead horse, in the metaphorical sense, but a "byte-for-byte compatibility with the C++ encoding algorithm" test would have at least detected the bug fixed by d1d908a2 for free, without requiring an explicit "I wonder if these benchmark improvements are too good to be true" investigation.

klauspost commented 8 years ago

Good point. Here is a test:

// TestEncodeNoiseThenRepeats encodes a 8MB block for which the first half is
// very incompressible and the second half is very compressible. The encoded
// form's length should be closer to 50% of the original length than 100%.
func TestEncodeNoiseThenRepeatsBig(t *testing.T) {
    const origLen = 8 << 20
    src := make([]byte, origLen)
    rng := rand.New(rand.NewSource(1))
    firstHalf, secondHalf := src[:origLen/2], src[origLen/2:]
    for i := range firstHalf {
        firstHalf[i] = uint8(rng.Intn(256))
    }
    for i := range secondHalf {
        secondHalf[i] = uint8(i >> 8)
    }
    dst := Encode(nil, src)
    if got, want := len(dst), origLen*3/4; got >= want {
        t.Fatalf("got %d encoded bytes, want less than %d", got, want)
    }
}

Once "skip" gets above maxOffset << 5, all matches will fail for the rest of the block. This means that nothing following a 1MB uncompressible block will ever be compressed.

The official Snappy has a maximum block size of 64k (not 1GB), which prevents this bug from showing up, but also eliminates all matches across blocks. My own solution is s += 1 + (((s-lit)>>4) & 127), which limits skips to 127 bytes, but keeps the progressive step-up.

So, you are left with the choice, if you want C++ compatibility and make compression worse(~1.5%), or keep the IMO better compressor?

btw, if you want compatibility in that aspect, you've made it very easy, it's just const maxInternalEncodeSrcLen = (1 << 16).

nigeltao commented 8 years ago

I went for C++ compatibility. As of 8939696c, it should now output byte-for-byte identical output as the C++ snappy encoder.

In an earlier comment, I noted that this PR does at least three things:

it changes the hash table value type from int to int32,
it uses a sync.Pool to avoid mallocs and subsequent GC, and
uses an 'encoder.cur' value to avoid having to reset the hash table to zeroes.

Point 1 was done separately in an earlier commit (cc71ae7c). Having reached C++ compatibility, I am curious about new numbers for points 2 and 3, but I would still like to see them as separate changes, and I would still like some extra tests if possible and especially extra commentary for point 3.

If you are happy to keep working on this PR to re-base and address those points, I'm willing to keep this PR open.

If you have no shortage of other things to do with your time, I'm otherwise happy to work on those points myself over the next week or two.

nigeltao commented 8 years ago

I re-based this patch (and committed it on a separate branch). I might have screwed up somewhere, but I'm not convinced that points 2 and 3 are a net win on the benchmarks any more, but they still make the code more complicated.

See https://github.com/golang/snappy/commit/fdf5d5de7f4e9fb3760d9421a87402c249a05326

nigeltao commented 7 years ago

The snappy codebase has changed a lot since this PR, so I'm closing this as obsolete. Thanks anyway for all of the optimization discussions.

golang / snappy

Avoid allocating table on stack. #23