pierrec / lz4

LZ4 compression and decompression in pure Go
BSD 3-Clause "New" or "Revised" License

[reader] optimize hot Reader.Read() path #200

Closed by lizthegrey 2 years ago

lizthegrey commented 2 years ago

See https://gist.github.com/lizthegrey/0ce7f8cd4a70ecedb5c299dfc0332976 for full disassembly

    103       19.10s     19.25s           func (r *Reader) Read(buf []byte) (n int, err error) { 
    104          39s     39.31s             defer r.state.check(&err) 
[...]
    156       31.14s     59.20s             return 

A huge amount of call overhead is incurred by running the deferred r.state.check(&err) on every Read() call; it can be avoided when err is nil.

lizthegrey commented 2 years ago

benchcmp stats:

amd64:

benchmark                        old ns/op     new ns/op     delta
BenchmarkUncompress-12           5.96          5.95          -0.25%
BenchmarkUncompressPg1661-12     264289        260036        -1.61%
BenchmarkUncompressDigits-12     29851         29641         -0.70%
BenchmarkUncompressTwain-12      167690        164982        -1.61%
BenchmarkUncompressRand-12       4572          4132          -9.62%

benchmark                        old MB/s     new MB/s     speedup
BenchmarkUncompressPg1661-12     1427.86      1451.21      1.02x
BenchmarkUncompressDigits-12     3192.04      3214.57      1.01x
BenchmarkUncompressTwain-12      1529.03      1554.13      1.02x
BenchmarkUncompressRand-12       3587.52      3969.48      1.11x

benchmark                        old allocs     new allocs     delta
BenchmarkUncompress-12           0              0              +0.00%
BenchmarkUncompressPg1661-12     4              4              +0.00%
BenchmarkUncompressDigits-12     4              4              +0.00%
BenchmarkUncompressTwain-12      4              4              +0.00%
BenchmarkUncompressRand-12       4              4              +0.00%

benchmark                        old bytes     new bytes     delta
BenchmarkUncompress-12           0             0             +0.00%
BenchmarkUncompressPg1661-12     184           184           +0.00%
BenchmarkUncompressDigits-12     184           190           +3.26%
BenchmarkUncompressTwain-12      184           184           +0.00%
BenchmarkUncompressRand-12       185           185           +0.00%

arm64:

benchmark                       old ns/op     new ns/op     delta
BenchmarkUncompress-4           9.21          9.13          -0.88%
BenchmarkUncompressPg1661-4     946356        954336        +0.84%
BenchmarkUncompressDigits-4     62271         61885         -0.62%
BenchmarkUncompressTwain-4      598823        599040        +0.04%
BenchmarkUncompressRand-4       4577          4510          -1.46%

benchmark                       old MB/s     new MB/s     speedup
BenchmarkUncompressPg1661-4     398.76       395.42       0.99x
BenchmarkUncompressDigits-4     1530.14      1539.69      1.01x
BenchmarkUncompressTwain-4      428.18       428.02       1.00x
BenchmarkUncompressRand-4       3583.71      3637.39      1.01x

benchmark                       old allocs     new allocs     delta
BenchmarkUncompress-4           0              0              +0.00%
BenchmarkUncompressPg1661-4     4              4              +0.00%
BenchmarkUncompressDigits-4     4              4              +0.00%
BenchmarkUncompressTwain-4      4              4              +0.00%
BenchmarkUncompressRand-4       4              4              +0.00%

benchmark                       old bytes     new bytes     delta
BenchmarkUncompress-4           0             0             +0.00%
BenchmarkUncompressPg1661-4     184           184           +0.00%
BenchmarkUncompressDigits-4     184           197           +7.07%
BenchmarkUncompressTwain-4      707           184           -73.97%
BenchmarkUncompressRand-4       186           185           -0.54%

However, this will have a much larger effect on longer files, where Read() is called many more times.

lizthegrey commented 2 years ago

Hm, this didn't have the effect I wanted at scale. I'll keep tweaking.