pierrec / lz4

LZ4 compression and decompression in pure Go
BSD 3-Clause "New" or "Revised" License

internal/lz4block: Copy literals of <=48 bytes through XMM registers in amd64 decoder #161

Closed greatroar closed 2 years ago

greatroar commented 2 years ago

Another optimization for the amd64 decoder, inspired by one of its comments:

name                old speed      new speed      delta
UncompressPg1661-8  1.15GB/s ± 1%  1.19GB/s ± 1%   +3.39%  (p=0.000 n=10+10)
UncompressDigits-8  1.89GB/s ± 0%  2.33GB/s ± 1%  +23.46%  (p=0.000 n=9+10)
UncompressTwain-8   1.19GB/s ± 1%  1.23GB/s ± 0%   +3.43%  (p=0.000 n=10+10)
UncompressRand-8    3.93GB/s ± 2%  3.96GB/s ± 1%     ~     (p=0.105 n=10+10)

The effect is most pronounced on Digits because 37.4% of its literals have lengths of 17-48 bytes. In Twain and Pg1661, that share is below 4.1%.
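For readers who want to check a literal-length distribution like the one quoted above, here is a minimal sketch that walks a raw LZ4 block and records the literal length of each sequence. The names `literalLengths` and `shareInRange` are hypothetical and not part of pierrec/lz4. It assumes a bare block (no frame header) and skips zero-length literal runs when computing the share. How the figures in this PR were measured is not stated, so the counting rule here is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

var errTruncated = errors.New("truncated LZ4 block")

// literalLengths (hypothetical helper) returns the literal length of every
// sequence in a raw LZ4 block. It does not decode anything.
func literalLengths(block []byte) ([]int, error) {
	var lens []int
	i := 0
	for i < len(block) {
		token := block[i]
		i++

		// High nibble: literal length. A value of 15 means extension bytes
		// follow; each is added, and a byte other than 255 ends the run.
		litLen := int(token >> 4)
		if litLen == 15 {
			for {
				if i >= len(block) {
					return nil, errTruncated
				}
				b := block[i]
				i++
				litLen += int(b)
				if b != 255 {
					break
				}
			}
		}
		if litLen > len(block)-i {
			return nil, errTruncated
		}
		lens = append(lens, litLen)
		i += litLen

		// The final sequence stops after its literals.
		if i == len(block) {
			break
		}

		// Skip the 2-byte match offset.
		if len(block)-i < 2 {
			return nil, errTruncated
		}
		i += 2

		// Low nibble: match length. Only its extension bytes need skipping.
		if token&0x0f == 15 {
			for {
				if i >= len(block) {
					return nil, errTruncated
				}
				b := block[i]
				i++
				if b != 255 {
					break
				}
			}
		}
	}
	return lens, nil
}

// shareInRange returns the fraction of non-empty literal runs whose length
// lies in [lo, hi]. Ignoring empty runs is an assumption of this sketch.
func shareInRange(lens []int, lo, hi int) float64 {
	total, hits := 0, 0
	for _, n := range lens {
		if n == 0 {
			continue
		}
		total++
		if n >= lo && n <= hi {
			hits++
		}
	}
	if total == 0 {
		return 0
	}
	return float64(hits) / float64(total)
}

func main() {
	// Hand-built block: a 20-byte literal run, one match, a 3-byte final run.
	block := []byte{0xF1, 5}
	block = append(block, make([]byte, 20)...)
	block = append(block, 1, 0, 0x30, 'a', 'b', 'c')

	lens, err := literalLengths(block)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("lengths: %v, share in 17-48: %.1f%%\n", lens, 100*shareInRange(lens, 17, 48))
}
```

Running this over the compressed form of each benchmark corpus would show how many literal runs fall into the 17-48 byte range that the new copy path covers.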

Copying up to 48 bytes this way is faster than copying up to 32 bytes. At 64 bytes, Digits gets faster still, while Twain and Pg1661 get slightly slower.
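The general idea behind a fixed-size short-literal path can be sketched in Go. This is an illustration only, not the assembly from this PR: `copyLiteral` and `shortLitMax` are hypothetical names, and the three 16-byte `copy` calls merely stand in for moves through three 16-byte XMM registers. The sketch assumes the decoder's later writes overwrite whatever the fast path copies beyond `n`.

```go
package main

import "fmt"

// shortLitMax is the threshold from the PR title: three 16-byte chunks.
const shortLitMax = 48

// copyLiteral (hypothetical) copies n literal bytes from src to dst.
// When n is small and both buffers have at least 48 bytes available, it
// copies a fixed 48 bytes and lets the caller advance by n, trading a few
// wasted bytes for a branch-free, constant-size copy.
// The caller must guarantee n <= len(src) and n <= len(dst).
func copyLiteral(dst, src []byte, n int) {
	if n <= shortLitMax && len(src) >= shortLitMax && len(dst) >= shortLitMax {
		// Fast path: always move 48 bytes, regardless of n.
		copy(dst[0:16], src[0:16])
		copy(dst[16:32], src[16:32])
		copy(dst[32:48], src[32:48])
		return
	}
	// Slow path: exact-length copy, also used near the end of a buffer.
	copy(dst[:n], src[:n])
}

func main() {
	src := make([]byte, 64)
	for i := range src {
		src[i] = byte(i)
	}
	dst := make([]byte, 64)

	// Only 20 bytes are wanted, but the fast path writes 48.
	copyLiteral(dst, src, 20)
	fmt.Println("wanted bytes:", dst[:20])
	fmt.Println("byte 47 (overcopied):", dst[47], "byte 48 (untouched):", dst[48])
}
```

A larger threshold widens the fast path to more literals but also raises the cost paid by every short literal, which is consistent with the mixed results reported at 64 bytes.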