pierrec / lz4

LZ4 compression and decompression in pure Go
BSD 3-Clause "New" or "Revised" License

[arm64] Performance regression from removal of 4x loop decoding #205

Closed lizthegrey closed 1 year ago

lizthegrey commented 1 year ago

https://github.com/pierrec/lz4/commit/1cbdd8169b0510935efd9164dbbea4dcda897fb6 appears to have removed the unrolled 4x copy. As a result, the finishing copy, when the remainder is less than 8 bytes long, may perform up to 7 individual unaligned byte copies, rather than the maximum of 3 individual unaligned byte copies from before. Profiling shows that 10% of all time spent in the lz4 library is spent on the single line at https://github.com/pierrec/lz4/blob/v4/internal/lz4block/decode_arm64.s#L206

I suspect we need to put that aligned 4x unrolled access back. Do you concur, @greatroar?
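For context, here is a minimal Go sketch (a hypothetical helper, not the library's actual code) of why a 4x path keeps the byte loop short: with a match offset of at least 4, word-sized copies never read bytes that have not been written yet, so at most 3 trailing bytes ever need individual copies.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// copyMatch4 is a hypothetical sketch of a 4x match-copy path. When the
// match offset is at least 4, the source lags the destination by >= 4
// bytes, so overlapping 4-byte loads/stores are safe. The byte loop then
// runs at most 3 times, matching the "maximum of 3 individual unaligned
// byte copies" described above.
func copyMatch4(dst []byte, di, offset, length int) int {
	mi := di - offset
	if offset >= 4 {
		for length >= 4 {
			binary.LittleEndian.PutUint32(dst[di:], binary.LittleEndian.Uint32(dst[mi:]))
			di, mi, length = di+4, mi+4, length-4
		}
	}
	for ; length > 0; length-- { // <= 3 iterations when offset >= 4
		dst[di] = dst[mi]
		di, mi = di+1, mi+1
	}
	return di
}

func main() {
	dst := make([]byte, 16)
	copy(dst, "abcd")
	n := copyMatch4(dst, 4, 4, 10)
	fmt.Println(n, string(dst[:n])) // prints: 14 abcdabcdabcdab
}
```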

lizthegrey commented 1 year ago

Pinging @pierrec and @greatroar again for thoughts. This is quite bad IMO.

  Total:    1174.89s   1174.89s (flat, cum) 10.73%
     31            .          .           // func decodeBlock(dst, src, dict []byte) int 
     32            .          .           TEXT ·decodeBlock(SB), NOFRAME+NOSPLIT, $0-80 
[...]
    183            .          .           copyMatchTry8: 
    184            .          .             // Copy doublewords if both len and offset are at least eight. 
    185            .          .             // A 16-at-a-time loop doesn't provide a further speedup. 
    186       30.24s     30.24s             CMP  $8, len 
    187        2.34s      2.34s             CCMP HS, offset, $8, $0 
    188        250ms      250ms             BLO  copyMatchLoop1 
    189            .          .            
    190        4.98s      4.98s             AND    $7, len, lenRem 
    191        9.83s      9.83s             SUB    $8, len 
    192            .          .           copyMatchLoop8: 
    193        3.85s      3.85s             MOVD.P 8(match), tmp1 
    194       62.76s     62.76s             MOVD.P tmp1, 8(dst) 
    195        2.55s      2.55s             SUBS   $8, len 
    196        1.93s      1.93s             BPL    copyMatchLoop8 
    197            .          .            
    198        910ms      910ms             MOVD (match)(len), tmp2 // match+len == match+lenRem-8. 
    199       24.24s     24.24s             ADD  lenRem, dst 
    200         60ms       60ms             MOVD $0, len 
    201        440ms      440ms             MOVD tmp2, -8(dst) 
    202        550ms      550ms             B    copyMatchDone 
    203            .          .            
    204            .          .           copyMatchLoop1: 
    205            .          .             // Byte-at-a-time copy for small offsets. 
    206      132.84s    132.84s             MOVBU.P 1(match), tmp2 
    207      177.05s    177.05s             MOVB.P  tmp2, 1(dst) 
    208        8.42s      8.42s             SUBS    $1, len 
    209        5.27s      5.27s             BNE     copyMatchLoop1 
    210            .          .            
    211            .          .           copyMatchDone: 
    212       18.42s     18.42s             CMP src, srcend 
    213       12.36s     12.36s             BNE loop 
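For readers less fluent in arm64 assembly, the listing above corresponds roughly to the following Go sketch (hypothetical helper name; the real decoder is assembly): doubleword copies when both len and offset are at least 8, otherwise the byte-at-a-time copyMatchLoop1 that dominates the profile.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// copyMatch is a hypothetical Go rendering of the match-copy code above.
// dst[di:] receives length bytes copied from dst[di-offset:]; the regions
// may overlap, so 8-byte copies are only safe when offset >= 8.
func copyMatch(dst []byte, di, offset, length int) int {
	mi := di - offset
	if offset >= 8 {
		// copyMatchLoop8: doubleword copies. (The assembly handles the
		// 1..7 byte remainder with one overlapping 8-byte store; this
		// sketch just falls through to the byte loop.)
		for length >= 8 {
			binary.LittleEndian.PutUint64(dst[di:], binary.LittleEndian.Uint64(dst[mi:]))
			di, mi, length = di+8, mi+8, length-8
		}
	}
	// copyMatchLoop1: byte-at-a-time copy for small offsets and the tail --
	// the hot loop in the profile above.
	for ; length > 0; length-- {
		dst[di] = dst[mi]
		di, mi = di+1, mi+1
	}
	return di
}

func main() {
	dst := make([]byte, 32)
	copy(dst, "abcdefgh")
	n := copyMatch(dst, 8, 8, 10)
	fmt.Println(n, string(dst[:n])) // prints: 18 abcdefghabcdefghab
}
```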

evanphx commented 10 months ago

Hi all, I know this is closed, but some work over the holidays led me to dig into this more, and I wanted to get y'all's thoughts on #215, which is directly relevant to this sort of unrolling. Thanks!