tinylib / msgp

A Go code generator for MessagePack / msgpack.org[Go]

Bounds checks elimination #348

Closed · mhr3 closed this 2 weeks ago

mhr3 commented 2 weeks ago

I've noticed msgp's integers.go isn't using compiler tricks to remove bounds checks. After adding them, the benchmarks show a pretty nice speedup:

name                   old time/op  new time/op  delta
IntegersUnix/Get-8     2.23ns ± 1%  0.30ns ± 4%  -86.55%  (p=0.000 n=9+10)
IntegersUnix/Put-8     2.63ns ± 0%  0.80ns ±32%  -69.45%  (p=0.000 n=9+10)
IntegersPrefix/u16-8   0.66ns ±10%  0.41ns ± 0%  -38.71%  (p=0.000 n=10+8)
IntegersPrefix/u32-8   1.10ns ±33%  0.48ns ±25%  -56.60%  (p=0.000 n=10+9)
IntegersPrefix/u64-8   1.73ns ±38%  0.52ns ± 0%  -70.14%  (p=0.000 n=10+9)
Integers/Int64/Put-8   1.64ns ±13%  0.53ns ± 1%  -67.63%  (p=0.000 n=10+8)
Integers/Int64/Get-8   0.30ns ± 3%  0.29ns ± 4%   -3.73%  (p=0.006 n=8+10)
Integers/Int32/Put-8   0.94ns ± 4%  0.46ns ±10%  -50.81%  (p=0.000 n=10+8)
Integers/Int32/Get-8   0.31ns ± 3%  0.29ns ± 3%   -5.12%  (p=0.000 n=8+10)
Integers/Int16/Put-8   0.65ns ± 4%  0.48ns ±42%  -26.61%  (p=0.002 n=10+10)
Integers/Int16/Get-8   0.30ns ± 3%  0.29ns ± 1%   -3.03%  (p=0.000 n=8+9)
Integers/Uint64/Put-8  1.56ns ± 3%  0.62ns ±27%  -60.40%  (p=0.000 n=8+10)
Integers/Uint64/Get-8  0.30ns ± 3%  0.30ns ± 1%     ~     (p=0.127 n=10+10)
Integers/Uint32/Put-8  0.96ns ± 2%  0.45ns ± 7%  -52.98%  (p=0.000 n=9+8)
Integers/Uint32/Get-8  0.31ns ± 3%  0.30ns ± 2%   -2.29%  (p=0.002 n=9+9)
Integers/Uint16/Put-8  0.65ns ± 3%  0.41ns ± 8%  -36.59%  (p=0.000 n=10+9)
Integers/Uint16/Get-8  0.34ns ±28%  0.30ns ± 2%     ~     (p=0.075 n=10+9)

The most significant speedup is in the unix put/get functions: for some reason the compiler wasn't inlining them, but after switching to the helpers from the encoding/binary package it now does. (Doing the same for all the other functions showed a slight slowdown, so I didn't change those.)
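
For reference, the shape of the change is roughly this (a sketch, not the exact code from the PR; it assumes the unix payload is 8 big-endian bytes of seconds followed by 4 bytes of nanoseconds, and that encoding/binary is imported):

func putUnix(b []byte, sec int64, nsec int32) {
    _ = b[11] // bounds-check hint: proves len(b) >= 12 once, up front
    binary.BigEndian.PutUint64(b[0:8], uint64(sec))
    binary.BigEndian.PutUint32(b[8:12], uint32(nsec))
}

func getUnix(b []byte) (sec int64, nsec int32) {
    _ = b[11] // same hint on the read side
    sec = int64(binary.BigEndian.Uint64(b[0:8]))
    nsec = int32(binary.BigEndian.Uint32(b[8:12]))
    return sec, nsec
}

The binary.BigEndian getters and putters are small enough to inline, and with the hint the compiler can drop the per-access bounds checks.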

philhofer commented 2 weeks ago

Nice, thanks.

klauspost commented 2 weeks ago

Awesome find!

func getMint64(b []byte) int64 {
    return int64(binary.BigEndian.Uint64(b[1:9]))
}

... is a shorter version with only one bounds check (it emits the same code on amd64), and it also gets inlined.

No need to change anything, though.
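
For completeness, the write side can presumably use the same shape (a sketch; the mint64 prefix constant name is an assumption here):

func putMint64(b []byte, i int64) {
    b[0] = mint64 // msgpack int64 prefix byte (constant name assumed)
    binary.BigEndian.PutUint64(b[1:9], uint64(i))
}

binary.BigEndian.PutUint64 is small enough to inline, so this should keep the write path inlinable as well.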

klauspost commented 2 weeks ago

If you can get m.R.Next(...) and m.R.Peek(...) to inline the "happy path", that would probably give a similar speedup.

Tried it briefly, but even...

func (r *Reader) Next(n int) ([]byte, error) {
    // Have happy path be short and possible to inline.
    if len(r.data)-r.n >= n && r.state == nil {
        return r.data[r.n:], nil
    }
    return r.next(n)
}

// next is a helper function for Next, called when the buffer does not contain n bytes.
func (r *Reader) next(n int) ([]byte, error) {
...

...is too costly.

Also, m.R.Peek(1) is so common that it could have a specialized call, func (r *Reader) PeekByte() (byte, error), but again I couldn't get it to inline with some basic attempts (rough sketch below).

Having the "happy path" first was slightly faster, but nothing major; maybe 5%.
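
For concreteness, the split-out shape I had in mind for PeekByte would be something like the following (names are placeholders, and the fields mirror the Next sketch above; this is the variant that didn't fit the inlining budget in my quick attempts):

// PeekByte returns the next byte without consuming it, keeping the
// common case tiny so the compiler has a chance to inline it.
func (r *Reader) PeekByte() (byte, error) {
    if r.n < len(r.data) && r.state == nil {
        return r.data[r.n], nil
    }
    return r.peekByteSlow()
}

// peekByteSlow falls back to the general Peek path when the buffer is
// empty or the reader is in an error state.
func (r *Reader) peekByteSlow() (byte, error) {
    b, err := r.Peek(1)
    if err != nil {
        return 0, err
    }
    return b[0], nil
}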

mhr3 commented 2 weeks ago

I did try the shorter getMint64(b[1:9]) version, but in my benchmarking it was a teeny bit slower (on arm64); I did not check the generated assembly, though.