klauspost closed 11 months ago
R13 contains the pointer to the end of the source buffer, so there is no need to calculate it on every check when reading.
The performance increase depends on how often the code is hit, but it is often around 8% added throughput.
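As a rough illustration of the idea in Go rather than in the assembly this change touches, the sketch below contrasts rebuilding the end position before every bounds check with computing it once and reusing it. The function names `sumRecomputed` and `sumCached` and the `base`/`length` parameters are hypothetical; the cached `end` variable only stands in for the role R13 plays, and the Go compiler may optimise both versions to the same code.

```go
package main

import "fmt"

// sumRecomputed models the old pattern: the end of the source is
// rebuilt from base+length before every read. Hypothetical helper,
// not code from this change.
func sumRecomputed(buf []byte, base, length int) int {
	total := 0
	for pos := base; ; pos++ {
		end := base + length // recomputed on every check
		if pos >= end {
			break
		}
		total += int(buf[pos])
	}
	return total
}

// sumCached models the new pattern: the end is computed once and kept
// around, standing in for the end pointer held in R13.
func sumCached(buf []byte, base, length int) int {
	total := 0
	end := base + length // computed once, reused for every check
	for pos := base; pos < end; pos++ {
		total += int(buf[pos])
	}
	return total
}

func main() {
	buf := []byte{1, 2, 3, 4, 5, 6}
	// Both variants read the same bytes and must agree.
	fmt.Println(sumRecomputed(buf, 1, 4), sumCached(buf, 1, 4))
}
```

Both functions perform the same reads; the only difference is where the end of the buffer is computed, which is the saving the change relies on.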
>benchstat old.txt new.txt
name         old time/op    new time/op    delta
_UFlat0-8      54.5µs ± 5%    50.0µs ± 3%  -8.34%  (p=0.000 n=10+10)
_UFlat1-8       577µs ± 6%     541µs ± 2%  -6.19%  (p=0.000 n=10+10)
_UFlat2-8      8.59µs ± 2%    8.74µs ± 5%    ~     (p=0.287 n=9+10)
_UFlat3-8       129ns ± 4%     125ns ± 3%  -3.45%  (p=0.001 n=9+9)
_UFlat4-8      8.36µs ± 3%    7.68µs ± 4%  -8.15%  (p=0.000 n=9+10)
_UFlat5-8       239µs ± 4%     219µs ± 1%  -8.14%  (p=0.000 n=10+9)
_UFlat6-8       212µs ± 3%     206µs ± 2%  -2.99%  (p=0.001 n=10+10)
_UFlat7-8       178µs ± 1%     175µs ± 1%  -1.90%  (p=0.000 n=9+9)
_UFlat8-8       565µs ± 5%     552µs ± 2%    ~     (p=0.052 n=10+10)
_UFlat9-8       752µs ± 2%     739µs ± 2%  -1.79%  (p=0.007 n=10+10)
_UFlat10-8     48.1µs ± 2%    43.9µs ± 2%  -8.74%  (p=0.000 n=10+10)
_UFlat11-8      199µs ± 4%     197µs ± 2%    ~     (p=0.436 n=10+10)

name         old speed      new speed      delta
_UFlat0-8    1.88GB/s ± 5%  2.05GB/s ± 3%  +9.05%  (p=0.000 n=10+10)
_UFlat1-8    1.22GB/s ± 6%  1.30GB/s ± 2%  +6.49%  (p=0.000 n=10+10)
_UFlat2-8    14.3GB/s ± 2%  14.1GB/s ± 5%    ~     (p=0.315 n=9+10)
_UFlat3-8    1.53GB/s ± 7%  1.59GB/s ± 5%  +3.85%  (p=0.005 n=10+10)
_UFlat4-8    12.2GB/s ± 5%  13.3GB/s ± 4%  +9.50%  (p=0.000 n=10+10)
_UFlat5-8    1.72GB/s ± 4%  1.87GB/s ± 1%  +8.80%  (p=0.000 n=10+9)
_UFlat6-8     716MB/s ± 3%   738MB/s ± 2%  +3.06%  (p=0.001 n=10+10)
_UFlat7-8     702MB/s ± 1%   715MB/s ± 1%  +1.93%  (p=0.000 n=9+9)
_UFlat8-8     756MB/s ± 5%   773MB/s ± 2%  +2.28%  (p=0.050 n=10+10)
_UFlat9-8     641MB/s ± 2%   652MB/s ± 2%  +1.81%  (p=0.006 n=10+10)
_UFlat10-8   2.47GB/s ± 2%  2.70GB/s ± 2%  +9.56%  (p=0.000 n=10+10)
_UFlat11-8    928MB/s ± 4%   936MB/s ± 2%    ~     (p=0.436 n=10+10)