twmb closed this 8 years ago
It appears that this ends up only slightly faster than Go 1.4.2. The performance gained between 1.3 and 1.4.2 was lost in 1.5 and 1.6, but Go tip shows performance returning to 1.4.2 levels, meaning the asm will provide minimal benefit going forward.
Because of this, and the overhead that extra code would entail, I'll close this PR in favor of a new PR adding only the middle two commits.
Ok. I was wondering as well how large the impact would be w.r.t. tip. Out of curiosity, do you have those numbers to share?
Incidentally, I can confirm there is interest in the NewWithSeedxxx constructors.
All comparisons are master vs. my branch asm_amd64. Also, non-amd64 platforms will see ~3ns of overhead on top of my benchmark numbers because of the added function call.
1.4.2, which doesn't take advantage of the assembly but does duplicate the digest func bodies into SeedSum128 (note the extra ~2ns on every Benchmark32):
benchmark old ns/op new ns/op delta
Benchmark32_1 7.03 9.44 +34.28%
Benchmark32_2 7.85 10.1 +28.66%
Benchmark32_4 6.81 9.12 +33.92%
Benchmark32_8 8.04 11.0 +36.82%
Benchmark32_16 10.5 13.3 +26.67%
Benchmark32_32 15.9 19.4 +22.01%
Benchmark32_64 28.6 32.6 +13.99%
Benchmark32_128 55.2 63.8 +15.58%
Benchmark32_256 117 122 +4.27%
Benchmark32_512 234 239 +2.14%
Benchmark32_1024 468 472 +0.85%
Benchmark32_2048 936 939 +0.32%
Benchmark32_4096 1873 1875 +0.11%
Benchmark32_8192 3746 3743 -0.08%
Benchmark128_1 27.4 12.5 -54.38%
Benchmark128_2 28.2 12.8 -54.61%
Benchmark128_4 29.6 13.8 -53.38%
Benchmark128_8 32.3 17.1 -47.06%
Benchmark128_16 25.6 13.5 -47.27%
Benchmark128_32 28.6 15.7 -45.10%
Benchmark128_64 34.2 21.4 -37.43%
Benchmark128_128 46.5 33.3 -28.39%
Benchmark128_256 71.1 57.6 -18.99%
Benchmark128_512 120 111 -7.50%
Benchmark128_1024 222 211 -4.95%
Benchmark128_2048 420 409 -2.62%
Benchmark128_4096 815 803 -1.47%
Benchmark128_8192 1604 1596 -0.50%
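For readers unfamiliar with the delta column: it is the relative change from old to new ns/op, so a negative value means the branch is faster. A minimal sketch of that arithmetic, using three rows copied from the 1.4.2 table above (the `delta` helper and the row struct are illustrative names, not part of any tool):

```go
package main

import "fmt"

// delta returns the percentage change from old to new ns/op.
// Negative means the new code is faster.
func delta(oldNs, newNs float64) float64 {
	return (newNs - oldNs) / oldNs * 100
}

func main() {
	// Rows copied from the 1.4.2 table above.
	rows := []struct {
		name         string
		oldNs, newNs float64
	}{
		{"Benchmark32_1", 7.03, 9.44},
		{"Benchmark128_1", 27.4, 12.5},
		{"Benchmark128_8192", 1604, 1596},
	}
	for _, r := range rows {
		fmt.Printf("%-18s %+.2f%%\n", r.name, delta(r.oldNs, r.newNs))
	}
}
```

The same formula reproduces the delta column in the other tables.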
1.5.3:
benchmark old ns/op new ns/op delta
Benchmark32_1-4 6.83 5.07 -25.77%
Benchmark32_2-4 7.85 5.26 -32.99%
Benchmark32_4-4 6.64 5.22 -21.39%
Benchmark32_8-4 7.98 6.24 -21.80%
Benchmark32_16-4 10.5 8.46 -19.43%
Benchmark32_32-4 15.8 12.8 -18.99%
Benchmark32_64-4 28.5 23.4 -17.89%
Benchmark32_128-4 55.5 46.1 -16.94%
Benchmark32_256-4 116 94.4 -18.62%
Benchmark32_512-4 234 191 -18.38%
Benchmark32_1024-4 468 386 -17.52%
Benchmark32_2048-4 936 777 -16.99%
Benchmark32_4096-4 1874 1557 -16.92%
Benchmark32_8192-4 3746 3118 -16.76%
Benchmark128_1-4 24.3 8.54 -64.86%
Benchmark128_2-4 25.2 8.90 -64.68%
Benchmark128_4-4 26.6 9.12 -65.71%
Benchmark128_8-4 29.2 9.92 -66.03%
Benchmark128_16-4 23.6 9.75 -58.69%
Benchmark128_32-4 26.7 11.5 -56.93%
Benchmark128_64-4 32.6 16.0 -50.92%
Benchmark128_128-4 44.9 26.9 -40.09%
Benchmark128_256-4 69.7 45.7 -34.43%
Benchmark128_512-4 119 83.5 -29.83%
Benchmark128_1024-4 221 164 -25.79%
Benchmark128_2048-4 418 320 -23.44%
Benchmark128_4096-4 813 632 -22.26%
Benchmark128_8192-4 1603 1257 -21.58%
1.6.2:
benchmark old ns/op new ns/op delta
Benchmark32_1-4 6.95 5.04 -27.48%
Benchmark32_2-4 7.81 5.25 -32.78%
Benchmark32_4-4 6.77 5.10 -24.67%
Benchmark32_8-4 8.03 6.22 -22.54%
Benchmark32_16-4 10.6 8.39 -20.85%
Benchmark32_32-4 15.8 12.7 -19.62%
Benchmark32_64-4 29.0 23.1 -20.34%
Benchmark32_128-4 55.4 42.6 -23.10%
Benchmark32_256-4 117 94.6 -19.15%
Benchmark32_512-4 234 191 -18.38%
Benchmark32_1024-4 468 386 -17.52%
Benchmark32_2048-4 937 777 -17.08%
Benchmark32_4096-4 1873 1557 -16.87%
Benchmark32_8192-4 3747 3120 -16.73%
Benchmark128_1-4 21.6 8.52 -60.56%
Benchmark128_2-4 22.0 8.86 -59.73%
Benchmark128_4-4 24.1 9.09 -62.28%
Benchmark128_8-4 27.3 9.96 -63.52%
Benchmark128_16-4 21.8 9.74 -55.32%
Benchmark128_32-4 24.3 11.6 -52.26%
Benchmark128_64-4 30.0 16.0 -46.67%
Benchmark128_128-4 42.2 26.9 -36.26%
Benchmark128_256-4 66.9 44.8 -33.03%
Benchmark128_512-4 116 83.5 -28.02%
Benchmark128_1024-4 219 164 -25.11%
Benchmark128_2048-4 417 319 -23.50%
Benchmark128_4096-4 812 631 -22.29%
Benchmark128_8192-4 1601 1254 -21.67%
tip, extended to 130kB:
benchmark old ns/op new ns/op delta
Benchmark32_1-4 5.94 5.01 -15.66%
Benchmark32_2-4 6.71 5.23 -22.06%
Benchmark32_4-4 6.43 5.07 -21.15%
Benchmark32_8-4 7.69 6.23 -18.99%
Benchmark32_16-4 10.1 8.39 -16.93%
Benchmark32_32-4 15.5 12.7 -18.06%
Benchmark32_64-4 27.9 23.1 -17.20%
Benchmark32_128-4 57.8 42.5 -26.47%
Benchmark32_256-4 118 94.3 -20.08%
Benchmark32_512-4 237 191 -19.41%
Benchmark32_1024-4 477 387 -18.87%
Benchmark32_2048-4 957 777 -18.81%
Benchmark32_4096-4 1913 1561 -18.40%
Benchmark32_8192-4 3827 3119 -18.50%
Benchmark128_1-4 17.0 8.85 -47.94%
Benchmark128_2-4 17.9 8.89 -50.34%
Benchmark128_4-4 18.7 9.45 -49.47%
Benchmark128_8-4 20.3 9.95 -50.99%
Benchmark128_16-4 17.2 10.4 -39.53%
Benchmark128_32-4 19.4 12.2 -37.11%
Benchmark128_64-4 24.4 16.2 -33.61%
Benchmark128_128-4 34.4 26.9 -21.80%
Benchmark128_256-4 54.6 44.9 -17.77%
Benchmark128_512-4 95.1 83.6 -12.09%
Benchmark128_1024-4 180 165 -8.33%
Benchmark128_2048-4 343 322 -6.12%
Benchmark128_4096-4 668 634 -5.09%
Benchmark128_8192-4 1319 1265 -4.09%
Benchmark128_16384-4 2626 2508 -4.49%
Benchmark128_32768-4 5233 5017 -4.13%
Benchmark128_65536-4 10445 10007 -4.19%
Benchmark128_131072-4 20897 19999 -4.30%
It looks like from 8k upward, the performance increase in the 128-bit benchmarks holds roughly constant at ~4.2%.
Maybe it is worth it to add the assembly?
Also, in the 32-bit benchmarks, the improvement levels off at ~18%.
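As a rough check on those two plateaus, the sketch below averages the per-row deltas over the large-input rows of the tip table. The row values are copied from that table; the `meanDelta` helper and the cut-offs (1024 bytes and up for 32-bit, 8192 bytes and up for 128-bit) are my own choices for illustration.

```go
package main

import "fmt"

// row holds one benchmark comparison (old vs. new ns/op).
type row struct{ oldNs, newNs float64 }

// meanDelta averages the percentage change across rows.
func meanDelta(rows []row) float64 {
	sum := 0.0
	for _, r := range rows {
		sum += (r.newNs - r.oldNs) / r.oldNs * 100
	}
	return sum / float64(len(rows))
}

// Tip results, Benchmark32 at 1024..8192 bytes.
var tip32 = []row{{477, 387}, {957, 777}, {1913, 1561}, {3827, 3119}}

// Tip results, Benchmark128 at 8192..131072 bytes.
var tip128 = []row{{1319, 1265}, {2626, 2508}, {5233, 5017}, {10445, 10007}, {20897, 19999}}

func main() {
	fmt.Printf("32-bit mean delta:  %.2f%%\n", meanDelta(tip32))
	fmt.Printf("128-bit mean delta: %.2f%%\n", meanDelta(tip128))
}
```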
@spaolacci any comments on the above perf differences, and on whether it is worth creating the PR again?
Each commit message explains the reason for that commit. Hand-written assembly increases performance most on small inputs (and only minimally on large inputs) due to better branch layout, register usage, and conditional checks. Large inputs are mostly unaffected because the loop the Go compiler generates is already pretty solid.
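To see the small-input versus large-input effect on another machine, size-parameterised benchmarks along these lines could be used. This is only a sketch of the general shape, not the benchmarks in this repository: it uses the standard library's 128-bit FNV-1a hash as a stand-in for the hash under discussion, and `benchSizes` and `benchHash` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"testing"
)

// benchSizes lists the input sizes used in the tables above:
// powers of two from 1 byte to 8192 bytes.
func benchSizes() []int {
	var sizes []int
	for n := 1; n <= 8192; n *= 2 {
		sizes = append(sizes, n)
	}
	return sizes
}

// benchHash returns a benchmark that hashes an n-byte buffer.
// FNV-1a (128-bit) is a stand-in, NOT the hash from this thread.
func benchHash(n int) func(b *testing.B) {
	return func(b *testing.B) {
		buf := make([]byte, n)
		b.SetBytes(int64(n)) // lets the result report throughput as well
		b.ResetTimer()
		for i := 0; i < b.N; i++ {
			h := fnv.New128a()
			h.Write(buf)
			_ = h.Sum(nil)
		}
	}
}

func main() {
	// testing.Benchmark runs a benchmark outside of "go test".
	for _, n := range benchSizes() {
		r := testing.Benchmark(benchHash(n))
		fmt.Printf("Benchmark128_%d\t%d ns/op\n", n, r.NsPerOp())
	}
}
```

With small buffers, the fixed per-call cost (setup, branches, finalization) dominates the measurement, which is where hand-tuned code has the most room to help; with large buffers the inner loop dominates.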