FiloSottile opened this issue 3 years ago
Here's a data point for the M1: you get a ~30% throughput decrease (go binary built with 1.17.2).
optimized:
❯ ~/htdocs/go/bin/go test -bench=.
goos: darwin
goarch: arm64
pkg: example/md5
BenchmarkHash8Bytes-8 10756149 110.7 ns/op 72.25 MB/s
BenchmarkHash64-8 6370714 191.2 ns/op 334.65 MB/s
BenchmarkHash128-8 4167610 288.5 ns/op 443.69 MB/s
BenchmarkHash256-8 2517034 475.5 ns/op 538.41 MB/s
BenchmarkHash512-8 1397494 870.0 ns/op 588.50 MB/s
BenchmarkHash1K-8 720913 1638 ns/op 625.02 MB/s
BenchmarkHash8K-8 96699 12347 ns/op 663.47 MB/s
BenchmarkHash1M-8 760 1578441 ns/op 664.31 MB/s
BenchmarkHash8M-8 86 12689257 ns/op 661.08 MB/s
BenchmarkHash8BytesUnaligned-8 10780814 111.5 ns/op 71.76 MB/s
BenchmarkHash1KUnaligned-8 719868 1635 ns/op 626.49 MB/s
BenchmarkHash8KUnaligned-8 96816 12342 ns/op 663.75 MB/s
PASS
ok example/md5 17.895s
generic:
❯ ~/htdocs/go/bin/go test -bench=.
goos: darwin
goarch: arm64
pkg: example/md5
BenchmarkHash8Bytes-8 6348889 157.7 ns/op 50.74 MB/s
BenchmarkHash64-8 4332680 276.5 ns/op 231.47 MB/s
BenchmarkHash128-8 2835121 422.6 ns/op 302.88 MB/s
BenchmarkHash256-8 1677094 714.6 ns/op 358.24 MB/s
BenchmarkHash512-8 897469 1300 ns/op 393.85 MB/s
BenchmarkHash1K-8 476378 2470 ns/op 414.65 MB/s
BenchmarkHash8K-8 63402 18837 ns/op 434.89 MB/s
BenchmarkHash1M-8 499 2394526 ns/op 437.91 MB/s
BenchmarkHash8M-8 57 19171187 ns/op 437.56 MB/s
BenchmarkHash8BytesUnaligned-8 7573111 158.0 ns/op 50.65 MB/s
BenchmarkHash1KUnaligned-8 476732 2478 ns/op 413.31 MB/s
BenchmarkHash8KUnaligned-8 63505 18832 ns/op 434.99 MB/s
PASS
ok example/md5 16.650s
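(For reference, the benchmark names above match the crypto/md5 benchmarks run out of a local example/md5 package; a minimal harness of the same shape would look roughly like the sketch below. The benchmarkSize helper, the package layout, and the subset of sizes are illustrative, and the unaligned variants are omitted.)

```go
// bench_test.go: a sketch of the benchmark shape, not the stdlib file.
package md5bench

import (
	"crypto/md5"
	"testing"
)

// buf is a single large buffer hashed in prefixes of various sizes.
var buf = make([]byte, 8<<20)

// sum is reused so the loop measures hashing rather than allocation.
var sum = make([]byte, 0, md5.Size)

// benchmarkSize hashes size bytes per iteration; SetBytes makes `go test`
// report the MB/s column seen in the output above.
func benchmarkSize(b *testing.B, size int) {
	b.SetBytes(int64(size))
	h := md5.New()
	for i := 0; i < b.N; i++ {
		h.Reset()
		h.Write(buf[:size])
		h.Sum(sum[:0])
	}
}

func BenchmarkHash8Bytes(b *testing.B) { benchmarkSize(b, 8) }
func BenchmarkHash1K(b *testing.B)     { benchmarkSize(b, 1<<10) }
func BenchmarkHash8K(b *testing.B)     { benchmarkSize(b, 8<<10) }
func BenchmarkHash8M(b *testing.B)     { benchmarkSize(b, 8<<20) }
```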
It's not currently within 80% of the assembly on this machine either:
goos: linux
goarch: amd64
pkg: std/crypto/md5
cpu: Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
name old speed new speed delta
Hash8Bytes-12 84.7MB/s ± 1% 61.7MB/s ± 1% -27.17% (p=0.008 n=5+5)
Hash64-12 402MB/s ± 1% 280MB/s ± 2% -30.32% (p=0.008 n=5+5)
Hash128-12 562MB/s ± 1% 386MB/s ± 1% -31.31% (p=0.008 n=5+5)
Hash256-12 699MB/s ± 1% 471MB/s ± 1% -32.62% (p=0.008 n=5+5)
Hash512-12 790MB/s ± 1% 531MB/s ± 1% -32.75% (p=0.008 n=5+5)
Hash1K-12 848MB/s ± 1% 556MB/s ± 4% -34.38% (p=0.008 n=5+5)
Hash8K-12 908MB/s ± 1% 602MB/s ± 0% -33.70% (p=0.008 n=5+5)
Hash1M-12 923MB/s ± 0% 607MB/s ± 1% -34.28% (p=0.008 n=5+5)
Hash8M-12 917MB/s ± 1% 607MB/s ± 1% -33.78% (p=0.008 n=5+5)
Hash8BytesUnaligned-12 85.0MB/s ± 1% 62.3MB/s ± 1% -26.78% (p=0.008 n=5+5)
Hash1KUnaligned-12 845MB/s ± 2% 563MB/s ± 1% -33.38% (p=0.008 n=5+5)
Hash8KUnaligned-12 912MB/s ± 0% 592MB/s ± 2% -35.17% (p=0.008 n=5+5)
I poked at this a bit as a palate cleanser after working on package net/syscall. There are several things going wrong here.
The most important by far is that the compiler takes a+b+c+d and generates a+(b+(c+d)). The hand-written assembly does (a+b)+(c+d). The assembly version is amenable to superscalar processing; the compiler's code is not. This will require compiler improvements. Filed #49331.
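To make that concrete, here's a standalone sketch of the two shapes (made-up names, not the actual md5block.go step). Note that simply parenthesizing the Go source doesn't guarantee the compiler keeps the balanced tree; getting that is what #49331 is about.

```go
package main

import (
	"fmt"
	"math/bits"
)

// stepSerial has the shape the compiler currently emits for a+f+x+k: the adds
// form one dependent chain, and the rotate can't start until the chain is done.
func stepSerial(a, b, f, x, k uint32, s int) uint32 {
	t := a + f + x + k
	return b + bits.RotateLeft32(t, s)
}

// stepBalanced is the shape the hand-written assembly uses: a+f and x+k are
// independent, so a superscalar core can issue them in the same cycle and the
// critical path through the step is one add shorter.
func stepBalanced(a, b, f, x, k uint32, s int) uint32 {
	t := (a + f) + (x + k)
	return b + bits.RotateLeft32(t, s)
}

func main() {
	// Same result either way; only the dependency structure differs.
	fmt.Println(stepSerial(1, 2, 3, 4, 5, 7), stepBalanced(1, 2, 3, 4, 5, 7))
}
```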
Also, the register allocator puts too many things in registers in the outer loop, so by the time it gets to the heavy numeric lifting, it doesn't have enough registers to work with. This is fixable by massaging the code a bit in some (admittedly weird) ways. See https://gist.github.com/josharian/42e66bf022f32da3da0a9b1bdf0a974b. Without fixing the other issues mentioned here, this rewrite by itself doesn't improve performance.
Also, the CSE pass combines a bunch of the xNN memory loads. These then require registers, and the register allocator decides to spill them rather than reload from memory. Filed #49332.
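Roughly the pattern in question (illustrative, not the real md5block.go source):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// The same message word is read inline in several different rounds. After CSE
// the duplicate loads collapse into one value that must stay live from its
// first use to its last, competing with a, b, c, d and the temporaries for
// registers; the allocator then spills it rather than re-loading the word.
func twoUsesOfWord0(a, d uint32, q []byte) (uint32, uint32) {
	a += binary.LittleEndian.Uint32(q[0:4]) + 0xd76aa478 // first use of word 0
	// ... many rounds in between ...
	d += binary.LittleEndian.Uint32(q[0:4]) + 0xe8c7b756 // same load, merged by CSE
	return a, d
}

func main() {
	fmt.Println(twoUsesOfWord0(0x67452301, 0x10325476, make([]byte, 64)))
}
```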
I'm done investigating this for now. The next step is to fix #49331, or find some clever way to write the code to force the compiler to generate more balanced computation trees.
Since this issue was filed, two new assembly implementations have been added (loong64 and riscv64).
crypto/md5 has a series of assembly cores, all looking pretty much the same and very similar to the generic code.
The generic code looks like something that should intrinsify well, and the assembly doesn't use any special platform instructions as far as I can tell.
https://github.com/golang/go/blob/2ebe77a2fda1ee9ff6fd9a3e08933ad1ebaea039/src/crypto/md5/md5block.go
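For anyone not clicking through, the rounds are roughly of this shape (paraphrased, not a verbatim copy):

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
)

// One round-1 step: the block function is 64 lines like this, split into four
// groups that differ only in the boolean mixing function, the constant, the
// message word, and the rotate amount. Adds, boolean ops, and a rotate are all
// the work there is; nothing needs a platform-specific instruction.
func round1Step(a, b, c, d, x, k uint32, s int) uint32 {
	return b + bits.RotateLeft32((((c^d)&b)^d)+a+x+k, s)
}

func main() {
	var block [64]byte
	x0 := binary.LittleEndian.Uint32(block[0:4])
	a, b, c, d := uint32(0x67452301), uint32(0xefcdab89), uint32(0x98badcfe), uint32(0x10325476)
	a = round1Step(a, b, c, d, x0, 0xd76aa478, 7)
	fmt.Printf("%#08x %#08x %#08x %#08x\n", a, b, c, d)
}
```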
MD5 is broken and obsolete, so ideally we'd reduce the complexity budget we spend on it over time.
If we can get the generic code within 80% of the performance of the assembly, I'd be happy to remove the cores.