FiloSottile opened this issue 3 years ago
Here's a data point for the M1: you get a ~30% throughput decrease (go binary built with 1.17.2).
optimized:
❯ ~/htdocs/go/bin/go test -bench=.
goos: darwin
goarch: arm64
pkg: example/md5
BenchmarkHash8Bytes-8 10756149 110.7 ns/op 72.25 MB/s
BenchmarkHash64-8 6370714 191.2 ns/op 334.65 MB/s
BenchmarkHash128-8 4167610 288.5 ns/op 443.69 MB/s
BenchmarkHash256-8 2517034 475.5 ns/op 538.41 MB/s
BenchmarkHash512-8 1397494 870.0 ns/op 588.50 MB/s
BenchmarkHash1K-8 720913 1638 ns/op 625.02 MB/s
BenchmarkHash8K-8 96699 12347 ns/op 663.47 MB/s
BenchmarkHash1M-8 760 1578441 ns/op 664.31 MB/s
BenchmarkHash8M-8 86 12689257 ns/op 661.08 MB/s
BenchmarkHash8BytesUnaligned-8 10780814 111.5 ns/op 71.76 MB/s
BenchmarkHash1KUnaligned-8 719868 1635 ns/op 626.49 MB/s
BenchmarkHash8KUnaligned-8 96816 12342 ns/op 663.75 MB/s
PASS
ok example/md5 17.895s
generic:
❯ ~/htdocs/go/bin/go test -bench=.
goos: darwin
goarch: arm64
pkg: example/md5
BenchmarkHash8Bytes-8 6348889 157.7 ns/op 50.74 MB/s
BenchmarkHash64-8 4332680 276.5 ns/op 231.47 MB/s
BenchmarkHash128-8 2835121 422.6 ns/op 302.88 MB/s
BenchmarkHash256-8 1677094 714.6 ns/op 358.24 MB/s
BenchmarkHash512-8 897469 1300 ns/op 393.85 MB/s
BenchmarkHash1K-8 476378 2470 ns/op 414.65 MB/s
BenchmarkHash8K-8 63402 18837 ns/op 434.89 MB/s
BenchmarkHash1M-8 499 2394526 ns/op 437.91 MB/s
BenchmarkHash8M-8 57 19171187 ns/op 437.56 MB/s
BenchmarkHash8BytesUnaligned-8 7573111 158.0 ns/op 50.65 MB/s
BenchmarkHash1KUnaligned-8 476732 2478 ns/op 413.31 MB/s
BenchmarkHash8KUnaligned-8 63505 18832 ns/op 434.99 MB/s
PASS
ok example/md5 16.650s
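(For reference, the benchmark names above match the crypto/md5 benchmarks run out of a local example/md5 package; a minimal harness of the same shape would look roughly like the sketch below. The benchmarkSize helper, the package layout, and the subset of sizes are illustrative, and the unaligned variants are omitted.)

```go
// bench_test.go: a sketch of the benchmark shape, not the stdlib file.
package md5bench

import (
	"crypto/md5"
	"testing"
)

// buf is a single large buffer hashed in prefixes of various sizes.
var buf = make([]byte, 8<<20)

// sum is reused so the loop measures hashing rather than allocation.
var sum = make([]byte, 0, md5.Size)

// benchmarkSize hashes size bytes per iteration; SetBytes makes `go test`
// report the MB/s column seen in the output above.
func benchmarkSize(b *testing.B, size int) {
	b.SetBytes(int64(size))
	h := md5.New()
	for i := 0; i < b.N; i++ {
		h.Reset()
		h.Write(buf[:size])
		h.Sum(sum[:0])
	}
}

func BenchmarkHash8Bytes(b *testing.B) { benchmarkSize(b, 8) }
func BenchmarkHash1K(b *testing.B)     { benchmarkSize(b, 1<<10) }
func BenchmarkHash8K(b *testing.B)     { benchmarkSize(b, 8<<10) }
func BenchmarkHash8M(b *testing.B)     { benchmarkSize(b, 8<<20) }
```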
It's not currently within 80% of the assembly on this machine either:
goos: linux
goarch: amd64
pkg: std/crypto/md5
cpu: Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
name old speed new speed delta
Hash8Bytes-12 84.7MB/s ± 1% 61.7MB/s ± 1% -27.17% (p=0.008 n=5+5)
Hash64-12 402MB/s ± 1% 280MB/s ± 2% -30.32% (p=0.008 n=5+5)
Hash128-12 562MB/s ± 1% 386MB/s ± 1% -31.31% (p=0.008 n=5+5)
Hash256-12 699MB/s ± 1% 471MB/s ± 1% -32.62% (p=0.008 n=5+5)
Hash512-12 790MB/s ± 1% 531MB/s ± 1% -32.75% (p=0.008 n=5+5)
Hash1K-12 848MB/s ± 1% 556MB/s ± 4% -34.38% (p=0.008 n=5+5)
Hash8K-12 908MB/s ± 1% 602MB/s ± 0% -33.70% (p=0.008 n=5+5)
Hash1M-12 923MB/s ± 0% 607MB/s ± 1% -34.28% (p=0.008 n=5+5)
Hash8M-12 917MB/s ± 1% 607MB/s ± 1% -33.78% (p=0.008 n=5+5)
Hash8BytesUnaligned-12 85.0MB/s ± 1% 62.3MB/s ± 1% -26.78% (p=0.008 n=5+5)
Hash1KUnaligned-12 845MB/s ± 2% 563MB/s ± 1% -33.38% (p=0.008 n=5+5)
Hash8KUnaligned-12 912MB/s ± 0% 592MB/s ± 2% -35.17% (p=0.008 n=5+5)
I poked at this a bit as a palate cleanser after working on package net/syscall. There are several things going wrong here.
The most important by far is that the compiler takes a+b+c+d and generates a+(b+(c+d)). The hand-written assembly does (a+b)+(c+d). The assembly version is amenable to superscalar processing; the compiler's code is not. This will require compiler improvements. Filed #49331.
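To make that concrete, here's a standalone sketch of the two shapes (made-up names, not the actual md5block.go step). Note that simply parenthesizing the Go source doesn't guarantee the compiler keeps the balanced tree; getting that is what #49331 is about.

```go
package main

import (
	"fmt"
	"math/bits"
)

// stepSerial has the shape the compiler currently emits for a+f+x+k: the adds
// form one dependent chain, and the rotate can't start until the chain is done.
func stepSerial(a, b, f, x, k uint32, s int) uint32 {
	t := a + f + x + k
	return b + bits.RotateLeft32(t, s)
}

// stepBalanced is the shape the hand-written assembly uses: a+f and x+k are
// independent, so a superscalar core can issue them in the same cycle and the
// critical path through the step is one add shorter.
func stepBalanced(a, b, f, x, k uint32, s int) uint32 {
	t := (a + f) + (x + k)
	return b + bits.RotateLeft32(t, s)
}

func main() {
	// Same result either way; only the dependency structure differs.
	fmt.Println(stepSerial(1, 2, 3, 4, 5, 7), stepBalanced(1, 2, 3, 4, 5, 7))
}
```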
Also, the register allocator puts too many things in registers in the outer loop, so by the time it gets to the heavy numeric lifting, it doesn't have enough registers to work with. This is fixable by massaging the code a bit in some (admittedly weird) ways. See https://gist.github.com/josharian/42e66bf022f32da3da0a9b1bdf0a974b. Without fixing the other issues mentioned here, this rewrite by itself doesn't improve performance.
Also, the CSE pass combines a bunch of the xNN memory loads. These then require registers, and the register allocator decides to spill them rather than reload from memory. Filed #49332.
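Roughly the pattern in question (illustrative, not the real md5block.go source):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// The same message word is read inline in several different rounds. After CSE
// the duplicate loads collapse into one value that must stay live from its
// first use to its last, competing with a, b, c, d and the temporaries for
// registers; the allocator then spills it rather than re-loading the word.
func twoUsesOfWord0(a, d uint32, q []byte) (uint32, uint32) {
	a += binary.LittleEndian.Uint32(q[0:4]) + 0xd76aa478 // first use of word 0
	// ... many rounds in between ...
	d += binary.LittleEndian.Uint32(q[0:4]) + 0xe8c7b756 // same load, merged by CSE
	return a, d
}

func main() {
	fmt.Println(twoUsesOfWord0(0x67452301, 0x10325476, make([]byte, 64)))
}
```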
I'm done investigating this for now. The next step is to fix #49331, or find some clever way to write the code to force the compiler to generate more balanced computation trees.
Since this issue was filed, two new assembly implementations have been added (loong64 and riscv64).
crypto/md5 has a series of assembly cores, all looking pretty much the same and very similar to the generic code.
The generic code looks like something that should intrinsify well, and the assembly doesn't use any special platform instructions as far as I can tell.
https://github.com/golang/go/blob/2ebe77a2fda1ee9ff6fd9a3e08933ad1ebaea039/src/crypto/md5/md5block.go
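For anyone not clicking through, the rounds are roughly of this shape (paraphrased, not a verbatim copy):

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
)

// One round-1 step: the block function is 64 lines like this, split into four
// groups that differ only in the boolean mixing function, the constant, the
// message word, and the rotate amount. Adds, boolean ops, and a rotate are all
// the work there is; nothing needs a platform-specific instruction.
func round1Step(a, b, c, d, x, k uint32, s int) uint32 {
	return b + bits.RotateLeft32((((c^d)&b)^d)+a+x+k, s)
}

func main() {
	var block [64]byte
	x0 := binary.LittleEndian.Uint32(block[0:4])
	a, b, c, d := uint32(0x67452301), uint32(0xefcdab89), uint32(0x98badcfe), uint32(0x10325476)
	a = round1Step(a, b, c, d, x0, 0xd76aa478, 7)
	fmt.Printf("%#08x %#08x %#08x %#08x\n", a, b, c, d)
}
```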
MD5 is broken and obsolete, so ideally we'd reduce the complexity budget we spend on it over time.
If we can get the generic code within 80% of the performance of the assembly, I'd be happy to remove the cores.