Closed: harshavardhana closed this issue 8 years ago.
$ go run cpu.go
Name: Intel(R) Atom(TM) CPU C2750 @ 2.40GHz
PhysicalCores: 8
ThreadsPerCore: 1
LogicalCores: 8
Family 6 Model: 77
Features: CMOV,NX,MMX,MMXEXT,SSE,SSE2,SSE3,SSSE3,SSE4.1,SSE4.2,POPCNT,AESNI,CLMUL,RDRAND,ERMS,RDTSCP,CX16
Cacheline bytes: 64
L1 Data Cache: 24576 bytes
L1 Instruction Cache: 24576 bytes
L2 Cache: 1048576 bytes
L3 Cache: -1 bytes
We have Streaming SIMD Extensions
Do we have comparable benchmark tests for the (optimized) erasure code? It would be interesting to see if it shows a similar performance characteristic.
Will check klauspost implementation benchmarks.
reedsolomon seems to be fine.
minio@minio1:~/gopath/src/github.com/klauspost/reedsolomon$ go test -run=NONE -bench .
PASS
BenchmarkEncode10x2x10000-8 10000 212886 ns/op 469.73 MB/s
BenchmarkEncode100x20x10000-8 300 4628240 ns/op 216.06 MB/s
BenchmarkEncode17x3x1M-8 200 9053031 ns/op 1969.04 MB/s
BenchmarkEncode10x4x16M-8 10 140631888 ns/op 1192.99 MB/s
BenchmarkEncode5x2x1M-8 1000 1917165 ns/op 2734.70 MB/s
BenchmarkEncode10x2x1M-8 300 3796078 ns/op 2762.26 MB/s
BenchmarkEncode10x4x1M-8 200 7565793 ns/op 1385.94 MB/s
BenchmarkEncode50x20x1M-8 10 187103905 ns/op 280.21 MB/s
BenchmarkEncode17x3x16M-8 10 179422404 ns/op 1589.62 MB/s
BenchmarkVerify10x2x10000-8 10000 291451 ns/op 343.11 MB/s
BenchmarkVerify50x5x50000-8 500 3627155 ns/op 1378.49 MB/s
BenchmarkVerify10x2x1M-8 500 3667820 ns/op 2858.85 MB/s
BenchmarkVerify5x2x1M-8 500 2538149 ns/op 2065.63 MB/s
BenchmarkVerify10x4x1M-8 200 6931193 ns/op 1512.84 MB/s
BenchmarkVerify50x20x1M-8 10 138689922 ns/op 378.03 MB/s
BenchmarkVerify10x4x16M-8 10 157813687 ns/op 1063.10 MB/s
BenchmarkStreamEncode10x2x10000-8 100 19175486 ns/op 5.21 MB/s
BenchmarkStreamEncode100x20x10000-8 10 256689637 ns/op 3.90 MB/s
BenchmarkStreamEncode17x3x1M-8 30 45092722 ns/op 395.31 MB/s
BenchmarkStreamEncode10x4x16M-8 5 208510488 ns/op 804.62 MB/s
BenchmarkStreamEncode5x2x1M-8 100 15052590 ns/op 348.30 MB/s
BenchmarkStreamEncode10x2x1M-8 50 25483643 ns/op 411.47 MB/s
BenchmarkStreamEncode10x4x1M-8 50 31847162 ns/op 329.25 MB/s
BenchmarkStreamEncode50x20x1M-8 5 305278924 ns/op 171.74 MB/s
BenchmarkStreamEncode17x3x16M-8 5 300076889 ns/op 950.47 MB/s
BenchmarkStreamVerify10x2x10000-8 100 18801492 ns/op 5.32 MB/s
BenchmarkStreamVerify50x5x50000-8 20 75257335 ns/op 66.44 MB/s
BenchmarkStreamVerify10x2x1M-8 100 23891476 ns/op 438.89 MB/s
BenchmarkStreamVerify5x2x1M-8 100 14956139 ns/op 350.55 MB/s
BenchmarkStreamVerify10x4x1M-8 50 27164538 ns/op 386.01 MB/s
BenchmarkStreamVerify50x20x1M-8 10 127975468 ns/op 409.68 MB/s
BenchmarkStreamVerify10x4x16M-8 30 48226324 ns/op 3478.85 MB/s
ok github.com/klauspost/reedsolomon 175.058s
The interesting thing to observe here is that the native Go code gives low performance on the Atom, while the same code performs considerably better on my laptop.
Atom results with native Go.
$ go test -run=NONE -bench .
PASS
BenchmarkHash64-8 300000 4617 ns/op 13.86 MB/s
BenchmarkHash128-8 300000 4204 ns/op 30.45 MB/s
BenchmarkHash1K-8 50000 26182 ns/op 39.11 MB/s
BenchmarkHash8K-8 10000 198543 ns/op 41.26 MB/s
BenchmarkHash32K-8 2000 789564 ns/op 41.50 MB/s
BenchmarkHash128K-8 500 3151154 ns/op 41.59 MB/s
ok github.com/minio/blake2b-simd 9.912s
Atom results with SSE3.
$ go test -run=NONE -bench .
PASS
BenchmarkHash64-8 500000 3776 ns/op 16.95 MB/s
BenchmarkHash128-8 500000 3350 ns/op 38.20 MB/s
BenchmarkHash1K-8 100000 18978 ns/op 53.96 MB/s
BenchmarkHash8K-8 10000 139750 ns/op 58.62 MB/s
BenchmarkHash32K-8 3000 553842 ns/op 59.16 MB/s
BenchmarkHash128K-8 1000 2209830 ns/op 59.31 MB/s
ok github.com/minio/blake2b-simd 11.325s
Same tests on my laptop
Go native
$ go test -run=NONE -bench .
PASS
BenchmarkHash64-4 1000000 1330 ns/op 48.12 MB/s
BenchmarkHash128-4 1000000 1063 ns/op 120.36 MB/s
BenchmarkHash1K-4 200000 5532 ns/op 185.10 MB/s
BenchmarkHash8K-4 50000 36656 ns/op 223.48 MB/s
BenchmarkHash32K-4 10000 142393 ns/op 230.12 MB/s
BenchmarkHash128K-4 3000 578957 ns/op 226.39 MB/s
ok github.com/minio/blake2b-simd 9.041s
With SSE3
$ go test -run=NONE -bench .
PASS
BenchmarkHash64-4 2000000 654 ns/op 97.72 MB/s
BenchmarkHash128-4 3000000 864 ns/op 148.03 MB/s
BenchmarkHash1K-4 500000 3068 ns/op 333.70 MB/s
BenchmarkHash8K-4 100000 14011 ns/op 584.68 MB/s
BenchmarkHash32K-4 30000 77094 ns/op 425.04 MB/s
BenchmarkHash128K-4 10000 234364 ns/op 559.27 MB/s
ok github.com/minio/blake2b-simd 14.377s
blake2s gives better results than blake2b (twice as quick). This seems to indicate that there is some kind of performance penalty on Atom when executing SSE instructions with 64-bit operands.
minio@minio1:~/fwessels/BLAKE2/b2sum$ time ./b2sum -a blake2s 250mb.bin
7e690b5e9dbebab45f4267809e93ec54da2003c7e45063688f2abcc4a7bdc11e 250mb.bin
real 0m3.118s
user 0m3.004s
sys 0m0.108s
minio@minio1:~/fwessels/BLAKE2/b2sum$ time ./b2sum -a blake2b 250mb.bin
82342a038870eb9579793c5a68f94883299c8eb1c9eff089a720f81f3d4baf03cfc30d68fd72981e112b4b69841e94fd01d1119e20e2fcaa8eea72353996e724 250mb.bin
real 0m6.144s
user 0m6.008s
sys 0m0.128s
Using the github.com/codahale/blake2b implementation shows a similar performance characteristic:
$ go run main-codahale.go
Initializing buffer...
Starting measurements...
1a6fba54d0e8b90c8a0ee0dd60966eccc48cdd575a692f42ff58030e5bc3a5430443724615a384426e23018cc58cf057a57945f2d75a602e9e1d546d0026b817
blake2b(coda-64) 17.349561138s
1a6fba54d0e8b90c8a0ee0dd60966eccc48cdd575a692f42ff58030e5bc3a5430443724615a384426e23018cc58cf057a57945f2d75a602e9e1d546d0026b817
blake2b( go) 25.807073517s
1a6fba54d0e8b90c8a0ee0dd60966eccc48cdd575a692f42ff58030e5bc3a5430443724615a384426e23018cc58cf057a57945f2d75a602e9e1d546d0026b817
blake2b(coda-64) 17.347692516s
1a6fba54d0e8b90c8a0ee0dd60966eccc48cdd575a692f42ff58030e5bc3a5430443724615a384426e23018cc58cf057a57945f2d75a602e9e1d546d0026b817
blake2b( go) 25.807968759s
1a6fba54d0e8b90c8a0ee0dd60966eccc48cdd575a692f42ff58030e5bc3a5430443724615a384426e23018cc58cf057a57945f2d75a602e9e1d546d0026b817
blake2b(coda-64) 17.347710469s
1a6fba54d0e8b90c8a0ee0dd60966eccc48cdd575a692f42ff58030e5bc3a5430443724615a384426e23018cc58cf057a57945f2d75a602e9e1d546d0026b817
blake2b( go) 25.808912915s
1a6fba54d0e8b90c8a0ee0dd60966eccc48cdd575a692f42ff58030e5bc3a5430443724615a384426e23018cc58cf057a57945f2d75a602e9e1d546d0026b817
blake2b(coda-64) 17.348954836s
1a6fba54d0e8b90c8a0ee0dd60966eccc48cdd575a692f42ff58030e5bc3a5430443724615a384426e23018cc58cf057a57945f2d75a602e9e1d546d0026b817
blake2b( go) 25.80698756s
1a6fba54d0e8b90c8a0ee0dd60966eccc48cdd575a692f42ff58030e5bc3a5430443724615a384426e23018cc58cf057a57945f2d75a602e9e1d546d0026b817
blake2b(coda-64) 17.347304175s
1a6fba54d0e8b90c8a0ee0dd60966eccc48cdd575a692f42ff58030e5bc3a5430443724615a384426e23018cc58cf057a57945f2d75a602e9e1d546d0026b817
blake2b( go) 25.807094439s
Speed increase: 1.5
Interestingly enough, the blake2b reference implementation (plain C) on Atom gives better performance than its SSE-optimized counterpart.
$ time ./b2sum -a blake2b 250mb.bin
82342a038870eb9579793c5a68f94883299c8eb1c9eff089a720f81f3d4baf03cfc30d68fd72981e112b4b69841e94fd01d1119e20e2fcaa8eea72353996e724 250mb.bin
real 0m2.279s (108MB/sec)
$ time ./b2sum-sse -a blake2b 250mb.bin
82342a038870eb9579793c5a68f94883299c8eb1c9eff089a720f81f3d4baf03cfc30d68fd72981e112b4b69841e94fd01d1119e20e2fcaa8eea72353996e724 250mb.bin
real 0m6.146s (40MB/sec)
In Go, blake2b-simd is still getting a 1.5x speedup (59 MB/s) compared to the pure Go (non-assembly) version (40 MB/s).
However, the test above indicates that we could potentially double this by examining where the blake2b reference implementation gets its performance from and (re)implementing it in Go assembly. The question is whether this is worth the effort.
The conclusion seems to be that the SSE implementation on Atom helps to some extent but is not up to par with the implementations on other chipsets.
Closing the issue for now. We can always spend the time to get up to par with the plain C reference implementation when the need arises (2x improvement).