Closed: harshavardhana closed this issue 8 years ago.
$ go run cpu.go
Name: Intel(R) Atom(TM) CPU C2750 @ 2.40GHz
PhysicalCores: 8
ThreadsPerCore: 1
LogicalCores: 8
Family 6 Model: 77
Features: CMOV,NX,MMX,MMXEXT,SSE,SSE2,SSE3,SSSE3,SSE4.1,SSE4.2,POPCNT,AESNI,CLMUL,RDRAND,ERMS,RDTSCP,CX16
Cacheline bytes: 64
L1 Data Cache: 24576 bytes
L1 Instruction Cache: 24576 bytes
L2 Cache: 1048576 bytes
L3 Cache: -1 bytes
We have Streaming SIMD Extensions
Do we have comparable benchmark tests for the (optimized) erasure code? It would be interesting to see if it shows a similar performance characteristic.
Will check klauspost implementation benchmarks.
reedsolomon seems to be fine.
minio@minio1:~/gopath/src/github.com/klauspost/reedsolomon$ go test -run=NONE -bench .
PASS
BenchmarkEncode10x2x10000-8 10000 212886 ns/op 469.73 MB/s
BenchmarkEncode100x20x10000-8 300 4628240 ns/op 216.06 MB/s
BenchmarkEncode17x3x1M-8 200 9053031 ns/op 1969.04 MB/s
BenchmarkEncode10x4x16M-8 10 140631888 ns/op 1192.99 MB/s
BenchmarkEncode5x2x1M-8 1000 1917165 ns/op 2734.70 MB/s
BenchmarkEncode10x2x1M-8 300 3796078 ns/op 2762.26 MB/s
BenchmarkEncode10x4x1M-8 200 7565793 ns/op 1385.94 MB/s
BenchmarkEncode50x20x1M-8 10 187103905 ns/op 280.21 MB/s
BenchmarkEncode17x3x16M-8 10 179422404 ns/op 1589.62 MB/s
BenchmarkVerify10x2x10000-8 10000 291451 ns/op 343.11 MB/s
BenchmarkVerify50x5x50000-8 500 3627155 ns/op 1378.49 MB/s
BenchmarkVerify10x2x1M-8 500 3667820 ns/op 2858.85 MB/s
BenchmarkVerify5x2x1M-8 500 2538149 ns/op 2065.63 MB/s
BenchmarkVerify10x4x1M-8 200 6931193 ns/op 1512.84 MB/s
BenchmarkVerify50x20x1M-8 10 138689922 ns/op 378.03 MB/s
BenchmarkVerify10x4x16M-8 10 157813687 ns/op 1063.10 MB/s
BenchmarkStreamEncode10x2x10000-8 100 19175486 ns/op 5.21 MB/s
BenchmarkStreamEncode100x20x10000-8 10 256689637 ns/op 3.90 MB/s
BenchmarkStreamEncode17x3x1M-8 30 45092722 ns/op 395.31 MB/s
BenchmarkStreamEncode10x4x16M-8 5 208510488 ns/op 804.62 MB/s
BenchmarkStreamEncode5x2x1M-8 100 15052590 ns/op 348.30 MB/s
BenchmarkStreamEncode10x2x1M-8 50 25483643 ns/op 411.47 MB/s
BenchmarkStreamEncode10x4x1M-8 50 31847162 ns/op 329.25 MB/s
BenchmarkStreamEncode50x20x1M-8 5 305278924 ns/op 171.74 MB/s
BenchmarkStreamEncode17x3x16M-8 5 300076889 ns/op 950.47 MB/s
BenchmarkStreamVerify10x2x10000-8 100 18801492 ns/op 5.32 MB/s
BenchmarkStreamVerify50x5x50000-8 20 75257335 ns/op 66.44 MB/s
BenchmarkStreamVerify10x2x1M-8 100 23891476 ns/op 438.89 MB/s
BenchmarkStreamVerify5x2x1M-8 100 14956139 ns/op 350.55 MB/s
BenchmarkStreamVerify10x4x1M-8 50 27164538 ns/op 386.01 MB/s
BenchmarkStreamVerify50x20x1M-8 10 127975468 ns/op 409.68 MB/s
BenchmarkStreamVerify10x4x16M-8 30 48226324 ns/op 3478.85 MB/s
ok github.com/klauspost/reedsolomon 175.058s
The interesting thing to observe here is that the native Go code gives low performance on the Atom, while the same code performs considerably better on my laptop.
Atom results with native Go.
$ go test -run=NONE -bench .
PASS
BenchmarkHash64-8 300000 4617 ns/op 13.86 MB/s
BenchmarkHash128-8 300000 4204 ns/op 30.45 MB/s
BenchmarkHash1K-8 50000 26182 ns/op 39.11 MB/s
BenchmarkHash8K-8 10000 198543 ns/op 41.26 MB/s
BenchmarkHash32K-8 2000 789564 ns/op 41.50 MB/s
BenchmarkHash128K-8 500 3151154 ns/op 41.59 MB/s
ok github.com/minio/blake2b-simd 9.912s
Atom results with SSE3.
$ go test -run=NONE -bench .
PASS
BenchmarkHash64-8 500000 3776 ns/op 16.95 MB/s
BenchmarkHash128-8 500000 3350 ns/op 38.20 MB/s
BenchmarkHash1K-8 100000 18978 ns/op 53.96 MB/s
BenchmarkHash8K-8 10000 139750 ns/op 58.62 MB/s
BenchmarkHash32K-8 3000 553842 ns/op 59.16 MB/s
BenchmarkHash128K-8 1000 2209830 ns/op 59.31 MB/s
ok github.com/minio/blake2b-simd 11.325s
Same tests on my laptop
Go native
$ go test -run=NONE -bench .
PASS
BenchmarkHash64-4 1000000 1330 ns/op 48.12 MB/s
BenchmarkHash128-4 1000000 1063 ns/op 120.36 MB/s
BenchmarkHash1K-4 200000 5532 ns/op 185.10 MB/s
BenchmarkHash8K-4 50000 36656 ns/op 223.48 MB/s
BenchmarkHash32K-4 10000 142393 ns/op 230.12 MB/s
BenchmarkHash128K-4 3000 578957 ns/op 226.39 MB/s
ok github.com/minio/blake2b-simd 9.041s
With SSE3
$ go test -run=NONE -bench .
PASS
BenchmarkHash64-4 2000000 654 ns/op 97.72 MB/s
BenchmarkHash128-4 3000000 864 ns/op 148.03 MB/s
BenchmarkHash1K-4 500000 3068 ns/op 333.70 MB/s
BenchmarkHash8K-4 100000 14011 ns/op 584.68 MB/s
BenchmarkHash32K-4 30000 77094 ns/op 425.04 MB/s
BenchmarkHash128K-4 10000 234364 ns/op 559.27 MB/s
ok github.com/minio/blake2b-simd 14.377s
blake2s gives better results than blake2b (twice as quick). This seems to indicate that there is some kind of performance penalty on Atom when executing SSE instructions with 64-bit operands.
minio@minio1:~/fwessels/BLAKE2/b2sum$ time ./b2sum -a blake2s 250mb.bin
7e690b5e9dbebab45f4267809e93ec54da2003c7e45063688f2abcc4a7bdc11e 250mb.bin
real 0m3.118s
user 0m3.004s
sys 0m0.108s
minio@minio1:~/fwessels/BLAKE2/b2sum$ time ./b2sum -a blake2b 250mb.bin
82342a038870eb9579793c5a68f94883299c8eb1c9eff089a720f81f3d4baf03cfc30d68fd72981e112b4b69841e94fd01d1119e20e2fcaa8eea72353996e724 250mb.bin
real 0m6.144s
user 0m6.008s
sys 0m0.128s
Using the github.com/codahale/blake2b implementation shows a similar performance characteristic:
$ go run main-codahale.go
Initializing buffer...
Starting measurements...
1a6fba54d0e8b90c8a0ee0dd60966eccc48cdd575a692f42ff58030e5bc3a5430443724615a384426e23018cc58cf057a57945f2d75a602e9e1d546d0026b817
blake2b(coda-64) 17.349561138s
1a6fba54d0e8b90c8a0ee0dd60966eccc48cdd575a692f42ff58030e5bc3a5430443724615a384426e23018cc58cf057a57945f2d75a602e9e1d546d0026b817
blake2b( go) 25.807073517s
1a6fba54d0e8b90c8a0ee0dd60966eccc48cdd575a692f42ff58030e5bc3a5430443724615a384426e23018cc58cf057a57945f2d75a602e9e1d546d0026b817
blake2b(coda-64) 17.347692516s
1a6fba54d0e8b90c8a0ee0dd60966eccc48cdd575a692f42ff58030e5bc3a5430443724615a384426e23018cc58cf057a57945f2d75a602e9e1d546d0026b817
blake2b( go) 25.807968759s
1a6fba54d0e8b90c8a0ee0dd60966eccc48cdd575a692f42ff58030e5bc3a5430443724615a384426e23018cc58cf057a57945f2d75a602e9e1d546d0026b817
blake2b(coda-64) 17.347710469s
1a6fba54d0e8b90c8a0ee0dd60966eccc48cdd575a692f42ff58030e5bc3a5430443724615a384426e23018cc58cf057a57945f2d75a602e9e1d546d0026b817
blake2b( go) 25.808912915s
1a6fba54d0e8b90c8a0ee0dd60966eccc48cdd575a692f42ff58030e5bc3a5430443724615a384426e23018cc58cf057a57945f2d75a602e9e1d546d0026b817
blake2b(coda-64) 17.348954836s
1a6fba54d0e8b90c8a0ee0dd60966eccc48cdd575a692f42ff58030e5bc3a5430443724615a384426e23018cc58cf057a57945f2d75a602e9e1d546d0026b817
blake2b( go) 25.80698756s
1a6fba54d0e8b90c8a0ee0dd60966eccc48cdd575a692f42ff58030e5bc3a5430443724615a384426e23018cc58cf057a57945f2d75a602e9e1d546d0026b817
blake2b(coda-64) 17.347304175s
1a6fba54d0e8b90c8a0ee0dd60966eccc48cdd575a692f42ff58030e5bc3a5430443724615a384426e23018cc58cf057a57945f2d75a602e9e1d546d0026b817
blake2b( go) 25.807094439s
Speed increase: 1.5
Interestingly enough, the blake2b reference implementation (plain C) on Atom gives better performance than its SSE-optimized counterpart.
$ time ./b2sum -a blake2b 250mb.bin
82342a038870eb9579793c5a68f94883299c8eb1c9eff089a720f81f3d4baf03cfc30d68fd72981e112b4b69841e94fd01d1119e20e2fcaa8eea72353996e724 250mb.bin
real 0m2.279s (108MB/sec)
$ time ./b2sum-sse -a blake2b 250mb.bin
82342a038870eb9579793c5a68f94883299c8eb1c9eff089a720f81f3d4baf03cfc30d68fd72981e112b4b69841e94fd01d1119e20e2fcaa8eea72353996e724 250mb.bin
real 0m6.146s (40MB/sec)
In Go, blake2b-simd is still getting a 1.5x speedup (59 MB/s) compared to the pure Go (non-assembly) version (40 MB/s).
However, the test above indicates that we could potentially double this by examining where the blake2b reference implementation gets its performance from and (re)implementing it in Go assembly. The question is whether this is worth the effort.
The conclusion seems to be that the SSE implementation on Atom helps to some extent but is not up to par with the implementations on other chipsets.
Closing the issue for now. We can always spend the time to get up to par with the plain C reference implementation when the need arises (2x improvement).