Closed goarchit closed 7 years ago
Per chance have you run speed comparisons vs. the other common implementation of blake2b?
No, there are no plans to do that @goarchit; happy to see some results from the community, if any.
One data point: hashing a 5.8GB tarball on a Jetway (SSE2 only, quad-core Celeron, SSD based; hopefully the low end of PCs these days, at least CPU-wise):
Minio (first run, nothing in OS cache): real 1m11.658s user 1m8.976s sys 0m2.680s
Codahale: real 1m8.054s user 1m6.611s sys 0m1.954s
Minio (second run, sanity check in case OS caching played a role): real 1m10.302s user 1m8.865s sys 0m1.976s
Code used:
package main

import (
    "fmt"
    "io"
    "os"

    "github.com/minio/blake2b-simd"
    // "github.com/codahale/blake2"
)

// ComputeBlake2B streams the file at filePath through BLAKE2b and
// returns the resulting digest.
func ComputeBlake2B(filePath string) ([]byte, error) {
    file, err := os.Open(filePath)
    if err != nil {
        return nil, err
    }
    defer file.Close()
    // hash := blake2.NewBlake2B()
    hash, err := blake2b.New(nil)
    if err != nil {
        return nil, err
    }
    if _, err := io.Copy(hash, file); err != nil {
        return nil, err
    }
    return hash.Sum(nil), nil
}

func main() {
    if b, err := ComputeBlake2B("/home/everything.tar"); err != nil {
        fmt.Printf("Err: %v\n", err)
    } else {
        fmt.Printf("/home/everything.tar Blake2b checksum is: %x\n", b)
    }
}
Sorry about the formatting; not sure why pasting the code between backticks (e.g. "insert code") didn't work.
Thanks for the benchmark, looks pretty comparable between the two.
Using a more powerful CPU with AVX or AVX2 would double the hashing speed.
Ran a similar test (6.2GB tarball) on a sample of the other extreme: a dual-socket Xeon with a combined 28 cores and 56 threads, a large L3 cache, AVX, and AVX2. Again off a (likely faster) SSD.
Minio: real 0m16.623s user 0m13.060s sys 0m2.640s
Codahale: real 0m29.193s user 0m24.800s sys 0m2.980s
Minio (sanity check): real 0m16.443s user 0m13.170s sys 0m2.560s
That server was moderately busy, so I idled it and reran:
Minio: real 0m10.744s user 0m9.480s sys 0m1.270s
Codahale: real 0m17.705s user 0m16.670s sys 0m1.040s
Bottom line: Minio is approximately the same speed on low-end CPUs and uses ~40% fewer resources on a high-end CPU system. Nice job! It's pretty clear which codebase our project will be using, barring any surprises when benchmarking other algorithms.
update: Just for grins I timed the golang.org/x/crypto/blake2b package: real 2m19.175s user 2m18.290s sys 0m0.990s
This seems odd; in fact x/crypto/blake2b is faster than the minio/blake2b-simd implementation in my tests. Are you using the latest commit from https://github.com/golang/crypto/tree/master/blake2b ?
// cc @aead
Did some tests locally; golang.org/x/crypto/blake2b comes in quite close to minio/blake2b-simd, or better.
No... I pulled from the path referenced...
Built using github.com/golang/crypto/blake2b (after a go get), idled the box to recreate the original environment, and got a time of: real 2m3.855s user 2m2.150s sys 0m1.780s
Very similar to what I previously reported.
@goarchit
Can you provide the benchmark results of go test -bench=Benchmark
on your machine(s)?
On my i7-6500U (linux/amd64) the codahale implementation is about 400 MB/s slower than x/crypto/blake2b, which is not surprising, because codahale relies on the "official" C implementation (using just SSE4.1) vs. the AVX2 code of x/crypto.
Maybe there are some other side effects, but on my machines I see the same as you.
Per your request:
BenchmarkWrite128-56 1000000 2030 ns/op 63.05 MB/s
BenchmarkWrite1K-56 100000 16031 ns/op 63.87 MB/s
BenchmarkSum128-56 1000000 2038 ns/op 62.79 MB/s
BenchmarkSum1K-56 100000 16517 ns/op 61.99 MB/s
PASS
ok github.com/golang/crypto/blake2b 7.705s
And just to be complete, the last CPU's output from cat /proc/cpuinfo:
processor : 55
vendor_id : GenuineIntel
cpu family : 6
model : 79
model name : Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
stepping : 1
microcode : 0xb00001d
cpu MHz : 2401.000
cache size : 35840 KB
physical id : 1
siblings : 28
core id : 14
cpu cores : 14
apicid : 61
initial apicid : 61
fpu : yes
fpu_exception : yes
cpuid level : 20
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb pln pts dtherm intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc
bugs :
bogomips : 4800.57
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
what is the Go version you are using @goarchit ?
The performance is really low; that's unexpected. On this kind of machine I would expect values >= 800 MB/s... The CPU supports AVX2, so the AVX2 code is probably used (as long as the Go version is >= 1.7).
Unfortunately I don't have access to this kind of CPU, so I cannot reproduce or test this on my own. Maybe (this is a guess) the LOAD_MSG_AVX2 macro causes some trouble (maybe the Xeon has problems with the AVX2-SSE mixing code), but as mentioned, I cannot test this...
@goarchit
Can you replace line 10 in blake2bAVX2_amd64.s with useAVX2 = supportsAVX2() && false and run the benchmark again?
@harshavardhana go version go1.7.4 linux/amd64
@aead Presume you meant blake2bAVX2_amd64.go... similar results:
BenchmarkWrite128-56 1000000 2028 ns/op 63.10 MB/s
BenchmarkWrite1K-56 100000 15764 ns/op 64.96 MB/s
BenchmarkSum128-56 1000000 2047 ns/op 62.53 MB/s
BenchmarkSum1K-56 100000 15836 ns/op 64.66 MB/s
PASS
ok github.com/golang/crypto/blake2b 7.616s
@aead E-mail me and I can work out access for you to the box. It's normally busy doing CPU mining and other things, but I can manually idle it for you while you test. We'd just need to coordinate and keep the windows reasonable (an hour or two at a time, not days or weeks).
Was able to reproduce this problem on our Xeon hardware
$ go test -bench=Benchmark
BenchmarkWrite128-12 500000 2243 ns/op 57.05 MB/s
BenchmarkWrite1K-12 100000 16225 ns/op 63.11 MB/s
BenchmarkSum128-12 1000000 2106 ns/op 60.77 MB/s
BenchmarkSum1K-12 100000 16287 ns/op 62.87 MB/s
PASS
ok golang.org/x/crypto/blake2b 6.885s
v/s blake2b-simd
BenchmarkComparisonMD5-12 1000 1741000 ns/op 602.28 MB/s
BenchmarkComparisonSHA1-12 1000 1393112 ns/op 752.69 MB/s
BenchmarkComparisonSHA256-12 500 3006689 ns/op 348.75 MB/s
BenchmarkComparisonSHA512-12 500 3440423 ns/op 304.78 MB/s
BenchmarkComparisonBlake2B-12 1000 1229740 ns/op 852.68 MB/s
BenchmarkSize64-12 3000000 654 ns/op 97.80 MB/s
BenchmarkSize128-12 3000000 511 ns/op 250.45 MB/s
BenchmarkSize1K-12 1000000 2014 ns/op 508.25 MB/s
BenchmarkSize8K-12 100000 10483 ns/op 781.43 MB/s
BenchmarkSize32K-12 50000 35959 ns/op 911.25 MB/s
BenchmarkSize128K-12 10000 143302 ns/op 914.65 MB/s
Looks like a bug @aead
@harshavardhana Switching to the AVX code doesn't change anything, so my guess about the AVX2-SSE switching seems to be wrong. Maybe you can run the benchmark for the different SIMD instruction sets (SSE4.1, AVX, AVX2)? I'll take a deeper look into the blake2b-simd assembly and search for significant differences in memory/cache accesses, etc.
Maybe creating an issue against Go is a good idea?
@goarchit
Okay, now I'm really puzzled...
Thanks for the offer, I'll do some investigation on this and come back to you! :smiley:
Plain blake2b without any optimizations using golang.org/x/crypto/blake2b
$ go test -bench .
BenchmarkWrite128-12 3000000 351 ns/op 364.08 MB/s
BenchmarkWrite1K-12 500000 2436 ns/op 420.20 MB/s
BenchmarkSum128-12 5000000 344 ns/op 371.54 MB/s
BenchmarkSum1K-12 500000 2467 ns/op 414.91 MB/s
PASS
ok golang.org/x/crypto/blake2b 6.115s
With SSE4
$ go test -bench .
SSE4
BenchmarkWrite128-12 5000000 236 ns/op 540.17 MB/s
BenchmarkWrite1K-12 1000000 1528 ns/op 669.91 MB/s
BenchmarkSum128-12 5000000 354 ns/op 360.71 MB/s
BenchmarkSum1K-12 1000000 1686 ns/op 607.12 MB/s
PASS
ok golang.org/x/crypto/blake2b 6.842s
With AVX
$ go test -bench .
AVX
BenchmarkWrite128-12 3000000 333 ns/op 383.37 MB/s
BenchmarkWrite1K-12 500000 2436 ns/op 420.35 MB/s
BenchmarkSum128-12 5000000 344 ns/op 371.52 MB/s
BenchmarkSum1K-12 500000 2592 ns/op 394.96 MB/s
PASS
ok golang.org/x/crypto/blake2b 6.091s
With AVX2
$ go test -bench .
AVX2
BenchmarkWrite128-12 500000 2084 ns/op 61.40 MB/s
BenchmarkWrite1K-12 100000 16227 ns/op 63.10 MB/s
BenchmarkSum128-12 1000000 2107 ns/op 60.74 MB/s
BenchmarkSum1K-12 100000 16312 ns/op 62.77 MB/s
PASS
ok golang.org/x/crypto/blake2b 7.647s
Did I miss something? One time you get < 70 MB/s for AVX2 and the next time > 350 MB/s?!
No, I edited it; it was a copy-paste problem :-)
Oh okay, no problem :-)
Pinged you on Gitter; send me your SSH keys and I can give you access to the Xeon machines we have. Thanks @goarchit for the offer.
@harshavardhana Okay, I get the same low performance for AVX2 but I see:
AVX
BenchmarkWrite128-12 10000000 216 ns/op 589.97 MB/s
BenchmarkWrite1K-12 1000000 1532 ns/op 668.30 MB/s
BenchmarkSum128-12 10000000 245 ns/op 520.60 MB/s
BenchmarkSum1K-12 1000000 1562 ns/op 655.39 MB/s
PASS
ok golang.org/x/crypto/blake2b 8.239s
If that's the "normal" AVX performance, the AVX2 code is the problem...
Replacing PINSRQ with VPINSRQ and adding VZEROUPPER doesn't change anything. I'll look at this again tomorrow...
A performance drop like this can very well be related to unintended saving of the upper half of the AVX registers, so my suggestion would indeed be to see if VZEROUPPER etc. can help (or, even better, would not be needed at all).
Okay, I found the problem: SSE-AVX transition penalties in the LOAD_MSG_AVX2 macro. The critical instructions are PINSRQ and MOVQ; I replaced them with VPINSRQ and VMOVQ and got:
AVX2
BenchmarkWrite128-12 5000000 263 ns/op 485.06 MB/s
BenchmarkWrite1K-12 1000000 1617 ns/op 633.24 MB/s
BenchmarkSum128-12 5000000 292 ns/op 438.23 MB/s
BenchmarkSum1K-12 1000000 1707 ns/op 599.73 MB/s
I'll replace them and optimize the code to get the "old" performance back... This affects the AVX code as well...
Thanks @aead - this looks good.
Nice, you really have to be careful to use the correct "flavour" of each instruction...
So, I modified the AVX and AVX2 code to use only AVX instructions and got this on the XEON machine:
BenchmarkWrite128-12 5000000 228 ns/op 561.39 MB/s
BenchmarkWrite1K-12 1000000 1131 ns/op 904.78 MB/s
BenchmarkSum128-12 10000000 222 ns/op 575.70 MB/s
BenchmarkSum1K-12 1000000 1198 ns/op 854.54 MB/s
I submitted the code for review and will send a PR to minio as soon as the code passes the review process.
@fwessels
I'm also a little surprised that the MOVOU (SSE) instruction in compressAvx_amd64.s doesn't lead to performance problems similar to the MOVQ instruction. But maybe the XMM, m128 moves are not affected by the register saving?!
@aead Good point; at least it doesn't on my MacBook Pro (which has AVX but lacks AVX2). On the server where we have AVX2 I never actually benchmarked the AVX version (it would have been slower anyway), so perhaps it would have shown similar behaviour.
@fwessels I tested the AVX implementation (blake2b-simd) on the XEON and got the expected performance - so MOVOU (for example) doesn't seem to lead to SSE-AVX transitions... Nice to know :-)
@aead Yes, that is good to know (otherwise I'd have probably noticed it, if you get it wrong there are typically significant penalties...)