minio / blake2b-simd

Fast hashing using pure Go implementation of BLAKE2b with SIMD instructions
Apache License 2.0
253 stars, 31 forks

Speed vs. "Optimized" codahale version? #24

Closed: goarchit closed this issue 7 years ago

goarchit commented 7 years ago

Per chance have you run speed comparisons vs. the other common implementation of blake2b?

harshavardhana commented 7 years ago

Per chance have you run speed comparisons vs. the other common implementation of blake2b?

No there are no intentions of doing that @goarchit - happy to see some results from community if any.

goarchit commented 7 years ago

One data point: Hashing a 5.8GB tarball on a Jetway (SSE2 only, quad-core Celeron, SSD-based - hopefully the low end of PCs these days, at least CPU-wise):

Minio (first run, nothing in the OS cache): real 1m11.658s user 1m8.976s sys 0m2.680s

Codahale: real 1m8.054s user 1m6.611s sys 0m1.954s

Minio (second run, sanity check in case OS caching played a role): real 1m10.302s user 1m8.865s sys 0m1.976s

Code used:

package main

import (
        "fmt"
        "github.com/minio/blake2b-simd"
//      "github.com/codahale/blake2"
        "io"
        "os"
)

func ComputeBlake2B(filePath string) ([]byte, error) {
        var result []byte
        file, err := os.Open(filePath)
        if err != nil {
                return result, err
        }
        defer file.Close()

//      hash := blake2.NewBlake2B()
        hash, err := blake2b.New(nil)
        if err != nil {
                return result, err
        }
        if _, err := io.Copy(hash, file); err != nil {
                return result, err
        }

        return hash.Sum(result), nil
}

func main() {
        if b, err := ComputeBlake2B("/home/everything.tar"); err != nil {
                fmt.Printf("Err: %v", err)
        } else {
                fmt.Printf("/home/everything.tar Blake2b checksum is: %x", b)
        }
}
goarchit commented 7 years ago

Sorry about the formatting, not sure why pasting the code between ticks (e.g. "insert code") didn't work.

fwessels commented 7 years ago

Thanks for the benchmark, looks pretty comparable between the two.

Using a more powerful CPU with AVX or AVX2 would roughly double the hashing speed.

goarchit commented 7 years ago

Ran a similar test (6.2GB tarball) on a sample of the other extreme: A dual socket Xeon with a combined 28 cores and 56 thread capability, L3 cache, AVX and AVX2. Again off a (likely faster) SSD.

Minio: real 0m16.623s user 0m13.060s sys 0m2.640s

Codahale: real 0m29.193s user 0m24.800s sys 0m2.980s

Minio (sanity check): real 0m16.443s user 0m13.170s sys 0m2.560s

That server was moderately busy, so I idled it and reran:

Minio: real 0m10.744s user 0m9.480s sys 0m1.270s

Codahale: real 0m17.705s user 0m16.670s sys 0m1.040s

Bottom line: Minio is approximately the same speed on low-end CPUs, and takes ~40% less time on a high-end CPU system. Nice job! It's pretty clear which codebase our project will be using, short of any surprises when benchmarking some other algorithms.

update: Just for grins I timed the golang.org/x/crypto/blake2b package: real 2m19.175s user 2m18.290s sys 0m0.990s

harshavardhana commented 7 years ago

update: Just for grins I timed the golang.org/x/crypto/blake2b package: real 2m19.175s user 2m18.290s sys 0m0.990s

This seems odd - in fact, crypto/blake2b is faster than the minio/blake2b-simd implementation. Are you using the latest commit from https://github.com/golang/crypto/tree/master/blake2b ?

// cc @aead

harshavardhana commented 7 years ago

Did some tests locally: golang.org/x/crypto/blake2b is quite close to minio/blake2b-simd, or better.

goarchit commented 7 years ago

No... I pulled from the path referenced...

Built using github.com/golang/crypto/blake2b (after a go get), idled the box to recreate the original environment, and got a time of: real 2m3.855s user 2m2.150s sys 0m1.780s

Very similar to what I previously reported.

aead commented 7 years ago

@goarchit Can you provide the benchmark results of go test -bench=Benchmark on your machine(s)?

On my i7-6500U (linux/amd64) the codahale implementation is about 400 MB/s slower than x/crypto/blake2b - which is not surprising, because codahale relies on the "official" C implementation (using "just" SSE4.1) vs. the AVX2 code of x/crypto.

Did some tests locally: golang.org/x/crypto/blake2b is quite close to minio/blake2b-simd, or better.

Maybe there are some other side effects, but on my machines I see the same as you.

goarchit commented 7 years ago

Per your request:

BenchmarkWrite128-56     1000000          2030 ns/op          63.05 MB/s
BenchmarkWrite1K-56       100000         16031 ns/op          63.87 MB/s
BenchmarkSum128-56       1000000          2038 ns/op          62.79 MB/s
BenchmarkSum1K-56         100000         16517 ns/op          61.99 MB/s
PASS
ok      github.com/golang/crypto/blake2b        7.705s

And just to be complete, here is the last CPU's output from cat /proc/cpuinfo:

processor       : 55
vendor_id       : GenuineIntel
cpu family      : 6
model           : 79
model name      : Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
stepping        : 1
microcode       : 0xb00001d
cpu MHz         : 2401.000
cache size      : 35840 KB
physical id     : 1
siblings        : 28
core id         : 14
cpu cores       : 14
apicid          : 61
initial apicid  : 61
fpu             : yes
fpu_exception   : yes
cpuid level     : 20
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb pln pts dtherm intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc
bugs            :
bogomips        : 4800.57
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

harshavardhana commented 7 years ago

BenchmarkWrite128-56     1000000          2030 ns/op          63.05 MB/s
BenchmarkWrite1K-56       100000         16031 ns/op          63.87 MB/s
BenchmarkSum128-56       1000000          2038 ns/op          62.79 MB/s
BenchmarkSum1K-56         100000         16517 ns/op          61.99 MB/s

what is the Go version you are using @goarchit ?

aead commented 7 years ago

BenchmarkWrite128-56     1000000          2030 ns/op          63.05 MB/s
BenchmarkWrite1K-56       100000         16031 ns/op          63.87 MB/s
BenchmarkSum128-56       1000000          2038 ns/op          62.79 MB/s
BenchmarkSum1K-56         100000         16517 ns/op          61.99 MB/s

That performance is unexpectedly low. On this kind of machine I would expect values >= 800 MB/s... The CPU supports AVX2, so the AVX2 code path is probably being used (as long as the Go version is >= 1.7).

Unfortunately I don't have access to this kind of CPU - so I cannot reproduce or test this on my own. Maybe (this is a guess) the LOAD_MSG_AVX2 macro causes some trouble (perhaps the Xeon has problems with the AVX2-SSE mixing code), but as mentioned, I cannot test this...

@goarchit Can you replace line 10 in blake2bAVX2_amd64.s with useAVX2 = supportsAVX2() && false and run the benchmark again?

goarchit commented 7 years ago

@harshavardhana go version go1.7.4 linux/amd64

@aead Presume you meant blake2bAVX2_amd64.go... similar results:

BenchmarkWrite128-56     1000000          2028 ns/op          63.10 MB/s
BenchmarkWrite1K-56       100000         15764 ns/op          64.96 MB/s
BenchmarkSum128-56       1000000          2047 ns/op          62.53 MB/s
BenchmarkSum1K-56         100000         15836 ns/op          64.66 MB/s
PASS
ok      github.com/golang/crypto/blake2b        7.616s

@aead E-mail me and I can work out access to the box for you. It's normally busy doing some CPU mining and other stuff, but I can manually idle it while you test. We would just need to coordinate and keep the windows reasonable (an hour or two at a time, not days or weeks).

harshavardhana commented 7 years ago

Was able to reproduce this problem on our Xeon hardware

$ go test -bench=Benchmark
BenchmarkWrite128-12              500000              2243 ns/op          57.05 MB/s
BenchmarkWrite1K-12               100000             16225 ns/op          63.11 MB/s
BenchmarkSum128-12               1000000              2106 ns/op          60.77 MB/s
BenchmarkSum1K-12                 100000             16287 ns/op          62.87 MB/s
PASS
ok      golang.org/x/crypto/blake2b     6.885s

vs. blake2b-simd

BenchmarkComparisonMD5-12                   1000           1741000 ns/op         602.28 MB/s
BenchmarkComparisonSHA1-12                  1000           1393112 ns/op         752.69 MB/s
BenchmarkComparisonSHA256-12                 500           3006689 ns/op         348.75 MB/s
BenchmarkComparisonSHA512-12                 500           3440423 ns/op         304.78 MB/s
BenchmarkComparisonBlake2B-12               1000           1229740 ns/op         852.68 MB/s
BenchmarkSize64-12                       3000000               654 ns/op          97.80 MB/s
BenchmarkSize128-12                      3000000               511 ns/op         250.45 MB/s
BenchmarkSize1K-12                       1000000              2014 ns/op         508.25 MB/s
BenchmarkSize8K-12                        100000             10483 ns/op         781.43 MB/s
BenchmarkSize32K-12                        50000             35959 ns/op         911.25 MB/s
BenchmarkSize128K-12                       10000            143302 ns/op         914.65 MB/s

Looks like a bug @aead

aead commented 7 years ago

@harshavardhana Switching to AVX code doesn't change anything - so my guess about the AVX2-SSE switching seems to be wrong. Maybe you can run the benchmark for the different SIMD-instructions (SSE4.1, AVX, AVX2)? I'll take a deeper look into blake2b-simd assembly and search for significant differences in mem/cache accesses, etc.

Maybe creating an issue at golang is a good idea?!

@goarchit
Okay, now I'm really drawing a blank... Thanks for the offer, I'll do some investigation on this and come back to you! :smiley:

harshavardhana commented 7 years ago

Plain blake2b without any optimizations using golang.org/x/crypto/blake2b

$ go test -bench .
BenchmarkWrite128-12             3000000               351 ns/op         364.08 MB/s
BenchmarkWrite1K-12               500000              2436 ns/op         420.20 MB/s
BenchmarkSum128-12               5000000               344 ns/op         371.54 MB/s
BenchmarkSum1K-12                 500000              2467 ns/op         414.91 MB/s
PASS
ok      golang.org/x/crypto/blake2b     6.115s

With SSE4

$ go test -bench .
SSE4
BenchmarkWrite128-12             5000000               236 ns/op         540.17 MB/s
BenchmarkWrite1K-12              1000000              1528 ns/op         669.91 MB/s
BenchmarkSum128-12               5000000               354 ns/op         360.71 MB/s
BenchmarkSum1K-12                1000000              1686 ns/op         607.12 MB/s
PASS
ok      golang.org/x/crypto/blake2b     6.842s

With AVX

$ go test -bench .
AVX
BenchmarkWrite128-12             3000000               333 ns/op         383.37 MB/s
BenchmarkWrite1K-12               500000              2436 ns/op         420.35 MB/s
BenchmarkSum128-12               5000000               344 ns/op         371.52 MB/s
BenchmarkSum1K-12                 500000              2592 ns/op         394.96 MB/s
PASS
ok      golang.org/x/crypto/blake2b     6.091s

With AVX2

$ go test -bench .
AVX2
BenchmarkWrite128-12              500000              2084 ns/op          61.40 MB/s
BenchmarkWrite1K-12               100000             16227 ns/op          63.10 MB/s
BenchmarkSum128-12               1000000              2107 ns/op          60.74 MB/s
BenchmarkSum1K-12                 100000             16312 ns/op          62.77 MB/s
PASS
ok      golang.org/x/crypto/blake2b     7.647s
aead commented 7 years ago

Did I miss something? One time you get < 70 MB/s for AVX2 and the next time you get > 350 MB/s?!

harshavardhana commented 7 years ago

Did I miss something? One time you get < 70 MB/s for AVX2 and the next time you get > 350 MB/s?!

No, I edited it - it was a copy-paste problem :-)

aead commented 7 years ago

Oh okay, no problem :-)

harshavardhana commented 7 years ago

Oh okay, no problem :-)

Pinged you on Gitter - send me your SSH keys and I can give you access to the Xeon machines we have. Thanks @goarchit for the offer.

aead commented 7 years ago

@harshavardhana Okay, I get the same low performance for AVX2 but I see:

AVX
BenchmarkWrite128-12        10000000           216 ns/op     589.97 MB/s
BenchmarkWrite1K-12          1000000          1532 ns/op     668.30 MB/s
BenchmarkSum128-12          10000000           245 ns/op     520.60 MB/s
BenchmarkSum1K-12            1000000          1562 ns/op     655.39 MB/s
PASS
ok      golang.org/x/crypto/blake2b 8.239s

If that's the "normal" AVX performance, the AVX2 code is the problem...

aead commented 7 years ago

Replacing PINSRQ with VPINSRQ and adding VZEROUPPER doesn't change anything - I'll look at this again tomorrow...

fwessels commented 7 years ago

A performance drop like this can very well be related to unintended saving of the upper half of the AVX registers - so my suggestion would indeed be to see if VZEROUPPER etc. can help (or, even better, to avoid needing it altogether).

aead commented 7 years ago

Okay, I found the problem: SSE-AVX state transitions in the LOAD_MSG_AVX2 macro. The critical instructions are PINSRQ and MOVQ - I replaced them with VPINSRQ and VMOVQ and got:
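For illustration, this is the shape of the fix being described. The operands below are paraphrased, not copied from the actual LOAD_MSG_AVX2 macro: the point is that the legacy-SSE encodings are swapped for their VEX-encoded equivalents so the CPU never leaves AVX state.

```asm
// Legacy SSE encodings inside an AVX2 routine can trigger an
// expensive SSE-AVX state transition on some CPUs:
MOVQ    0(SI), X12           // SSE encoding
PINSRQ  $1, 16(SI), X12      // SSE encoding

// VEX-encoded equivalents stay in AVX state and avoid the penalty:
VMOVQ   0(SI), X12
VPINSRQ $1, 16(SI), X12, X12
```

The registers and offsets here are illustrative; the real macro loads several message words this way per round.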

AVX2
BenchmarkWrite128-12         5000000           263 ns/op     485.06 MB/s
BenchmarkWrite1K-12          1000000          1617 ns/op     633.24 MB/s
BenchmarkSum128-12           5000000           292 ns/op     438.23 MB/s
BenchmarkSum1K-12            1000000          1707 ns/op     599.73 MB/s

I'll replace them and optimize the code to get the "old" performance back... This affects the AVX code as well...

harshavardhana commented 7 years ago

Thanks @aead - this looks good.

fwessels commented 7 years ago

Nice, you really have to be careful to use the correct "flavour" of the instruction...

aead commented 7 years ago

So, I modified the AVX and AVX2 code to use only AVX instructions and got this on the XEON machine:

BenchmarkWrite128-12         5000000           228 ns/op     561.39 MB/s
BenchmarkWrite1K-12          1000000          1131 ns/op     904.78 MB/s
BenchmarkSum128-12          10000000           222 ns/op     575.70 MB/s
BenchmarkSum1K-12            1000000          1198 ns/op     854.54 MB/s

I submitted the code for review and will send a PR to minio as soon as the code passes the review process.

@fwessels I'm also a little bit surprised that the MOVOU (SSE) instruction in compressAvx_amd64.s doesn't lead to similar performance problems as the MOVQ instruction. But maybe the XMM,m128 moves are not affected by the register state saving?!

fwessels commented 7 years ago

@aead Good point - at least it doesn't on my MacBook Pro (which has AVX but lacks AVX2). On the server where we have AVX2 I never actually benchmarked the AVX version (it would have been slower anyway), so perhaps it would have shown similar behaviour.

aead commented 7 years ago

@fwessels I tested the AVX implementation (blake2b-simd) on the Xeon and got the expected performance - so MOVOU (for example) doesn't seem to trigger SSE-AVX transitions... Nice to know :-)

fwessels commented 7 years ago

@aead Yes, that is good to know (otherwise I'd probably have noticed it - if you get it wrong there are typically significant penalties...)