Closed rogers0 closed 6 years ago
Honestly, I don't think the difference is that big. You can tweak the benchmarks in this package to operate on data in caches.
Running on my local desktop (Core i7-2600 @ 3.4 GHz):
klauspost:
BenchmarkEncode10x4x16M-4 30 41895236 ns/op 4004.56 MB/s
BenchmarkEncode10x4x16M-2 30 45082120 ns/op 3721.48 MB/s
BenchmarkEncode10x4x16M 20 72077065 ns/op 2327.68 MB/s
templexxx:
BenchmarkEnc/#01/10+4_16777216-8 20 63102500 ns/op 2658.72 MB/s
So single-core performance is a bit better, but nothing mindblowing.
I looked at the code, and it is pretty much the same. It has a bit more unrolling, but the code is memory limited anyway. I have privately done an unrolled version, but it didn't make any practical difference, so I just dropped it to keep complexity down.
Adding Cauchy matrix would be neat.
Added #70 with Cauchy matrix.
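For readers curious what a Cauchy matrix buys here: every square submatrix of a Cauchy matrix is invertible, which is exactly the property Reed-Solomon reconstruction needs. Below is a minimal, hypothetical sketch of one common construction over GF(2^8) using the field polynomial 0x11d (the one this library uses); the function names are illustrative and this is not the code from #70:

```go
package main

import "fmt"

// gfMul multiplies two elements of GF(2^8) with the reducing
// polynomial x^8+x^4+x^3+x^2+1 (0x11d), via shift-and-add.
func gfMul(a, b byte) byte {
	var p byte
	for b > 0 {
		if b&1 != 0 {
			p ^= a
		}
		carry := a & 0x80
		a <<= 1
		if carry != 0 {
			a ^= 0x1d // low byte of 0x11d after the shift-out
		}
		b >>= 1
	}
	return p
}

// gfInv finds the multiplicative inverse by brute force,
// which is fine for a one-time search over 255 candidates.
func gfInv(a byte) byte {
	for b := 1; b < 256; b++ {
		if gfMul(a, byte(b)) == 1 {
			return byte(b)
		}
	}
	panic("0 has no inverse")
}

// cauchyMatrix builds a (data+parity) x data encoding matrix:
// identity rows on top (data shards pass through unchanged) and
// Cauchy rows below, m[r][j] = 1/(r ^ j) in GF(2^8), where the
// row indices r and column indices j never collide, so r^j != 0.
func cauchyMatrix(dataShards, parityShards int) [][]byte {
	m := make([][]byte, dataShards+parityShards)
	for i := range m {
		m[i] = make([]byte, dataShards)
	}
	for i := 0; i < dataShards; i++ {
		m[i][i] = 1
	}
	for i := 0; i < parityShards; i++ {
		for j := 0; j < dataShards; j++ {
			m[dataShards+i][j] = gfInv(byte(dataShards+i) ^ byte(j))
		}
	}
	return m
}

func main() {
	m := cauchyMatrix(10, 4)
	fmt.Println(len(m), len(m[0])) // 14 10
}
```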
@klauspost
Thank you for your contribution, you really helped me a lot 👍
Yes, the code is memory limited, and loop unrolling can't help much. (But as a beginner, it was good practice for me, :D)
I added some cache-friendly tricks, so performance improves when the shard size is big, when the size isn't divisible by 16, or when the size is very small (< 4KB).
For example: I split each shard into pieces (about 16KB) that fit the L1 data cache, which helps with big shards.
So I think BenchmarkEnc/#01/10+4_16777216-8 should be much faster than 2658.72 MB/s; maybe something went wrong there? It was much slower than I expected.
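To make the cache-blocking idea concrete, here is a minimal, hypothetical sketch in Go. It uses plain XOR instead of the real GF(2^8) multiply-accumulate, since the point is only the loop structure: walk the shards in ~16 KB pieces so each piece is still resident in L1 while every pass that needs it consumes it, instead of streaming each full shard through the cache repeatedly:

```go
package main

import "fmt"

const chunk = 16 << 10 // ~16 KB, sized to stay resident in the L1 data cache

// encodeChunked computes one parity shard as the XOR of all data
// shards, processing the shard range chunk by chunk. Real
// Reed-Solomon does a GF(2^8) multiply-accumulate per matrix row,
// but the cache-blocking outer loop is the same idea.
func encodeChunked(data [][]byte, parity []byte) {
	size := len(parity)
	for off := 0; off < size; off += chunk {
		end := off + chunk
		if end > size {
			end = size
		}
		// Every data shard's [off:end] piece is touched while the
		// matching parity piece is still hot in cache.
		for _, d := range data {
			seg := d[off:end]
			out := parity[off:end]
			for i := range seg {
				out[i] ^= seg[i]
			}
		}
	}
}

func main() {
	data := make([][]byte, 10)
	for i := range data {
		data[i] = make([]byte, 1<<20)
		for j := range data[i] {
			data[i][j] = byte(i + j)
		}
	}
	parity := make([]byte, 1<<20)
	encodeChunked(data, parity)
	fmt.Printf("%x\n", parity[:2])
}
```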
I hope I'm not bothering you
Thanks
It is probably because I am on a system without AVX2. Here is a benchmark with AVX2:
I did some tests, and it seems like the "maximum goroutines" could benefit from some adjustments:
templexxx:
BenchmarkEnc/#01/10+4_16777216 50 30111190 ns/op 5571.75 MB/s
WithMaxGoroutines(512)
$go test -short -bench=kEncode10x4x16M -cpu=8,4,2,1
pkg: github.com/klauspost/reedsolomon
BenchmarkEncode10x4x16M-8 100 13746566 ns/op 12204.66 MB/s
BenchmarkEncode10x4x16M-4 100 14333118 ns/op 11705.21 MB/s
BenchmarkEncode10x4x16M-2 100 21885522 ns/op 7665.90 MB/s
BenchmarkEncode10x4x16M 30 37515470 ns/op 4472.08 MB/s
WithMaxGoroutines(50) (default)
$go test -short -bench=kEncode10x4x16M -cpu=8,4,2,1
pkg: github.com/klauspost/reedsolomon
BenchmarkEncode10x4x16M-8 30 65574413 ns/op 2558.50 MB/s
BenchmarkEncode10x4x16M-4 30 47900543 ns/op 3502.51 MB/s
BenchmarkEncode10x4x16M-2 50 33218350 ns/op 5050.59 MB/s
BenchmarkEncode10x4x16M 20 52264020 ns/op 3210.09 MB/s
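For context, `WithMaxGoroutines` is a real option of klauspost/reedsolomon, passed at construction time, e.g. `reedsolomon.New(10, 4, reedsolomon.WithMaxGoroutines(512))`. The sketch below is a simplified, hypothetical illustration (stdlib only, plain XOR in place of GF math) of what the knob controls: the encode range is split into at most N concurrently processed pieces, so large shards gain parallelism while small shards mostly pay scheduling overhead:

```go
package main

import (
	"fmt"
	"sync"
)

// parallelXOR splits the shard range into up to maxGoroutines
// pieces and XORs the data shards into parity concurrently.
// More goroutines means smaller pieces per worker: a win on
// large shards, mostly overhead on small ones.
func parallelXOR(data [][]byte, parity []byte, maxGoroutines int) {
	size := len(parity)
	piece := (size + maxGoroutines - 1) / maxGoroutines
	var wg sync.WaitGroup
	for off := 0; off < size; off += piece {
		end := off + piece
		if end > size {
			end = size
		}
		wg.Add(1)
		go func(off, end int) {
			defer wg.Done()
			for _, d := range data {
				for i := off; i < end; i++ {
					parity[i] ^= d[i]
				}
			}
		}(off, end)
	}
	wg.Wait()
}

func main() {
	data := [][]byte{{1, 2, 3, 4}, {4, 3, 2, 1}}
	parity := make([]byte, 4)
	parallelXOR(data, parity, 2)
	fmt.Println(parity) // [5 1 1 5]
}
```

The pieces are disjoint ranges of the same slices, so no locking is needed beyond the final WaitGroup barrier.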
So there is still some performance to be had with some tweaks.
@klauspost
Thank you for testing.
@rogers0
@klauspost and I do the same job in different ways; we share the same core code, but I make it work on a single goroutine.
So I think it would be hard to merge my code into his.
And as @klauspost said, the code is memory limited, so I don't think there is much need to either.
klauspost can do a much better job than me in many ways; I still have a lot to learn from him.
We also tested against this project but did not find significant performance differences between the two projects.
@fwessels
Yes, klauspost/reedsolomon runs on multiple cores; mine does not.
That's the main difference.
I'm the Debian package maintainer of your reedsolomon project. It has come to my attention that there's a fork of your project that claims a big performance boost:
I asked the fork's author to send the improvements upstream as a patch, but they don't seem interested.
I looked at the two projects and found the differences are beyond my ability to write such patches myself. So maybe you're interested in those improvements and could take a look at the project? Thanks!