There's a performance improved fork of this project

rogers0 commented 6 years ago

I'm debian pkg maintainer of your reedsolomon project. It's got my attention that there's a fork of your project that claims big performance boost:

https://github.com/templexxx/reedsolomon

I asked the fork author to send the improvement patch to you upstream, but he/she seems kinda not interested in it.

https://github.com/templexxx/reedsolomon/issues/9

I looked at two project and find there's much differences that beyond my ability to make such patches. So maybe you're interested in those improvements and can take a look at the project? Thanks!

klauspost commented 6 years ago

Honestly, I don't think the difference is that big. You can tweak the benchmarks in this package to operate on data in caches.

Running on my local Desktop (Core i7-2600 @3.4Ghz):

klauspost:
BenchmarkEncode10x4x16M-4                     30          41895236 ns/op        4004.56 MB/s
BenchmarkEncode10x4x16M-2                     30          45082120 ns/op        3721.48 MB/s
BenchmarkEncode10x4x16M                       20          72077065 ns/op        2327.68 MB/s

templexxx:
BenchmarkEnc/#01/10+4_16777216-8              20          63102500 ns/op        2658.72 MB/s

So single core performance is a big better, but nothing mindblowing.

I looked at the code, and it is pretty much the same. It has a bit more unrolling, but the code is memory limited anyway. I have privately done an unrolled version, but it didn't make any practical difference, so I just dropped it to keep complexity down.

Adding Cauchy matrix would be neat.

klauspost commented 6 years ago

Added #70 with Cauchy matrix.

templexxx commented 6 years ago

@klauspost

Thank your for your contribution, you really help me a lot 👍

Yes, the code is memory limited, and the loop-unrolling can't help much. ( but as a green hands, it's a good practice for me, :D)

I made some tricks for cache-friendly, so it will make performance improving when the shards size is big or the size can't divisible by 16 or the size is very small ( < 4KB).

for example: I split the shard ( about 16KB) to fit the L1 data cache, it's good for big shards

So I think BenchmarkEnc/#01/10+4_16777216-8 should be much faster than 2658.72 MB/s, maybe there was something wrong with that? It was much slower than I expect

I hope I'm not bothering you

Thanks

klauspost commented 6 years ago

It is probably because I am on a system without AVX2. Here is a benchmark with AVX2:

I did some tests, and it seems like the "maximum goroutines" could benefit from some adjustments:

templexxx:
BenchmarkEnc/#01/10+4_16777216                50          30111190 ns/op        5571.75 MB/s

WithMaxGoroutines(512)
$go test -short -bench=kEncode10x4x16M -cpu=8,4,2,1
pkg: github.com/klauspost/reedsolomon
BenchmarkEncode10x4x16M-8            100          13746566 ns/op        12204.66 MB/s
BenchmarkEncode10x4x16M-4            100          14333118 ns/op        11705.21 MB/s
BenchmarkEncode10x4x16M-2            100          21885522 ns/op        7665.90 MB/s
BenchmarkEncode10x4x16M               30          37515470 ns/op        4472.08 MB/s

WithMaxGoroutines(50) (default)
$go test -short -bench=kEncode10x4x16M -cpu=8,4,2,1
pkg: github.com/klauspost/reedsolomon
BenchmarkEncode10x4x16M-8             30          65574413 ns/op        2558.50 MB/s
BenchmarkEncode10x4x16M-4             30          47900543 ns/op        3502.51 MB/s
BenchmarkEncode10x4x16M-2             50          33218350 ns/op        5050.59 MB/s
BenchmarkEncode10x4x16M               20          52264020 ns/op        3210.09 MB/s

So there is still some performance to be had with some tweaks.

templexxx commented 6 years ago

@klauspost

Thank you for your testing

templexxx commented 6 years ago

@rogers0

@klauspost and I do same job in different way, we have same core codes. But I make it work on a single goroutine.

So I think it's hard to merge my codes to his.

And as @klauspost said, the code is memory limited. So I think there is no need to do that either.

I think klauspost can do much better job than me in many ways, I still need to learn from him a lot of things.

fwessels commented 6 years ago

We also tested against this project but did not find significant performance differences between the two projects.

templexxx commented 6 years ago

@fwessels

yes, klauspost/reedsolomon run on multi-cores, mine is not.

that's the main difference

klauspost / reedsolomon

There's a performance improved fork of this project #69