klauspost / reedsolomon

Reed-Solomon Erasure Coding in Go
MIT License

Optimize ReconstructSome for leopard 8+16 #272

Open elias-orijtech opened 8 months ago

elias-orijtech commented 8 months ago

CC @odeke-em

odeke-em commented 8 months ago

/cc @klauspost @liamsi @musalbas; continued work on reconstructSome thanks to Elias!

liamsi commented 8 months ago

I think this change makes sense as is. But I originally thought there was a way to recompute only the required shares, and that this is what the other implementation was doing. @klauspost, can you confirm that the non-leopard implementation operates only on the required shares?

It looks like this only operates on the required shards: https://github.com/klauspost/reedsolomon/blob/5b85c725bcb6b785d9a273a267eb698b1f02c262/reedsolomon.go#L1560-L1606

Also, I agree with @klauspost that benchmarks/numbers would help us understand the impact of this (if any).

klauspost commented 8 months ago

Leopard is pyramid encoded, so several layers are needed to reconstruct a bottom shard.

liamsi commented 8 months ago

> Leopard is pyramid encoded, so several layers are needed to reconstruct a bottom shard.

It's been too long since I looked into leopard myself, and I don't fully understand this, tbh. From a high-level perspective, it should be possible to evaluate a polynomial given enough points (i.e. n+1) in O(n), with n being the degree of the polynomial, without having to fully recompute that polynomial. Are you saying that leopard works differently / does not allow this and has to do its O(n log n) computations either way?

liamsi commented 8 months ago

If you look at the paper, what I mean is that you should be able to save computations by simply not adding missing shard positions to the set E if they are not required:

[screenshot of the decoding algorithm from the paper]

https://arxiv.org/pdf/1404.3458.pdf

In the implementation this should correspond to these lines: https://github.com/klauspost/reedsolomon/blob/162f2bad346238488d8652bdf46292cb947b19b7/leopard.go#L423-L425. So a simple check whether they are required before setting them could do the job. But it might not be as easy as that: other parts of the code might operate under the assumption that if a position is not in E, then there was no error or erasure. So there might be other changes necessary still.
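The proposal above can be sketched generically. This is purely illustrative: the type and method names only loosely mirror leopard.go's `errorBitfield`, and the real code uses bit arrays rather than a map:

```go
package main

import "fmt"

// errorBitfield is a hypothetical stand-in for leopard.go's error set E.
type errorBitfield struct {
	set map[int]bool
}

// markMissing adds missing shard positions to E. The original code adds
// every missing position; the proposal adds only positions that are both
// missing and required, which may shrink the work the decoder does.
func (e *errorBitfield) markMissing(shards [][]byte, required []bool) {
	for i := range shards {
		missing := len(shards[i]) == 0
		if missing && required[i] {
			e.set[i] = true
		}
	}
}

func main() {
	e := &errorBitfield{set: map[int]bool{}}
	shards := [][]byte{{1}, nil, {3}, nil} // shards 1 and 3 missing
	required := []bool{false, true, false, false}
	e.markMissing(shards, required)
	fmt.Println(e.set) // only position 1 enters E; position 3 is skipped
}
```

As the comment notes, the risk is exactly the caveat liamsi raises: downstream code may assume "not in E" means "present and correct", so skipping positions needs an audit of every consumer of E.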

klauspost commented 8 months ago

Sorry. I don't read math. I can't even tell if the paper relates to Leopard.

Reading the code it seems like input is divided into "mips" (probably derived from mip-maps from 3D) that are stacked for the final output. (*errorBitfield).prepare() converts missing shards into the mips that are needed.

This makes sense given the O(N*log(N)) characteristics of the encoding, and it means that to decode one shard, all mip levels above it have to be resolved. isNeeded resolves whether a given mip level is needed for the reconstruction.

There may very well be additional optimization possible here, but I have no idea where to start looking for it.

liamsi commented 8 months ago

I think it is important that we have a way to compare performance, e.g. if we are trying to recover a single shard as required. Otherwise we don't know if this PR (or changes to it) improves anything.

klauspost commented 8 months ago

> I think it is important that we have a way to compare performance, e.g. if we are trying to recover a single shard as required. Otherwise we don't know if this PR (or changes to it) improves anything.

Yes. Currently BenchmarkDecode1K tests (among others) the time for reconstructing a single shard (leopard-gf16-single). You can see the numbers for that in this chart - "Recover One" column.

This will probably not have changed those numbers, since the only difference from using ReconstructSome is that it will ignore some other missing shards. So a similar benchmark that tests ReconstructSome with some mutations is probably needed. I wouldn't expect it to be faster than "Recover One".

(But do note that 1K is so small the setup time is mostly dominating this particular benchmark)

liamsi commented 8 months ago

Let me try what I wrote above (inspired by the paper that leopard is based on) and see if only adding required shards to the error bits breaks anything, and whether it actually yields any noticeable performance gains.

elias-orijtech commented 8 months ago

Based on #274, I've included the optimization for the error bits. See https://github.com/klauspost/reedsolomon/pull/272/commits/0ae8b020f6185da93505c31ba276614082d156f9 for details. In short:

The new test fails in the regular non-leopard configuration with:

```
% go test -run ReconstructSome -v -short
Using pure Go
=== RUN   TestReconstructSome
panic: runtime error: slice bounds out of range [:65664] with capacity 0

goroutine 70 [running]:
github.com/klauspost/reedsolomon.(*reedSolomon).codeSomeShardsP.func1(0x0?, 0x10080)
    /Users/a/proj/orijtech/reedsolomon/reedsolomon.go:979 +0x2a4
created by github.com/klauspost/reedsolomon.(*reedSolomon).codeSomeShardsP in goroutine 34
    /Users/a/proj/orijtech/reedsolomon/reedsolomon.go:1011 +0x248
exit status 2
FAIL    github.com/klauspost/reedsolomon    0.622s
```

but I've so far failed to find an error in the test (and the leopard configurations succeed). @klauspost, can you spot my error?

elias-orijtech commented 8 months ago

Another issue with optimizing errBits is that the leopard8 cache is no longer correct: https://github.com/klauspost/reedsolomon/blob/4e91954739f00400b1c60e81096cb56fa2aea207/leopard8.go#L531, because there is no longer a one-to-one relationship between errBits and errLocs.
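The cache problem can be illustrated generically: once the error bits depend on which shards are required, not just on which are missing, a cache keyed on the erasure pattern alone can return stale plans. A sketch with purely illustrative names (the real leopard8 cache is keyed differently):

```go
package main

import "fmt"

// planKey is a hypothetical cache key. Keying on the erasure pattern
// alone conflates requests that need different amounts of work; adding
// the required set keeps the entries distinct.
type planKey struct {
	erased   uint64 // bitmask of missing shards
	required uint64 // bitmask of requested shards
}

func main() {
	cache := map[planKey]string{}
	erased := uint64(0b1010) // shards 1 and 3 missing
	cache[planKey{erased, 0b0010}] = "plan for shard 1 only"
	cache[planKey{erased, 0b1010}] = "plan for shards 1 and 3"
	// Same erasure pattern, different required sets -> two entries.
	fmt.Println(len(cache))
}
```

elias-orijtech's actual fix (next comment) takes the other route: drop errBits from the cache entirely and recompute it each time.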

elias-orijtech commented 8 months ago

I've fixed the caching issue by no longer including errBits in the cache, at the cost of calling errBits.prepare even for cache hits.

klauspost commented 8 months ago

Great stuff. I will take a look as soon as I get some extra time.

odeke-em commented 6 months ago

Kind ping @elias-orijtech @klauspost :-)