rust-rse / reed-solomon-erasure

[Looking for new owners/maintainers, see #88] Rust implementation of Reed-Solomon erasure coding
MIT License

Override limit on 256 shards #33

Closed rphmeier closed 5 years ago

rphmeier commented 5 years ago

I have a project where I need a reed-solomon encoding which may have thousands of pieces.

Can you explain the limitation of 256 shards and what it would take to lift it?

rphmeier commented 5 years ago

https://github.com/klauspost/reedsolomon/issues/78

I found this, which indicates that it has to do with the Galois field implementation.

rphmeier commented 5 years ago

(making the encoder generic over field is some work we might be interested in)

darrenldl commented 5 years ago

Yep, you're correct, it is related to the Galois field implementation. Specifically, both the library you linked and this library use a GF(2^8) implementation, which means the coding matrix is at most 256x256 in size.

The more specific explanation is available in the paper/document here (page 10 onward).

Essentially, the encoding and decoding matrices are derived from a Vandermonde matrix, which needs to uphold row-wise linear independence (see page 11 of the paper).

To see why going beyond 2^m rows would violate this condition, consider the following. Pick the field GF(2^m); it has 2^m distinct values, allowing us to generate 2^m rows following the Vandermonde matrix pattern. If we go beyond 2^m - 1 (the maximum value in the field) to generate a new row, field arithmetic wraps us back to one of the values in [0, 2^m - 1], so the new row duplicates one of the existing rows, violating linear independence.
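The wrap-around can be demonstrated concretely. The sketch below (hypothetical helper names, not this crate's API) builds Vandermonde rows over GF(2^8): the row for evaluation point x is [x^0, x^1, x^2, ...], and since the field contains only 2^8 = 256 distinct points, any 257th row must reuse a point and duplicate an earlier row.

```rust
/// Russian-peasant multiplication in GF(2^8), reduced by the
/// polynomial x^8 + x^4 + x^3 + x + 1 (0x11B).
fn gf256_mul(mut a: u8, mut b: u8) -> u8 {
    let mut p = 0u8;
    while b != 0 {
        if b & 1 != 0 { p ^= a; }
        let carry = a & 0x80 != 0;
        a <<= 1;
        if carry { a ^= 0x1B; }
        b >>= 1;
    }
    p
}

/// First `k` entries of the Vandermonde row for evaluation point `x`.
fn vandermonde_row(x: u8, k: usize) -> Vec<u8> {
    let mut row = Vec::with_capacity(k);
    let mut pow = 1u8; // x^0
    for _ in 0..k {
        row.push(pow);
        pow = gf256_mul(pow, x);
    }
    row
}

fn main() {
    // Exactly 256 distinct evaluation points exist, hence at most 256 rows.
    let rows: Vec<Vec<u8>> = (0u16..256).map(|x| vandermonde_row(x as u8, 4)).collect();
    assert_eq!(rows.len(), 256);
    // Wrapping around to "point 256" (which reduces to 0) repeats row 0.
    assert_eq!(vandermonde_row(0, 4), rows[0]);
}
```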

darrenldl commented 5 years ago

To lift the restriction, a new implementation using at least GF(2^16) is needed. The exponent is usually a multiple of 8, to align with the fact that bytes are usually 8 bits in size. So GF16 (i.e. GF(2^16)) would process 16 bits at a time.

I personally don't have the time to investigate that, and I also lack the expertise to write the required SIMD-accelerated code based on this paper (in a reasonable time frame, anyway, as it's not my typical field of research or hobby).

I might be able to do a simple Rust implementation of GF16 without SIMD acceleration down the road, but I won't have time for the next 2-3 months. Also, if the usage is not too common, I might not be willing to sink too much time into this. Personally, GF8 has sufficed for my needs thus far, so I'm not overly incentivised.

darrenldl commented 5 years ago

Mind sharing some of the requirements here/via email? Depending on your requirements, the current implementation may suffice with specific shard arrangements.

rphmeier commented 5 years ago

Our situation: we have a piece of data which will likely be a few megabytes but potentially much larger.

We want to erasure-code it into n pieces, with f+1 data shards (n = 3f+k, 1 <= k <= 3), such that any f+1 shards can recover the original. In practice, n will be less than 2^16, but to be safe we would prefer to use either GF(2^32), GF((2^16)^2), or perhaps some large prime field (being mildly unconvinced that binary fields are fast in software).

The coding is not likely to be the bottleneck for our use-case as long as it is reasonably fast -- and we have the dev resources to point at adding new field implementations to the crate, although probably not for SIMD-optimizing them at this point.

burdges commented 5 years ago

It's extremely helpful just to point us to the papers you believe relevant, so thank you! :)

burdges commented 5 years ago

Just FYI, we're actually implementing a variant of this proposal, so the point is to have numerous pieces with which to prove data was shared, not to be optimal, and GF(2^16) gives us 16k/size pieces.

We're not necessarily doing the two-dimensional encoding in that paper, so maybe ideas like fountain codes actually serve this use case best. In fact, repairable fountain codes are designed for use cases more similar to ours than most fountain codes are, but Reed-Solomon is easy to understand. :)

darrenldl commented 5 years ago

I looked at the proposal, and even with the original scheme at page 13, the size limits of (byte level) Reed-Solomon codes are still the limiting factors of how much data can be accommodated, so I can see why one might as well just go with (block level) Reed-Solomon erasure codes.

The authors of the paper on SIMD Reed-Solomon made an implementation, as it turns out (I didn't read it very carefully when I started this project); a mirror of it can be seen here, and their active dev repo can be seen here.

Their core(?) library jerasure is potentially worth looking at as well.

At 16 bits, a single naive lookup table/matrix is 4 GiB in size; at 32 bits, the table/matrix size is well beyond reasonable. I am uncertain how they do it without using a lot of memory, since I haven't investigated deeply, but this seems like a good starting point should you pursue GF(2^16), GF(2^32), or GF(2^128) support (I might have gotten the numbers wrong, since I'm only judging from the file names).
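For what it's worth, a full product table isn't the only option: the classic log/exp-table construction needs only two 65536-entry tables (~256 KiB total) for GF(2^16), since a*b = exp[log a + log b]. A minimal sketch, assuming the primitive polynomial x^16 + x^12 + x^3 + x + 1 (0x1100B); the names here are illustrative, not from any of the libraries mentioned:

```rust
const POLY: u32 = 0x1100B; // x^16 + x^12 + x^3 + x + 1, primitive over GF(2)
const ORDER: usize = 1 << 16;

struct Gf16 {
    exp: Vec<u16>, // exp[i] = x^i
    log: Vec<u16>, // log[a] = i such that x^i = a (for a != 0)
}

impl Gf16 {
    fn new() -> Self {
        let mut exp = vec![0u16; ORDER];
        let mut log = vec![0u16; ORDER];
        let mut state: u32 = 1;
        // x is primitive, so its powers enumerate all 2^16 - 1 nonzero elements.
        for i in 0..(ORDER - 1) {
            exp[i] = state as u16;
            log[state as usize] = i as u16;
            state <<= 1;
            if state & 0x10000 != 0 {
                state ^= POLY; // reduce modulo the field polynomial
            }
        }
        Gf16 { exp, log }
    }

    fn mul(&self, a: u16, b: u16) -> u16 {
        if a == 0 || b == 0 {
            return 0;
        }
        let idx = (self.log[a as usize] as usize
                 + self.log[b as usize] as usize) % (ORDER - 1);
        self.exp[idx]
    }
}

fn main() {
    let gf = Gf16::new();
    assert_eq!(gf.mul(1, 0x1234), 0x1234);  // 1 is the multiplicative identity
    assert_eq!(gf.mul(7, 9), gf.mul(9, 7)); // commutativity
    assert_eq!(gf.mul(0, 0xFFFF), 0);       // 0 annihilates
    println!("log/exp tables use {} bytes total", 2 * 2 * ORDER);
}
```

This trades one table lookup for three plus a modular add, which is exactly the kind of scalar baseline the SIMD techniques then try to beat.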

darrenldl commented 5 years ago

> The coding is not likely to be the bottleneck for our use-case as long as it is reasonably fast -- and we have the dev resources to point at adding new field implementations to the crate, although probably not for SIMD-optimizing them at this point.

The new field implementations would be very much appreciated. I am uncertain how much help I can be for the next 2-3 months, as I'll be dedicating my time completely to a research project to try to meet a conference submission deadline. I might be able to chime in or take a look every now and then, but I absolutely can't guarantee any time at all.

Once that busy period ends, however, I am happy to review, refactor code, and add tests.

darrenldl commented 5 years ago

> We're not necessarily doing the two-dimensional encoding in that paper, so maybe ideas like fountain codes actually serve this use case best. In fact, repairable fountain codes are designed for more similar use cases than most fountain codes, but reed-solomon is easy to understand. :)

Sorry, I missed this text. Yep, fountain codes do seem better, if I'm understanding the limitless-sequence property correctly, anyway.

Personally, I feel that if fountain codes have sufficient performance, they would be the most flexible and scalable option. Otherwise, Reed-Solomon erasure codes at least have SIMD acceleration, and reasonably simple operations overall, which warrants okay performance even when SIMD is not available.

Just as a reference for time requirements, in case you do pursue the Reed-Solomon route: when porting klauspost/reedsolomon to Rust, I think it took me 1-2 weeks, then another 1-2 weeks for the test suite. Since the current test suite can be used as a reference, the most time-consuming part would be the new field implementations.

You might be able to use the SIMD C files from the GF-Complete project almost directly via C FFI, but again, I'm not sure how the tables and other data structures are arranged.

burdges commented 5 years ago

Thanks again! We expect GF(2^16) would be implemented as a field extension of GF(2^8), so you'd have a second table of size 256 for the field extension. And @drskalman said he has implemented larger binary fields this way before. I've no idea if that's actually the fastest approach, but it sounds reasonable.
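To make the extension-field idea concrete, here is a hypothetical sketch (none of these names come from the crate or from @drskalman's code): represent a GF(2^16) element as a pair (hi, lo) meaning hi*y + lo over GF(2^8), where y is a root of an irreducible quadratic y^2 + y + K. Multiplication then needs only GF(2^8) arithmetic plus the reduction y^2 = y + K.

```rust
/// Russian-peasant multiplication in GF(2^8), reduced by
/// x^8 + x^4 + x^3 + x + 1 (0x11B).
fn gf256_mul(mut a: u8, mut b: u8) -> u8 {
    let mut p = 0u8;
    while b != 0 {
        if b & 1 != 0 { p ^= a; }
        let carry = a & 0x80 != 0;
        a <<= 1;
        if carry { a ^= 0x1B; }
        b >>= 1;
    }
    p
}

/// Find K making y^2 + y + K irreducible over GF(2^8):
/// K must not equal a^2 + a for any a (brute-force check).
fn find_ext_constant() -> u8 {
    'outer: for k in 1..=255u8 {
        for a in 0..=255u8 {
            if gf256_mul(a, a) ^ a == k { continue 'outer; }
        }
        return k;
    }
    unreachable!("half of all constants are valid choices");
}

/// Multiply (a1*y + a0) by (b1*y + b0), reducing with y^2 = y + K.
fn gf2_16_mul(a: (u8, u8), b: (u8, u8), k: u8) -> (u8, u8) {
    let (a1, a0) = a;
    let (b1, b0) = b;
    let hh = gf256_mul(a1, b1); // coefficient of y^2 before reduction
    let hi = gf256_mul(a1, b0) ^ gf256_mul(a0, b1) ^ hh;
    let lo = gf256_mul(a0, b0) ^ gf256_mul(hh, k);
    (hi, lo)
}

fn main() {
    let k = find_ext_constant();
    let one = (0u8, 1u8);
    let x = (0x12u8, 0x34u8);
    assert_eq!(gf2_16_mul(one, x, k), x);                           // identity
    assert_eq!(gf2_16_mul(x, (5, 7), k), gf2_16_mul((5, 7), x, k)); // commutative
    println!("using y^2 + y + {:#04x}", k);
}
```

Whether this beats a direct log/exp-table GF(2^16) depends on cache behavior; the appeal is that only the 256-entry GF(2^8) machinery needs to exist.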

Just if you're curious but kinda off topic here:

In our situation, there is a proof that each piece is included in some commitment, so probably a Merkle tree proof, but accumulators work too. In the paper noted above, this proof lives outside the chunks, but I'd wondered if it could be put inside the original data chunks, and have the erasure coding commute with the proof, so that each piece of the encoded data came with a proof of inclusion, including the pieces constructed with the erasure code. I then noticed the SWIFFT hash function looked relevant due to its homomorphic property, so I asked the authors of the SWIFFT paper if this made any sense, and...

Alon Rosen confirmed that yes, SWIFFT is applicable to what they call "authenticated data structures", and pointed me to this paper as being around the state of the art in solving this self-authenticating-chunks problem: http://users.umiacs.umd.edu/~zhangyp/papers/p129-qian.pdf

There is no interaction with an erasure code there, but it's an interesting direction. :)

darrenldl commented 5 years ago

> Thanks again! We expect GF(2^16) would be implemented as a field extension of GF(2^8), so you'd have a second table of size 256 for the field extension. And @drskalman said he implemented larger binary fields this way before. I've no idea if that's actually the fastest approach, but sounds reasonable.

Oh nice! That's interesting, now I'm curious how the maths works, haha.

> In our situation, there is a proof that each piece is included in some commitment, so probably a merkle tree proof, but accumulators work too. In that paper noted above, this proof lives outside the chunks, but I'd wondered if it could be put inside the original data chunks, and have the erasure coding commute with the proof, so that each piece of the encoded data came with a proof of inclusion, including the pieces constructed with the erasure code. I then noticed the SWIFFT hash function looked relevant due to its homomorphic property, so I asked the authors of the SWIFFT paper if this made any sense, and..
>
> Alon Rosen confirmed that yes SWIFFT was applicable to what they call "authenticated data structures" and pointed me to this paper as being around the state of the art in solving this self authenticating chunks problem: http://users.umiacs.umd.edu/~zhangyp/papers/p129-qian.pdf

I'll revisit your paragraphs later when I'm more awake. Thanks for the pointers to the papers as well! They certainly look interesting as they feel pretty related to what I'm working on (protocol verification) and what I study as hobbies (logic and security).

I am interested in the application of cryptographic primitives etc. in general, but I'm definitely leaning on the logic side of things rather than the maths side.

rphmeier commented 5 years ago

Opened #34 with some cleanup in preparation for making ReedSolomon generic over Field. We are doing a GF(2^16) or GF(2^32) implementation which will implement this trait.

darrenldl commented 5 years ago

Thanks for the PR! I went through briefly and will have a closer look later.

I noticed the SIMD acceleration is removed entirely, which is used by some people in another project.

I'm curious if it would be ported back later, and also, roughly, how you plan to expose both GF(2^8) and GF(2^16) (or GF(2^32)) in the final API.

rphmeier commented 5 years ago

Isn't LLVM really good at lifting loops to SIMD instructions? Re-introducing it would not be hard, but I'm not yet convinced that tricks like manual unrolling and SIMD invocations aren't actually deoptimizations in general.

darrenldl commented 5 years ago

I'm afraid not introducing it back would be detrimental in some use cases. The specific techniques used (see paper) allow a 4x-10x speed-up depending on CPU core count and SIMD support.

(My memory is fuzzy on this, so I might be wrong.) More specifically, the techniques speed up table lookup by using shuffle instructions, essentially allowing 4-8 simultaneous table lookups instead of just one. In other words, it's far more than just speeding up loops.

Just to give some numbers, the pure Rust version (release build) runs at ~1 GiB/s on my laptop, and the SIMD-enabled one runs at ~4.5 GiB/s.

TL;DR: LLVM is really good, but the techniques are highly non-trivial and required expert insight to develop, so automatic analysis can't possibly catch up with them.
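The core trick from the paper, in scalar form (this is an illustrative sketch, not the crate's actual kernel): to multiply a whole buffer by a constant c in GF(2^8), precompute two 16-entry tables indexed by the low and high nibble of each byte; since GF(2^8) multiplication distributes over XOR, c*x = lo_table[x & 0xF] ^ hi_table[x >> 4]. A 16-lane byte shuffle (e.g. PSHUFB) performs 16 of these tiny-table lookups in one instruction, which is what a general-purpose autovectorizer won't derive on its own.

```rust
/// Russian-peasant multiplication in GF(2^8), reduced by
/// x^8 + x^4 + x^3 + x + 1 (0x11B).
fn gf256_mul(mut a: u8, mut b: u8) -> u8 {
    let mut p = 0u8;
    while b != 0 {
        if b & 1 != 0 { p ^= a; }
        let carry = a & 0x80 != 0;
        a <<= 1;
        if carry { a ^= 0x1B; }
        b >>= 1;
    }
    p
}

/// Multiply every byte of `data` by the constant `c` using two 16-entry
/// nibble tables -- the scalar analogue of the PSHUFB-based SIMD kernel.
fn mul_region_by_constant(c: u8, data: &[u8]) -> Vec<u8> {
    let mut lo = [0u8; 16];
    let mut hi = [0u8; 16];
    for i in 0..16u8 {
        lo[i as usize] = gf256_mul(c, i);       // products of low nibbles
        hi[i as usize] = gf256_mul(c, i << 4);  // products of high nibbles
    }
    data.iter()
        .map(|&x| lo[(x & 0x0F) as usize] ^ hi[(x >> 4) as usize])
        .collect()
}

fn main() {
    let data = [0u8, 1, 2, 0x53, 0xCA, 0xFF];
    let fast = mul_region_by_constant(0x1D, &data);
    let slow: Vec<u8> = data.iter().map(|&x| gf256_mul(0x1D, x)).collect();
    assert_eq!(fast, slow); // split-table lookup matches direct multiplication
}
```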

darrenldl commented 5 years ago

(Misclicked, sorry)

darrenldl commented 5 years ago

@rphmeier Just a heads up, I might not respond at all for the next 2-3 weeks; apologies in advance. I will try to check in on weekends or whenever I'm free, however.

Also the following modules seem to be missing in lib.rs

mod matrix_tests;
mod inversion_tree_tests;

(feel free to change where to include them, e.g. lib_tests.rs, if you feel that'd be better)


Thanks very much for your work, by the way. Let me know how you (and your colleagues) would like to be acknowledged in the README (or just add it to the PR).

rphmeier commented 5 years ago

It's a bit more idiomatic to have all tests in a tests sub-module -- I did a bit of restructuring (just finished) where, e.g., all matrix or inversion_tree tests are kept in the same file as the code they test. The lib tests I moved into a tests submodule in another file, because they were too big to put in lib.rs.

rphmeier commented 5 years ago

This can be closed as of #43