rust-rse / reed-solomon-erasure

[Looking for new owners/maintainers, see #88] Rust implementation of Reed-Solomon erasure coding
MIT License

SIMD for GF(2^16) #80

Open burdges opened 3 years ago

burdges commented 3 years ago

We invoke C code for SIMD in galois_8.rs but galois_16.rs merely uses galois_8.rs.
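For context, galois_16.rs builds GF(2^16) as a quadratic extension of GF(2^8). Below is a minimal scalar sketch of that structure; the GF(2^8) polynomial (0x11d) and the reduction constants P1/P0 are illustrative placeholders, not necessarily the ones this crate uses.

```rust
/// Scalar GF(2^8) multiply, Russian-peasant style, reduced by the common
/// Reed-Solomon polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11d).
fn gf8_mul(mut a: u8, mut b: u8) -> u8 {
    let mut acc = 0u8;
    while b != 0 {
        if (b & 1) != 0 {
            acc ^= a;
        }
        let carry = (a & 0x80) != 0;
        a <<= 1;
        if carry {
            a ^= 0x1d; // reduce modulo 0x11d
        }
        b >>= 1;
    }
    acc
}

/// GF(2^16) element as a degree-1 polynomial a1*x + a0 over GF(2^8),
/// reduced modulo an irreducible x^2 + P1*x + P0 over GF(2^8).
/// P1/P0 are placeholders; any irreducible quadratic works, and the
/// crate's actual constants may differ.
const P1: u8 = 1;
const P0: u8 = 0x20;

fn gf16_ext_mul((a1, a0): (u8, u8), (b1, b0): (u8, u8)) -> (u8, u8) {
    let hh = gf8_mul(a1, b1);                   // coefficient of x^2
    let hl = gf8_mul(a1, b0) ^ gf8_mul(a0, b1); // coefficient of x
    let ll = gf8_mul(a0, b0);                   // constant term
    // x^2 = P1*x + P0 in characteristic 2, so fold the x^2 term back down.
    (hl ^ gf8_mul(hh, P1), ll ^ gf8_mul(hh, P0))
}
```

Every GF(2^16) multiply done this way costs several GF(2^8) multiplies, which is why the SIMD story for galois_16.rs matters.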

There is a more bespoke approach to GF(2^16) in Screaming Fast Galois Field Arithmetic Using Intel SIMD Instructions by James Plank, Kevin Greenan, and Ethan Miller.

@drskalman You looked back into the gf-complete C code recently, yes? I think https://github.com/ceph/gf-complete/blob/master/src/gf_w16.c implements their proposed SIMD for GF(2^16), so you can tell if it's faster?

Replaces #48 I think

AndersTrier commented 7 months ago

> There is a more bespoke approach to GF(2^16) in Screaming Fast Galois Field Arithmetic Using Intel SIMD Instructions by James Plank, Kevin Greenan, and Ethan Miller.

I've implemented exactly that in pure Rust for AVX2, SSSE3 and Neon. Feel free to draw inspiration from https://github.com/AndersTrier/reed-solomon-simd/tree/master/src/engine
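For readers following along, here is a scalar sketch of the nibble-split decomposition those SIMD kernels build on; the reduction polynomial (0x1100B) and the helper names are illustrative, not taken from reed-solomon-simd.

```rust
/// Carry-less GF(2^16) multiply reduced by x^16 + x^12 + x^3 + x + 1
/// (0x1100B, one common choice of irreducible polynomial).
fn gf16_mul(mut a: u16, mut b: u16) -> u16 {
    let mut acc = 0u16;
    while b != 0 {
        if (b & 1) != 0 {
            acc ^= a;
        }
        let carry = (a & 0x8000) != 0;
        a <<= 1;
        if carry {
            a ^= 0x100B;
        }
        b >>= 1;
    }
    acc
}

/// For a fixed factor `y`, build four 16-entry tables so that
/// y * x == T[0][x & 0xF] ^ T[1][(x >> 4) & 0xF]
///        ^ T[2][(x >> 8) & 0xF] ^ T[3][x >> 12].
/// This is the decomposition behind the Plank/Greenan/Miller kernels:
/// with SSSE3/NEON each 16-entry table is served by one byte shuffle
/// (PSHUFB / TBL), split into low-byte and high-byte halves.
fn build_tables(y: u16) -> [[u16; 16]; 4] {
    let mut t = [[0u16; 16]; 4];
    for shift in 0..4 {
        for nibble in 0..16u16 {
            t[shift][nibble as usize] = gf16_mul(nibble << (4 * shift), y);
        }
    }
    t
}

fn mul_by_const(t: &[[u16; 16]; 4], x: u16) -> u16 {
    t[0][(x & 0xF) as usize]
        ^ t[1][((x >> 4) & 0xF) as usize]
        ^ t[2][((x >> 8) & 0xF) as usize]
        ^ t[3][(x >> 12) as usize]
}
```

The decomposition works because multiplication by a constant is linear over GF(2), so the products of the four nibble slices can simply be XORed together.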

burdges commented 7 months ago

At some point I figured out that the approach used by this crate winds up being obsolete: https://hackmd.io/@rgbPIkIdTwSICPuAq67Jbw/r1B8eRC1u

In https://github.com/paritytech/reed-solomon-novelpoly we ported the leopard C code to Rust, which uses additive FFTs and the formal derivative trick, and fixed its handling of GF(2^16) and low rates. AFAIK everyone else who cares about erasure coding performance uses the leopard C code directly.
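As a reminder of what the formal derivative looks like in characteristic 2 (leopard applies the idea in its transformed polynomial basis, so this plain-coefficient version is only illustrative):

```rust
/// Formal derivative of a polynomial over a binary field, with coeffs[i]
/// the coefficient of x^i. The derivative term is i * a_i * x^(i-1), and
/// in characteristic 2 the factor i vanishes for even i, so only
/// odd-degree terms survive.
fn formal_derivative(coeffs: &[u16]) -> Vec<u16> {
    coeffs
        .iter()
        .enumerate()
        .skip(1)
        .map(|(i, &c)| if i % 2 == 1 { c } else { 0 })
        .collect()
}
```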

In https://github.com/w3f/rs-ec-perf we had various optimization ideas, some for which we made only half-assed attempts (like GFNI), and others we never even tried; other priorities won out.

I'd conjecture the optimization really worth doing would be GF(2^12) as a degree-three extension over GF(2^4), with the extension arithmetic implemented using the multi log table form. Maybe also GF(2^16) as a degree-four extension over GF(2^4) using the multi log table form, but that becomes really nasty. Instead of big tables, these need only a handful of GF(2^4) tables, so they avoid cache pressure. It's possible this outperforms in production, even if the big-table variants outperform in pure erasure coding benchmarks.
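A rough sketch of what that extension-field layout could look like: the GF(2^4) log/exp tables are tiny, while the cubic reduction constants below are placeholders that a real implementation would have to choose and verify for irreducibility.

```rust
/// GF(2^4) log/exp tables generated from the primitive polynomial
/// x^4 + x + 1 (0x13). These few dozen bytes are the whole
/// "multi log table" footprint, versus a 64 KiB table for GF(2^16).
fn gf4_tables() -> ([u8; 16], [u8; 30]) {
    let mut exp = [0u8; 30];
    let mut log = [0u8; 16];
    let mut x = 1u8;
    for i in 0..15 {
        exp[i] = x;
        log[x as usize] = i as u8;
        x <<= 1;
        if (x & 0x10) != 0 {
            x ^= 0x13;
        }
    }
    for i in 15..30 {
        exp[i] = exp[i - 15]; // duplicate so lookups never need a mod 15
    }
    (log, exp)
}

fn gf4_mul(log: &[u8; 16], exp: &[u8; 30], a: u8, b: u8) -> u8 {
    if a == 0 || b == 0 {
        0
    } else {
        exp[(log[a as usize] + log[b as usize]) as usize]
    }
}

/// GF(2^12) as polynomials over GF(2^4): a[i] is the coefficient of x^i,
/// reduced modulo x^3 + C2*x^2 + C1*x + C0. The constants are placeholders;
/// any irreducible cubic over GF(2^4) works.
const C2: u8 = 0;
const C1: u8 = 1;
const C0: u8 = 2;

fn gf12_mul(log: &[u8; 16], exp: &[u8; 30], a: [u8; 3], b: [u8; 3]) -> [u8; 3] {
    let m = |x, y| gf4_mul(log, exp, x, y);
    // Schoolbook product: degrees 0..=4.
    let mut c = [0u8; 5];
    for i in 0..3 {
        for j in 0..3 {
            c[i + j] ^= m(a[i], b[j]);
        }
    }
    // Fold degrees 4 and 3 back down using x^3 = C2*x^2 + C1*x + C0.
    for i in (3..=4).rev() {
        let t = c[i];
        c[i] = 0;
        c[i - 1] ^= m(t, C2);
        c[i - 2] ^= m(t, C1);
        c[i - 3] ^= m(t, C0);
    }
    [c[0], c[1], c[2]]
}
```

Whether the extra GF(2^4) multiplies per symbol beat the cache misses of big-table GF(2^16) is exactly the production-versus-microbenchmark question raised above.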