optimise some of the bounds checks

This brings a performance improvement of 20-30%.

Where possible, compiler is aided to optimise away the bounds checks without any unsafe code. No unsafe code was used.

This PR does not touch AVX code, because when testing, I did not see a noticeable improvement for that case.

Numbers before:

~~~ [ Benchmark case: 1000000 bytes ] ~~~
Encode RUST (10 cycles): 397.182 ms
Decode RUST (10 cycles): 961.006 ms
Encode C++ (10 cycles): 221.003 ms
Decode C++ (10 cycles): 572.489 ms

~~~ [ Benchmark case: 2500000 bytes ] ~~~
Encode RUST (10 cycles): 1018.27 ms
Decode RUST (10 cycles): 2424.6 ms
Encode C++ (10 cycles): 602.386 ms
Decode C++ (10 cycles): 1459.45 ms

~~~ [ Benchmark case: 5000000 bytes ] ~~~
Encode RUST (10 cycles): 2043.65 ms
Decode RUST (10 cycles): 4813.27 ms
Encode C++ (10 cycles): 1208.36 ms
Decode C++ (10 cycles): 2892.14 ms

~~~ [ Benchmark case: 10000000 bytes ] ~~~
Encode RUST (10 cycles): 4095.26 ms
Decode RUST (10 cycles): 9.61965 s
Encode C++ (10 cycles): 2416.67 ms
Decode C++ (10 cycles): 5.76792 s

Numbers now:

~~~ [ Benchmark case: 1000000 bytes ] ~~~
Encode RUST (10 cycles): 335.291 ms -> 18.5% better than master
Decode RUST (10 cycles): 739.528 ms -> 30% better than master
Encode C++ (10 cycles): 211.608 ms
Decode C++ (10 cycles): 562.939 ms

~~~ [ Benchmark case: 2500000 bytes ] ~~~
Encode RUST (10 cycles): 855.59 ms -> 19% better than master
Decode RUST (10 cycles): 1830.64 ms -> 32% better than master
Encode C++ (10 cycles): 559.997 ms
Decode C++ (10 cycles): 1434.1 ms

~~~ [ Benchmark case: 5000000 bytes ] ~~~
Encode RUST (10 cycles): 1730.34 ms -> 18% better than master
Decode RUST (10 cycles): 3633.53 ms -> 32% better than master
Encode C++ (10 cycles): 1177.38 ms
Decode C++ (10 cycles): 2863.14 ms

~~~ [ Benchmark case: 10000000 bytes ] ~~~
Encode RUST (10 cycles): 3475.36 ms -> 17.8% better than master
Decode RUST (10 cycles): 7.25712 s -> 32.5% better than master
Encode C++ (10 cycles): 2372.91 ms
Decode C++ (10 cycles): 5.7262 s

The only thing preventing from this implementation being as fast as kagome's C++ impl are the few lines annotated with:

// TODO: Optimising bounds checks on this line will yield a great performance improvement.

I couldn't yet manage to get the compiler to ellide the bounds checks in those cases. Another way of achieving this would be to add a bit of unsafe code.

paritytech / reed-solomon-novelpoly

optimise some of the bounds checks #34