Compare with kagome's impl

ordian commented 1 year ago

alindima commented 11 months ago

I ran kagome's benchmarks on top of the optimisation in https://github.com/paritytech/reed-solomon-novelpoly/pull/24, without taking advantage of our SIMD optimisation. Note that this is a synthetic benchmark, only isolating the reed-solomon encode + decode procedures.

num_validators: 100 (the number of chunks) Apple M2 Pro with 32 Gib RAM

~~~ [ Benchmark case: 303 bytes ] ~~~
Encode RUST (100 cycles): 748 us
Decode RUST (100 cycles): 56.005 ms
Encode C++ (100 cycles): 530 us
Decode C++ (100 cycles): 30.887 ms

~~~ [ Benchmark case: 5007 bytes ] ~~~
Encode RUST (100 cycles): 8.986 ms
Decode RUST (100 cycles): 73.672 ms
Encode C++ (100 cycles): 5.406 ms
Decode C++ (100 cycles): 42.62 ms

~~~ [ Benchmark case: 100015 bytes ] ~~~
Encode RUST (100 cycles): 172.488 ms
Decode RUST (100 cycles): 429.697 ms
Encode C++ (100 cycles): 99.037 ms
Decode C++ (100 cycles): 279.038 ms

~~~ [ Benchmark case: 1000015 bytes ] ~~~
Encode RUST (100 cycles): 1723.59 ms
Decode RUST (100 cycles): 3842.43 ms
Encode C++ (100 cycles): 1008.16 ms
Decode C++ (100 cycles): 2571.68 ms

~~~ [ Benchmark case: 10000015 bytes ] ~~~
Encode RUST (100 cycles): 25.0707 s
Decode RUST (100 cycles): 39.4162 s
Encode C++ (100 cycles): 16.0018 s
Decode C++ (100 cycles): 26.1704 s

In a nutshell, according to this benchmark, the C++ implementation is:

encode 303 bytes: 41% faster
decode 303 bytes: 86% faster
encode 5007 bytes: 66% faster
decode 5007 bytes: 73% faster
encode ~100Kib: 73% faster
decode ~100Kib: 53% faster
encode ~1Mib: 70% faster
decode ~1Mib: 49% faster
encode ~10Mib: 56% faster
decode ~10Mib: 50% faster

As you can see, in general, regardless of the implementation, decoding time is much more significant than encoding time. Currently, parity's polkadot implementation only does decoding for data that is larger than 128Kib. Therefore, I would say that in the most important metrics, for decoding 1 Mib and decoding 10 Mib, the C++ implementation is 50% faster (not twice as fast).

However, optimisations such as systematic recovery will almost completely deny large-scale usage of the decoding procedure. Decoding from systematic chunks is 4000% faster than regular reconstruction. Therefore, after systematic recovery, the most significant metric will be encoding 1Mib and 10 Mib, which is still 50% to 70% faster with the C++ implementation.

Still, we need to measure how significant this difference is in a real-world scenario, where we test the entire availability-recovery process. I have used a modified version of https://github.com/paritytech/polkadot-sdk/compare/sandreim/subsystem-bench plus extra plumbing to use the C++ implementation to test this. It simulates a network with a max throughput and peer latencies and outputs metrics for the availability-recovery subsystem.

I'll post results soon

eskimor commented 11 months ago

Also, how it relates to validator count. Target is 1000. Does not really matter if one implementation is faster at 100, if it is slower with the number we are actually interested in.

sandreim commented 11 months ago

We need to test against some numbers we'd use in real world, 300, 500, 1000 are targets for us for example.

I've run our impl against these n_validators and turns out this reed solomon impl is fastest at 1000 validators

Posting this CPU burn chart to showcase that. This is a sequence run with (300, 500, 1000 valiators), you can spot which is better.

alindima commented 11 months ago

As suggested, reran the synthetic benchmark with 1000 validators.

Apple M2 Pro with 32 Gib RAM

~~~ [ Benchmark case: 303 bytes ] ~~~
Encode RUST (100 cycles): 2715 us
Decode RUST (100 cycles): 57.414 ms
Encode C++ (100 cycles): 2308 us
Decode C++ (100 cycles): 31.688 ms

~~~ [ Benchmark case: 5007 bytes ] ~~~
Encode RUST (100 cycles): 12.043 ms
Decode RUST (100 cycles): 76.579 ms
Encode C++ (100 cycles): 8.092 ms
Decode C++ (100 cycles): 43.482 ms

~~~ [ Benchmark case: 100015 bytes ] ~~~
Encode RUST (100 cycles): 197.27 ms
Decode RUST (100 cycles): 472.024 ms
Encode C++ (100 cycles): 117.672 ms
Decode C++ (100 cycles): 296.01 ms

~~~ [ Benchmark case: 1000015 bytes ] ~~~
Encode RUST (100 cycles): 1985.79 ms
Decode RUST (100 cycles): 4168.09 ms
Encode C++ (100 cycles): 1303.88 ms
Decode C++ (100 cycles): 2693.22 ms

~~~ [ Benchmark case: 2500015 bytes ] ~~~
Encode RUST (100 cycles): 4912.52 ms
Decode RUST (100 cycles): 10.4072 s
Encode C++ (100 cycles): 3014.82 ms
Decode C++ (100 cycles): 6.76459 s

~~~ [ Benchmark case: 5000015 bytes ] ~~~
Encode RUST (100 cycles): 9.93118 s
Decode RUST (100 cycles): 20.9447 s
Encode C++ (100 cycles): 6.12273 s
Decode C++ (100 cycles): 13.7247 s

~~~ [ Benchmark case: 10000015 bytes ] ~~~
Encode RUST (100 cycles): 27.7549 s
Decode RUST (100 cycles): 42.5262 s
Encode C++ (100 cycles): 19.7805 s
Decode C++ (100 cycles): 27.4793 s

C++ implementation is:

encode 303 bytes: 17% faster
decode 303 bytes: 83% faster
encode 5007 bytes: 48% faster
decode 5007 bytes: 76% faster
encode ~100Kib: 68% faster
decode ~100Kib: 59% faster
encode ~1Mib: 52% faster
decode ~1Mib: 54% faster
encode ~2.5Mib: 63% faster
decode ~2.5Mib: 55% faster
encode ~5Mib: 62% faster
decode ~5Mib: 52% faster
encode ~10Mib: 40% faster
decode ~10Mib: 54% faster

I reach the same conclusion, that in the synthetic benchmark, C++ implementation is 40-60% faster.

alindima commented 11 months ago

I also have the results of running with subsystem-bench with the following scenarios:

Regular chunk recovery in perfect network conditions
Systematic chunk recovery in perfect network conditions
Systematic chunk recovery with 1-100ms network latency

Same system, Apple M2 Pro with 32 Gib RAM, without our SIMD optimisations

Regular chunk recovery in perfect network conditions

Config:

TestConfiguration:
- objective: !DataAvailabilityRead
    fetch_from_backers: false
  n_validators: 1000
  n_cores: 40
  min_pov_size: 5120 (5Mib)
  max_pov_size: 5120 (5Mib)
  peer_bandwidth: 52428800
  bandwidth: 52428800
  latency:
    min_latency:
      secs: 0
      nanos: 0
    max_latency:
      secs: 0
      nanos: 0
  error: 0
  num_blocks: 5

Note that this simulates perfect network conditions (0 errors, 0 latency and enough bandwidth)

C++ Results:

CPU usage per block 10.17s

Rust results:

CPU usage per block 15.24s

C++ implementation consumes 50% less CPU when doing regular chunk recovery.

Systematic chunk recovery in perfect network conditions

Same configuration as regular recovery.

C++ Results:

CPU usage per block 4.49s

Rust results:

CPU usage per block 5.94s

C++ implementation consumes 32% less CPU when doing regular chunk recovery. This highlights only the encoding performance, as systematic recovery skips the decoding step.

Systematic chunk recovery in with 1-100ms network latency

The configuration difference is that per-request latency is added as a random number between 1ms and 100ms.

C++ Results:

CPU usage per block 4.97s

Rust results:

CPU usage per block 6.16s

C++ implementation consumes 23% less CPU when doing regular chunk recovery with added network latency.

Conclusion

kagome's performance advantage of 40-70% diminishes significantly as we implement systematic recovery and benchmark in realistic scenarios where network latency and errors exist. Even though it's faster and more efficient in theory, the real advantage as measured by me with systematic recovery and 1-100ms network latency drops to about 23%.

It could still be worthwhile to see if we can find the key differences between the implementations and why one is faster than the other. However, given these results and the work on systematic recovery, it's not fully obvious to me that availability-recovery will remain a bottleneck for scaling polkadot.

I'd like to also benchmark with our latest AVX implementation and see how that looks.

alindima commented 11 months ago

Ran roughly the same benchmarks with our rust impl with AVX.

Machine: GCP c2-standard-8

In kagome's synthetic benchmark:

Rust AVX implementation is just as fast as non-avx rust implementation for encoding.
Rust AVX implementation is 5-10% faster than non-avx rust implementation for decoding.
For very small data (tried with 300 and 5000 bytes),the avx implementation is actually 16-50% faster than CPP. However, this is not the case we are trying to optimise for.
The rest of proportions by which CPP is faster than rust non-avx are roughly maintained for the avx implementation.

In susbsystem-bench:

Compared to non-avx rust impl, the avx rust impl consumes 10% less CPU when doing systematic recovery in perfect network conditions.
Compared to CPP impl, the avx rust impl cosumes 23% more CPU when doing systematic recovery in perfect network conditions.

My conclusion is that when comparing systematic recovery with our AVX implementation, the CPP implementation consumes 23% less CPU in perfect network conditions. I expect this percentage to drop further in real network conditions with network latency and errors.

alindima commented 11 months ago

I've been comparing the two impls (rust and C++) in search for any differences. The two implementations are pretty much identical in terms of code. The only differences I could find where that the C++ one uses some local/thread-local variables instead of our global ones. I've modified our version and still no difference.

I've run the benchmarks under valgrind massif and there's an immense difference in terms of allocations/deallocations and memory usage. However, peak memory usage is identical.

for encoding 2.5 Mib of data, C++ allocates in total about 23.64 Mib. For the same procedure, the rust impl allocates three times the data: 61.91 Mib. Note that this is not peak usage, this is total usage. Another interesting fact is that the rust implementation does 10 times as much alloc/dealloc calls. Despite this, the number of total brk (the syscall used by the glibc allocator for growing the heap) calls is exactly the same.

After doing some micro-optimisations (mainly switching from iterators to for loops), I managed to get our impl to only do 5x more allocations and allocate only 2 times more data than the C++ one. However, this only resulted in a ~8% performance increase compared to the previous rust impl. Therefore, this can't be the only place where the inefficiency stems from.

I also compared cache misses (with cachegrind) and there is no significant difference that could justify performance difference.

I am starting to think that the C++ compiler is simply much better.

Another interesting fact is that the benchmark binary with rust is 36 times as large as the cpp one (with debug symbols stripped)

sandreim commented 11 months ago

I've been comparing the two impls (rust and C++) in search for any differences. The two implementations are pretty much identical in terms of code. The only differences I could find where that the C++ one uses some local/thread-local variables instead of our global ones. I've modified our version and still no difference.

I've run the benchmarks under valgrind massif and there's an immense difference in terms of allocations/deallocations and memory usage. However, peak memory usage is identical.

for encoding 2.5 Mib of data, C++ allocates in total about 23.64 Mib. For the same procedure, the rust impl allocates three times the data: 61.91 Mib. Note that this is not peak usage, this is total usage. Another interesting fact is that the rust implementation does 10 times as much alloc/dealloc calls. Despite this, the number of total brk (the syscall used by the glibc allocator for growing the heap) calls is exactly the same.

Thanks, this is quite interesting that an order of magnitude more allocations don't make up for anything in perf numbers.

After doing some micro-optimisations (mainly switching from iterators to for loops), I managed to get our impl to only do 5x more allocations and allocate only 2 times more data than the C++ one. However, this only resulted in a ~8% performance increase compared to the previous rust impl. Therefore, this can't be the only place where the inefficiency stems from.

8% is still a win, please publish the PR :)

I also compared cache misses (with cachegrind) and there is no significant difference that could justify performance difference.

Not sure how cachegrind works, but we should try this with intel perf counters on a bare metal machine.

I am starting to think that the C++ compiler is simply much better.

Another interesting fact is that the benchmark binary with rust is 36 times as large as the cpp one (with debug symbols stripped)

I can't believe it really is 30% better 😄 , we might still be missing something. I suggest to also look when building the binary which CPU we are targetting, maybe ”-C target-cpu=native”

alindima commented 11 months ago

8% is still a win, please publish the PR :)

Yes. I'll do that.

Not sure how cachegrind works, but we should try this with intel perf counters on a bare metal machine.

Yes, I'll do that too if I can get one.

maybe ”-C target-cpu=native”

Already did that, no difference

alindima commented 11 months ago

I opened a PR with the optimisation that brings the 8% improvement: https://github.com/paritytech/reed-solomon-novelpoly/pull/31

Initially, I thought it may also be thanks to a change that I had locally that cut allocations by 30% (apparently, that makes no difference).

I also explored whether it's a memory alignment issue, considering that we operate on vectors of u8 but the inner algorithm expects vector of u16. I tried using slice::align_to() - the vector was aligned for 2 bytes as well, so no issue here

alindima commented 11 months ago

I've ran the two encode-only implementations with perf to get some values of hardware performance counters.

For a reported duration difference of about 18% (cpp being the faster):

our impl executes almost twice as many instructions. considering that the algorithm is conceptually the same, I am inclined to believe this is because of better compiler optimisations for C++.
about 10% more total CPU cycles consumed by our rust impl.
rust has ~2.86 instructions per cycle, cpp has ~1.87 instructions per cycle
cpp has almost no misaligned cache accesses (0-8). rust has between 900-2000 misaligned accesses when encoding 2.5 Mib of data. According to the AMD kernel dev manual: These are accesses which cross a 64-byte boundary. They incur an extra cache access, and an extra cycle of latency on reads. Furthermore, misaligned atomic cache accesses trigger a memory bus lock (known as split lock). No such bus locks are reported for neither of the implementations.

alindima commented 11 months ago

I have some great news!

I successfully root-caused the big performance difference between the two implementations. Looking at the generated assembly for the afft and inverse_afft functions, I noticed a key difference being the bounds checks that rust adds to slice indexing.

After replacing regular panick-ing [] operator with unsafe {slice.get_unchecked()}, I was pleased to see performance being almost on par. Since the library performs a great number of slice indexing in loops, it makes sense that these add up and could also prevent compiler optimisations.

Turns out that indeed the rust safety guarantees were getting in the way. I think I can come up with an acceptable PR that relies on one-time assert!s for guaranteeing slice length and no out-of-bounds access. Will do that soon.

Here are the numbers on a c2-standard-8 with no AVX with the synthetic benchmark:

~~~ [ Benchmark case: 100000 bytes ] ~~~
Encode RUST (100 cycles): 214.619 ms
Decode RUST (100 cycles): 619.778 ms
Encode C++ (100 cycles): 207.604 ms
Decode C++ (100 cycles): 576.653 ms

~~~ [ Benchmark case: 1000000 bytes ] ~~~
Encode RUST (100 cycles): 2286.93 ms
Decode RUST (100 cycles): 5.26045 s
Encode C++ (100 cycles): 2313.82 ms
Decode C++ (100 cycles): 5.23196 s

~~~ [ Benchmark case: 2500000 bytes ] ~~~
Encode RUST (100 cycles): 5.5359 s
Decode RUST (100 cycles): 12.9864 s
Encode C++ (100 cycles): 6.09832 s
Decode C++ (100 cycles): 12.9724 s

~~~ [ Benchmark case: 5000000 bytes ] ~~~
Encode RUST (100 cycles): 11.846 s
Decode RUST (100 cycles): 25.8033 s
Encode C++ (100 cycles): 12.2738 s
Decode C++ (100 cycles): 25.7331 s

~~~ [ Benchmark case: 10000000 bytes ] ~~~
Encode RUST (100 cycles): 24.1998 s
Decode RUST (100 cycles): 51.4948 s
Encode C++ (100 cycles): 25.3328 s
Decode C++ (100 cycles): 51.4026 s

I'd say they are within the noise threshold of each other.

I'll run with AVX soon and post the result. I'm confident we'll be even faster.

Moreover, when running the subsystem-bench on avilability-recovery with no AVX we're already consuming around 10% less CPU with the rust implementation.

alindima commented 11 months ago

Posted PR for optimising/removing bounds checks for the non-avx code path (the one present in production): https://github.com/paritytech/reed-solomon-novelpoly/pull/34. I'll post another PR for the avx feature

alindima commented 11 months ago

I added similar bounds checks changes to the AVX code paths and it's actually a bit slower than the regular non-avx encoding for large data. It's only faster for small data sizes, up until around 100 Kib, which are the less significant (because the duration scales with the data size).

alindima commented 10 months ago

I think we can close this now, as we discovered the underlying difference. It was partly fixed in https://github.com/paritytech/reed-solomon-novelpoly/pull/34.

There are still a couple of bounds checks that I didn't manage to optimise without unsafe code. Left TODOs in the code for them.

burdges commented 10 months ago

Isn't 10mb our theoreticaly parachain blocksize? I think 50s sound way too slower than desired. This is single threaded?

We researchers should rereview this code for performance from a theory level, but almost surely this comes from cache pressure. Can you tell me the performacne if the cache is preppoulated? Aka run it a couple times and then run the timed instances without doing anything to clear cache.

ordian commented 10 months ago

I think the absolute numbers in the C++ benchmark are not per cycle (so divide by 100). On my machine, encoding is around 50MiB/s for 5MiB data, so around 100ms

burdges commented 10 months ago

Awsome, thanks. We should still tease out how much we lose to cash pressure, since that impacts our real performancein the node, so many another 10x hiding somewhere.

Already 50 MiB/s sounds compeditive with software AES on 10 year old hardware, so that's pretty solid. AES has a lookup table, but only a very tiny one.

I'd expected decode to be faster since it's AFFT has like half the size. It needs a setup phase, but this should amortize away for larger blocksizes. I've maybe forgotten some inefficency here, but it's really doing an AFFT of half the size.

paritytech / reed-solomon-novelpoly