ralexstokes / ssz-rs

Implementation of ethereum's `ssz`
Apache License 2.0
103 stars 41 forks source link

Support `hashtree` crate with feature, add Criterion benchmarks #161

Open mempirate opened 2 months ago

mempirate commented 2 months ago

Implements a new hash_chunks method that is used everywhere and will vary on implementation based on the features enabled.

Related to #156 but further optimizations are possible.

Notes

Benchmarks

sha2 (default)

hash_tree_root_sha256   time:   [49.648 µs 49.814 µs 50.056 µs]
                        change: [-0.3203% +0.5809% +1.7785%] (p = 0.34 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

Benchmarking generate_proof_sha256: Warming up for 3.0000 s
generate_proof_sha256   time:   [811.00 ms 815.02 ms 819.58 ms]
Found 11 outliers among 100 measurements (11.00%)
  6 (6.00%) high mild
  5 (5.00%) high severe

sha2-asm

hash_tree_root_sha256-asm
                        time:   [10.242 µs 10.252 µs 10.263 µs]
                        change: [+0.3393% +0.5229% +0.6984%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe

Benchmarking generate_proof_sha256-asm: Warming up for 3.0000 s
generate_proof_sha256-asm
                        time:   [132.81 ms 133.67 ms 135.02 ms]
Found 11 outliers among 100 measurements (11.00%)
  9 (9.00%) high mild
  2 (2.00%) high severe

hashtree

hash_tree_root_hashtree time:   [8.2075 µs 8.2200 µs 8.2385 µs]
Found 9 outliers among 100 measurements (9.00%)
  6 (6.00%) high mild
  3 (3.00%) high severe

Benchmarking generate_proof_hashtree: Warming up for 3.0000 s
generate_proof_hashtree time:   [111.12 ms 111.86 ms 112.93 ms]
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

In these benchmarks, hashtree is ~16% faster at generating proofs and ~20% faster at calculating hash tree roots compared to sha2-asm.

mempirate commented 2 months ago

With the latest commit getting rid of heap allocation for chunks and replacing it with an array, benchmarks were even better for hashtree:

hash_tree_root_hashtree time:   [7.8244 µs 7.8345 µs 7.8479 µs]
                        change: [-4.8425% -4.6332% -4.4083%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe

Benchmarking generate_proof_hashtree: Warming up for 3.0000 s
generate_proof_hashtree time:   [97.430 ms 97.634 ms 97.872 ms]
                        change: [-13.578% -12.719% -12.096%] (p = 0.00 < 0.05)
                        Performance has improved.
potuz commented 2 months ago

I think this shouldn't close #156 this PR creates a wrapper that hashes two chunks, not necessarily contiguous with hashtree. As such it only benefits from the padding block prescheduling of hashtree but none of the vectorized/parallel pipelining. This is why you are getting only 16%--20% benefits from it. When hashing two chunks at the time you also incur on copies and allocations for every chunk you hash, something that is utterly unnecessary in most situations.

The approach to actually gain from this library is to hash one entire layer at the time, by passing all the layer, contiguously allocated, to the chunks slice in the function https://github.com/prysmaticlabs/hashtree/blob/main/examples/basic_usage.rs#L36

You can hash a Merkle tree of base 2^{N+1} by allocating only 2^N for the first layer, and then rewriting that allocated layer on successive runs, by passing the same slice as chunks and out.

mempirate commented 2 months ago

@potuz agreed this doesn't resolve the whole issue. I will follow this up with another PR for the other performance improvements you mentioned.