supranational / blst

Multilingual BLS12-381 signature library
Apache License 2.0
454 stars 171 forks source link

Rust: Use rayon and better performance #203

Open nazar-pc opened 5 months ago

nazar-pc commented 5 months ago

This PR introduces a bunch of changes that target at better performance and higher code quality.

Performance

Before:

aggregate_verify/aggregate_verify/8                                                                            
                        time:   [959.44 µs 973.55 µs 988.39 µs]
aggregate_verify/aggregate_verify/16                                                                             
                        time:   [1.0332 ms 1.0502 ms 1.0672 ms]
aggregate_verify/aggregate_verify/32                                                                             
                        time:   [1.0678 ms 1.0764 ms 1.0854 ms]
aggregate_verify/aggregate_verify/64                                                                             
                        time:   [1.6251 ms 1.6368 ms 1.6503 ms]
aggregate_verify/aggregate_verify/128                                                                             
                        time:   [3.1520 ms 3.1741 ms 3.1962 ms]
verify_multi_aggregate/verify_multi_aggregate/8                                                                             
                        time:   [1.4169 ms 1.4476 ms 1.4795 ms]
verify_multi_aggregate/verify_multi_aggregate/16                                                                             
                        time:   [1.4360 ms 1.4616 ms 1.4874 ms]
verify_multi_aggregate/verify_multi_aggregate/32                                                                             
                        time:   [1.4743 ms 1.4874 ms 1.5005 ms]
verify_multi_aggregate/verify_multi_aggregate/64                                                                             
                        time:   [2.4119 ms 2.4302 ms 2.4486 ms]
verify_multi_aggregate/verify_multi_aggregate/128                                                                             
                        time:   [4.1066 ms 4.1342 ms 4.1630 ms]

After:

aggregate_verify/aggregate_verify/8                                                                            
                        time:   [932.61 µs 942.15 µs 952.58 µs]
aggregate_verify/aggregate_verify/16                                                                            
                        time:   [959.34 µs 968.31 µs 978.62 µs]
aggregate_verify/aggregate_verify/32                                                                             
                        time:   [1.0789 ms 1.0993 ms 1.1215 ms]
aggregate_verify/aggregate_verify/64                                                                             
                        time:   [1.6040 ms 1.6177 ms 1.6329 ms]
aggregate_verify/aggregate_verify/128                                                                             
                        time:   [2.7126 ms 2.7322 ms 2.7533 ms]
verify_multi_aggregate/verify_multi_aggregate/8                                                                             
                        time:   [1.2968 ms 1.3164 ms 1.3372 ms]
verify_multi_aggregate/verify_multi_aggregate/16                                                                             
                        time:   [1.3326 ms 1.3475 ms 1.3630 ms]
verify_multi_aggregate/verify_multi_aggregate/32                                                                             
                        time:   [1.4790 ms 1.5010 ms 1.5250 ms]
verify_multi_aggregate/verify_multi_aggregate/64                                                                             
                        time:   [2.1851 ms 2.2054 ms 2.2271 ms]
verify_multi_aggregate/verify_multi_aggregate/128                                                                             
                        time:   [3.6696 ms 3.6972 ms 3.7270 ms]

The reason is most likely due to higher parallelism and not using channels in those operations.

Rayon

Rayon doesn't always result in better performance than manual thread management (see https://github.com/rayon-rs/rayon/issues/1124), but changes in this PR do result in higher performance.

There are many great benefits to rayon though in the way it is used:

Various cleanups

I changed all but one channel usage with a cleaner code using Mutex, Arc usage was removed as unnecessary and some other bookkeeping was cleaned up after understanding what code does and rewriting intention in more idiomatic Rust code (see fn from for example). The only place where channel still remains is fn mult, but it is quite difficult code and I didn't have the will to decipher what it tries to do to get rid of channel, but hopefully examples in this PR will prompt someone to clean it up at some later point.

Some unsafe blocks with vec.set_len() are moved after memory is actually initialized and some questionable in terms of soundness places were refactored such that they are quite obviously sound now. Some remaining places can also be fixed, but require pointer arithmetic that might require newer versions of Rust, so I left TODOs there.

I also took liberty to refactor things like &q[0] to q.as_ptr() since while both result in the same pointer, the former is deceptive because it suggests to the reader that only the first element is used, while in practice the whole q is used by the function that is being called later (documentation is almost non-existent, so it wasn't obvious at first and I hope this improves readability).

no-threads

no-threads was a hack and non-idiomatic because it violates "additivity" assumptions of Rust features. Thankfully it is no longer necessary and can be safely removed, but since this is a breaking change (the only one in this PR to the best of my knowledge), I bumped crate version from 0.3.11 to 0.4.0.

How to review this PR

This PR is split into self-contained logically structured commits and it is recommended to review commits one by one instead of the complete diff, this way it is much easier to understand what was changed and why.

Let me know if there are any questions, happy to explain each line change in detail.

mratsim commented 5 months ago

The issue with Rayon or any dependencies is that it brings a lot of new dependencies that need to be audited or at least vetted. This is very undesirable in a cryptographic product securing billions of assets. It is especially critical when supply chain attacks are increasing in appeal.

Furthermore, a simple threadpool written with cryptographic/security application in mind and fully control by the dev is significantly easier to maintain and update if there is a security issue.

nazar-pc commented 5 months ago

The issue with Rayon or any dependencies is that it brings a lot of new dependencies that need to be audited or at least vetted. This is very undesirable in a cryptographic product securing billions of assets. It is especially critical when supply chain attacks are increasing in appeal.

rayon is a well known established library with short list of mandatory dependencies and it used by Rustc itself, so it is not like you're adding a lot of new stuff here.

Furthermore, a simple threadpool written with cryptographic/security application in mind and fully control by the dev is significantly easier to maintain and update if there is a security issue.

I generally agree with you on this one. However, as I have shown in the description already, there were some questionable decisions from soundness point of view around thread pool as it was used and performance was lower than it could have been.

You 100% can rewrite original code and make it faster without rayon, achieving rayon's features in a safe misuse-resistant way is non-trivial and I'd rather use something that it maintained by Rust experts for this purpose (check top contributors to rayon project, they are literally working on Rust itself) than seeing every library invent their own a bit flawed version of the same features that doesn't integrated as well with the rest of Rust ecosystem.

I would estimate risk of relying on rayon as very low, but decision is up to blst maintainers of course.

nazar-pc commented 5 months ago

Just confirmed that performance improvements here translate to performance improvements downstream in our usage usage of https://github.com/sifraitech/rust-kzg (uses blst) in https://github.com/subspace/subspace, both of which also take advantage of Rayon as well.

mratsim commented 5 months ago

Just confirmed that performance improvements here translate to performance improvements downstream in our usage usage of https://github.com/sifraitech/rust-kzg (uses blst) in https://github.com/subspace/subspace, both of which also take advantage of Rayon as well.

I don't know whether commitment, proof generation or verification is the bottleneck for your use-case but @sauliusgrigaitis might be able to point you on how to best use rust-kzg, i.e. either through native BLST or reimplemented+parallelized BLST (i.e. using their own multithreading strategy).

Also they've been working on a new set of benchmarks. I have my own in https://github.com/mratsim/constantine/pull/304#issuecomment-1844795359, but I don't cover parallel BLST.

nazar-pc commented 3 months ago

@dot-asm is there something I can do to interest you reviewing this?

nazar-pc commented 1 week ago

This os honestly frustrating. There are merge conflicts again and not review for clear performance and API improvements in half a year :confused:

@dot-asm any chance you can look into this any time soon? Any feedback at all would be nice.