supranational / blst

Multilingual BLS12-381 signature library
Apache License 2.0

Rust: Use rayon and better performance #203

Open nazar-pc opened 9 months ago

nazar-pc commented 9 months ago

This PR introduces a bunch of changes that target better performance and higher code quality.

Performance

Before:

aggregate_verify/aggregate_verify/8                                                                            
                        time:   [959.44 µs 973.55 µs 988.39 µs]
aggregate_verify/aggregate_verify/16                                                                             
                        time:   [1.0332 ms 1.0502 ms 1.0672 ms]
aggregate_verify/aggregate_verify/32                                                                             
                        time:   [1.0678 ms 1.0764 ms 1.0854 ms]
aggregate_verify/aggregate_verify/64                                                                             
                        time:   [1.6251 ms 1.6368 ms 1.6503 ms]
aggregate_verify/aggregate_verify/128                                                                             
                        time:   [3.1520 ms 3.1741 ms 3.1962 ms]
verify_multi_aggregate/verify_multi_aggregate/8                                                                             
                        time:   [1.4169 ms 1.4476 ms 1.4795 ms]
verify_multi_aggregate/verify_multi_aggregate/16                                                                             
                        time:   [1.4360 ms 1.4616 ms 1.4874 ms]
verify_multi_aggregate/verify_multi_aggregate/32                                                                             
                        time:   [1.4743 ms 1.4874 ms 1.5005 ms]
verify_multi_aggregate/verify_multi_aggregate/64                                                                             
                        time:   [2.4119 ms 2.4302 ms 2.4486 ms]
verify_multi_aggregate/verify_multi_aggregate/128                                                                             
                        time:   [4.1066 ms 4.1342 ms 4.1630 ms]

After:

aggregate_verify/aggregate_verify/8                                                                            
                        time:   [932.61 µs 942.15 µs 952.58 µs]
aggregate_verify/aggregate_verify/16                                                                            
                        time:   [959.34 µs 968.31 µs 978.62 µs]
aggregate_verify/aggregate_verify/32                                                                             
                        time:   [1.0789 ms 1.0993 ms 1.1215 ms]
aggregate_verify/aggregate_verify/64                                                                             
                        time:   [1.6040 ms 1.6177 ms 1.6329 ms]
aggregate_verify/aggregate_verify/128                                                                             
                        time:   [2.7126 ms 2.7322 ms 2.7533 ms]
verify_multi_aggregate/verify_multi_aggregate/8                                                                             
                        time:   [1.2968 ms 1.3164 ms 1.3372 ms]
verify_multi_aggregate/verify_multi_aggregate/16                                                                             
                        time:   [1.3326 ms 1.3475 ms 1.3630 ms]
verify_multi_aggregate/verify_multi_aggregate/32                                                                             
                        time:   [1.4790 ms 1.5010 ms 1.5250 ms]
verify_multi_aggregate/verify_multi_aggregate/64                                                                             
                        time:   [2.1851 ms 2.2054 ms 2.2271 ms]
verify_multi_aggregate/verify_multi_aggregate/128                                                                             
                        time:   [3.6696 ms 3.6972 ms 3.7270 ms]

The improvement most likely comes from higher parallelism and from not using channels in those operations.

Rayon

Rayon doesn't always result in better performance than manual thread management (see https://github.com/rayon-rs/rayon/issues/1124), but the changes in this PR do result in higher performance.

There are many other benefits to rayon in the way it is used here, though: the thread pool's size is controlled by the application, blst operations can be run inside a custom pool via .install(), and threads can be pinned as the application sees fit (all discussed further below).

Various cleanups

I replaced all but one channel usage with cleaner code using Mutex, removed Arc usage as unnecessary, and cleaned up some other bookkeeping after understanding what the code does and rewriting its intent in more idiomatic Rust (see fn from, for example). The only place where a channel remains is fn mult, but that code is quite difficult and I didn't have the will to decipher what it tries to do in order to get rid of the channel; hopefully the examples in this PR will prompt someone to clean it up at some later point.
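
To illustrate the pattern (a minimal sketch with made-up names, not the actual blst code): instead of pushing per-thread results through a channel and draining it on the caller's side, each scoped task merges its partial result into a shared accumulator behind a Mutex, which can be consumed directly once the scope guarantees all tasks have finished.

use std::sync::Mutex;

fn sum_of_squares(items: &[u64]) -> u64 {
    let acc = Mutex::new(0u64);
    rayon::scope(|scope| {
        for chunk in items.chunks(1024) {
            let acc = &acc;
            scope.spawn(move |_| {
                let partial: u64 = chunk.iter().map(|x| x * x).sum();
                // Merge into the shared accumulator instead of sending
                // through a channel.
                *acc.lock().unwrap() += partial;
            });
        }
    });
    // `rayon::scope` returns only after all spawned tasks complete,
    // so the Mutex can be consumed without further synchronization.
    acc.into_inner().unwrap()
}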

Some unsafe blocks with vec.set_len() were moved to after the memory is actually initialized, and some places of questionable soundness were refactored so that they are now quite obviously sound. Some remaining places could be fixed too, but they require pointer arithmetic that might need a newer version of Rust, so I left TODOs there.
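
For readers unfamiliar with the hazard, here is a minimal sketch of the set_len() ordering fix (hypothetical code; fill_values stands in for an FFI call that writes n elements through the pointer):

unsafe fn fill_values(ptr: *mut u64, n: usize) {
    for i in 0..n {
        ptr.add(i).write(i as u64);
    }
}

fn build(n: usize) -> Vec<u64> {
    let mut out = Vec::with_capacity(n);
    // Calling out.set_len(n) *before* the write would let safe code
    // observe uninitialized elements, e.g. if anything panics in between.
    unsafe {
        fill_values(out.as_mut_ptr(), n); // initialize all n elements first
        out.set_len(n); // only then expose them as part of the Vec
    }
    out
}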

I also took the liberty of refactoring things like &q[0] into q.as_ptr(): while both yield the same pointer, the former is deceptive because it suggests to the reader that only the first element is used, while in practice the whole of q is used by the function called later (documentation is almost non-existent, so this wasn't obvious at first, and I hope the change improves readability).
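
A tiny illustration (hypothetical FFI signature, not the real blst API):

extern "C" {
    fn consume_points(p: *const u64, n: usize);
}

fn call(q: &[u64]) {
    // Deceptive: reads as if only the first element were passed.
    // unsafe { consume_points(&q[0], q.len()) };
    // Clearer: the pointer plainly refers to the whole slice.
    unsafe { consume_points(q.as_ptr(), q.len()) };
}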

no-threads

no-threads was a hack and non-idiomatic because it violates the "additivity" assumption of Rust features. Thankfully it is no longer necessary and can be safely removed, but since this is a breaking change (the only one in this PR, to the best of my knowledge), I bumped the crate version from 0.3.11 to 0.4.0.
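
To spell out the additivity problem (hypothetical manifests; Cargo unifies features per crate across the whole dependency graph):

# crate_a wants single-threaded blst:
[dependencies]
blst = { version = "0.3", features = ["no-threads"] }

# crate_b, in the same binary, expects parallel blst:
[dependencies]
blst = "0.3"

Because features are unioned, blst gets built with no-threads enabled for both crates, and crate_b silently loses parallelism. Additive features may only ever add functionality, which is why a feature that removes threading is non-idiomatic.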

How to review this PR

This PR is split into self-contained, logically structured commits, and it is recommended to review the commits one by one instead of the complete diff; this makes it much easier to understand what was changed and why.

Let me know if there are any questions, happy to explain each line change in detail.

mratsim commented 9 months ago

The issue with Rayon, or any dependency, is that it brings a lot of new dependencies that need to be audited or at least vetted. This is very undesirable in a cryptographic product securing billions of assets, and it is especially critical as supply-chain attacks grow in appeal.

Furthermore, a simple threadpool written with cryptographic/security applications in mind and fully controlled by the developers is significantly easier to maintain and update if there is a security issue.

nazar-pc commented 9 months ago

> The issue with Rayon, or any dependency, is that it brings a lot of new dependencies that need to be audited or at least vetted. This is very undesirable in a cryptographic product securing billions of assets, and it is especially critical as supply-chain attacks grow in appeal.

rayon is a well-known, established library with a short list of mandatory dependencies, and it is used by rustc itself, so it is not as if you're adding a lot of new stuff here.

> Furthermore, a simple threadpool written with cryptographic/security applications in mind and fully controlled by the developers is significantly easier to maintain and update if there is a security issue.

I generally agree with you on this one. However, as I have shown in the description already, there were some questionable decisions from a soundness point of view around the thread pool as it was used, and performance was lower than it could have been.

You absolutely can rewrite the original code and make it faster without rayon, but achieving rayon's features in a safe, misuse-resistant way is non-trivial, and I'd rather use something maintained by Rust experts for this purpose (check the top contributors to the rayon project: they literally work on Rust itself) than see every library invent its own slightly flawed version of the same features that doesn't integrate as well with the rest of the Rust ecosystem.

I would estimate the risk of relying on rayon as very low, but the decision is up to the blst maintainers, of course.

nazar-pc commented 9 months ago

Just confirmed that the performance improvements here translate to performance improvements downstream in our usage of https://github.com/sifraitech/rust-kzg (which uses blst) in https://github.com/subspace/subspace, both of which also take advantage of rayon.

mratsim commented 9 months ago

> Just confirmed that the performance improvements here translate to performance improvements downstream in our usage of https://github.com/sifraitech/rust-kzg (which uses blst) in https://github.com/subspace/subspace, both of which also take advantage of rayon.

I don't know whether commitment, proof generation, or verification is the bottleneck for your use case, but @sauliusgrigaitis might be able to point you at how best to use rust-kzg, i.e. either through native BLST or reimplemented+parallelized BLST (i.e. using their own multithreading strategy).

Also they've been working on a new set of benchmarks. I have my own in https://github.com/mratsim/constantine/pull/304#issuecomment-1844795359, but I don't cover parallel BLST.

nazar-pc commented 7 months ago

@dot-asm is there something I can do to interest you in reviewing this?

nazar-pc commented 4 months ago

This is honestly frustrating. There are merge conflicts again and no review of clear performance and API improvements in half a year :confused:

@dot-asm any chance you can look into this any time soon? Any feedback at all would be nice.

dot-asm commented 3 months ago

As correctly suggested by @mratsim, the threadpool was chosen for having minimum dependencies. Another important reason is that the parallelization pattern is used in both the Rust and Go bindings, as a way to minimize maintenance cost: we only need to understand one pattern, and solving a problem applies to both implementations. It also gives better control by being more predictable. For example, consider

use std::sync::atomic::*;

fn main() {
    // Occupy one worker thread of rayon's global pool indefinitely.
    rayon::spawn(move || {
        loop {}
    });
    let counter = AtomicUsize::new(0);
    rayon::scope(|scope| {
        // spawn_broadcast runs the closure on every thread of the pool,
        // including the one stuck in the loop above, so the scope never ends.
        scope.spawn_broadcast(|_, _| {
            println!("{}", counter.fetch_add(1, Ordering::Relaxed));
        });
    });
    println!("Hello, {} cpus!", counter.load(Ordering::Acquire));
}

The key point is that scope.spawn_broadcast never finishes here. In other words, the invocations in question depend on progress being made by independent threads, which is unsustainable in a real-life application context.

As for benchmark performance: it appears to be system-dependent and even circumstantial. On some systems I failed to note significant improvements. On some it's indeed better, but only marginally. On some systems I can observe significant variations from run to run with either approach, with the original occasionally outperforming the suggested approach at some points. It is probably due to an unfortunate interplay with the OS scheduler and power management... Either way, it made me dive in until my eardrums ruptured, and I found that, if modified, the threadpool can perform way better on larger systems. Here is a sample from a 128-core one:

aggregate_verify/aggregate_verify/8                        
                        time:   [1.3980 ms 1.4023 ms 1.4076 ms]
aggregate_verify/aggregate_verify/16                        
                        time:   [1.4829 ms 1.4842 ms 1.4854 ms]
aggregate_verify/aggregate_verify/32                        
                        time:   [1.3983 ms 1.4174 ms 1.4407 ms]
aggregate_verify/aggregate_verify/64                        
                        time:   [2.3176 ms 2.3211 ms 2.3248 ms]
aggregate_verify/aggregate_verify/128                        
                        time:   [2.3265 ms 2.3358 ms 2.3456 ms]
verify_multi_aggregate/verify_multi_aggregate/8                        
                        time:   [2.1803 ms 2.1849 ms 2.1896 ms]
verify_multi_aggregate/verify_multi_aggregate/16                        
                        time:   [1.8280 ms 1.8365 ms 1.8486 ms]
verify_multi_aggregate/verify_multi_aggregate/32                        
                        time:   [2.3775 ms 2.3800 ms 2.3825 ms]
verify_multi_aggregate/verify_multi_aggregate/64                        
                        time:   [3.2475 ms 3.2542 ms 3.2602 ms]
verify_multi_aggregate/verify_multi_aggregate/128                        
                        time:   [3.2801 ms 3.3056 ms 3.3323 ms]

vs. this PR's (which is about the same as the main branch on average)

aggregate_verify/aggregate_verify/8                        
                        time:   [1.7770 ms 1.8037 ms 1.8274 ms]
aggregate_verify/aggregate_verify/16                        
                        time:   [1.9822 ms 2.0084 ms 2.0334 ms]
aggregate_verify/aggregate_verify/32                        
                        time:   [2.3289 ms 2.3367 ms 2.3444 ms]
aggregate_verify/aggregate_verify/64                        
                        time:   [2.8085 ms 2.8238 ms 2.8383 ms]
aggregate_verify/aggregate_verify/128                        
                        time:   [3.3212 ms 3.3420 ms 3.3630 ms]
verify_multi_aggregate/verify_multi_aggregate/8                        
                        time:   [2.3192 ms 2.3420 ms 2.3661 ms]
verify_multi_aggregate/verify_multi_aggregate/16                        
                        time:   [2.8061 ms 2.8209 ms 2.8352 ms]
verify_multi_aggregate/verify_multi_aggregate/32                        
                        time:   [2.9861 ms 3.0076 ms 3.0291 ms]
verify_multi_aggregate/verify_multi_aggregate/64                        
                        time:   [3.3539 ms 3.3698 ms 3.3875 ms]
verify_multi_aggregate/verify_multi_aggregate/128                        
                        time:   [5.1063 ms 5.1151 ms 5.1247 ms]

On smaller systems, such as personal computers, it's kind of all over the place. At some points the difference is not as significant, at others it is... If you want to test it on your system, add

[patch.crates-io]
threadpool = { git = "https://github.com/dot-asm/rust-threadpool", branch = "scheduling-agility" }

to your .cargo/config.toml. It needs more work and testing before bringing it to the maintainers' attention, but one can test... I mean, do test :-)

nazar-pc commented 3 months ago

> As correctly suggested by @mratsim, the threadpool was chosen for having minimum dependencies.

I understand the appeal of minimal dependencies, but as described above, rayon is a well-known library maintained by very experienced and respected engineers, so I don't think there is a huge concern in terms of security, compilation time, or the other issues typically attributed to a larger number of dependencies.

> Another important reason is that the parallelization pattern is used in both the Rust and Go bindings, as a way to minimize maintenance cost: we only need to understand one pattern, and solving a problem applies to both implementations.

This makes sense as well, but I should note that it shouldn't be a hard requirement, since the languages are in fact substantially different, and translating code directly from one language to another can and often does lead to awkward ergonomics and even performance issues. Not to mention that tricky unsafe code that only works correctly by accident is not good for maintenance: it is much easier to deal with safe code that can't lead to safety violations by definition, which is especially true during refactoring.

> The key point is that scope.spawn_broadcast never finishes here. In other words, the invocations in question depend on progress being made by independent threads, which is unsustainable in a real-life application context.

That code contains literally an infinite loop. It is a problem one way or another. The whole point is that those are NOT independent threads; they are specific threads on a fixed-size thread pool that is context-dependent.

It is a much more flexible approach, as I described in the original post, than either assuming the library can use all of the cores and letting the OS scheduler figure out where to run them, or forcing all usages onto a single CPU core, with nothing in between.

> As for benchmark performance: it appears to be system-dependent and even circumstantial. On some systems I failed to note significant improvements. On some it's indeed better, but only marginally. On some systems I can observe significant variations from run to run with either approach, with the original occasionally outperforming the suggested approach at some points. It is probably due to an unfortunate interplay with the OS scheduler and power management...

I consider this a good thing, because nothing regressed, and the change provides substantial flexibility that is not possible otherwise without modifying the library. I do notice large improvements in performance from pinning threads to cores on Threadripper/Epyc and even Ryzen 7/9 CPUs in many cases (crossing chiplets and sockets is sometimes very costly), yet currently, because the library uses its own independent thread pool, I can't do anything about it without forking.

> Either way, it made me dive in until my eardrums ruptured, and I found that, if modified, the threadpool can perform way better on larger systems. Here is a sample from a 128-core one:

Very nice! The changes make a lot of sense to me: in such contexts a channel is not the most efficient data structure, and a simple Mutex/Condvar pair is all you really need. I wouldn't be surprised if similar gains can be made in rayon as well.
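
For reference, a minimal sketch of such a Mutex/Condvar pair (a countdown latch; not the actual threadpool code):

use std::sync::{Condvar, Mutex};

struct Latch {
    remaining: Mutex<usize>,
    done: Condvar,
}

impl Latch {
    fn new(count: usize) -> Self {
        Self { remaining: Mutex::new(count), done: Condvar::new() }
    }

    // Each worker calls this when its share of the job is finished.
    fn count_down(&self) {
        let mut remaining = self.remaining.lock().unwrap();
        *remaining -= 1;
        if *remaining == 0 {
            self.done.notify_all();
        }
    }

    // The submitter blocks here until every worker has checked in,
    // with no per-job channel allocation.
    fn wait(&self) {
        let mut remaining = self.remaining.lock().unwrap();
        while *remaining > 0 {
            remaining = self.done.wait(remaining).unwrap();
        }
    }
}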

dot-asm commented 3 months ago

> > The key point is that scope.spawn_broadcast never finishes here. In other words, the invocations in question depend on progress being made by independent threads, which is unsustainable in a real-life application context.

> That code contains literally an infinite loop.

Come on, just imagine, say, a socket listener instead of the infinite loop. Or replace it with a one-second sleep and perform signature verification in the spawn_broadcast. Are you OK with the signature verification taking a full second? I'm not. A library simply must not bring this upon the user.

nazar-pc commented 3 months ago

> Come on, just imagine, say, a socket listener instead of the infinite loop.

You misunderstand what rayon is and what its use cases are. I'll quote the first line of their readme:

> Rayon is a data-parallelism library for Rust. It is extremely lightweight and makes it easy to convert a sequential computation into a parallel one.

If you use it for a socket listener, you're using it wrong. Rayon is for compute-intensive work; for I/O you'd use a regular thread or a dedicated I/O thread pool (which can in fact use rayon as well: with rayon you can run a piece of code in the context of a custom thread pool, in contrast to what blst currently does with its unconditional, uncontrollable embedded thread pool). For example, in our software we create custom dedicated rayon thread pools with at most 32 threads for I/O purposes, even on machines with 256 hardware threads available.

> Or replace it with a one-second sleep and perform signature verification in the spawn_broadcast. Are you OK with the signature verification taking a full second? I'm not. A library simply must not bring this upon the user.

This is another great example of using the wrong tool for the job. Neither of these examples is directly applicable to what this PR does. Yes, rayon will use the global thread pool by default, but you also have the option to create a custom one, as described in the very first post, so that random sleeps and I/O in other parts of the app don't fundamentally block any work being done by blst. The power is in the hands of the user of the library, power that the currently released blst doesn't expose.

dot-asm commented 3 months ago

The examples were obviously exaggerated to illuminate the thesis that scope.spawn_broadcast is considered an inappropriate choice for this library: we don't assume that the application makes calls from the same thread, and the trouble is that faster operations get serialized behind slower ones. In case you say that the current approach serializes operations too: well, not if you have, say, a 30-thread operation and a 2-thread operation on a 32-core system.

nazar-pc commented 3 months ago

> The examples were obviously exaggerated to illuminate the thesis that scope.spawn_broadcast is considered an inappropriate choice for this library.

Every single example you provided was a literal misuse of the library for things it is not designed for, while this PR uses it exactly for its direct purpose: relatively short-lived, parallelizable compute without sleeps or I/O. I have no idea how you came to the conclusion "considered an inappropriate choice for this library"; I can't connect the dots :confused:

rayon is a great choice for this exact purpose and is used widely, from the Rust compiler itself to numerous libraries in the Rust ecosystem. Yes, it is probably not the way you do things in C or Golang, but Rust is not C or Golang, so why not make the experience better for Rust developers?

nazar-pc commented 3 months ago

rayon is not a generic thread pool on which you spawn random blocking tasks; it is a fixed-size thread pool (whose size you control as the application developer) in which you can control thread affinity and other properties (again, as the application developer), used to parallelize compute-intensive operations.
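
Concretely, a sketch of that control (the pool size and workload here are made up):

use rayon::prelude::*;

fn main() {
    // The application, not the library, decides how many threads compute uses.
    let pool = rayon::ThreadPoolBuilder::new()
        .num_threads(8) // e.g. half the cores, chosen by the app
        .build()
        .unwrap();

    let inputs: Vec<u64> = (0..1024).collect();
    // Everything inside `install` parallelizes on this 8-thread pool,
    // regardless of what the global pool or other pools are doing.
    let sum: u64 = pool.install(|| inputs.par_iter().map(|x| x * x).sum());
    println!("{sum}");
}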

Being afraid that someone will use it for sockets or put useless sleeps on rayon threads, and using that as justification for an uncontrollable internal thread pool instead, is simply absurd :man_shrugging:

nazar-pc commented 3 months ago

To give you a very specific example, we use blst as an indirect dependency in https://github.com/autonomys/space-acres. In this desktop app we have an option to limit CPU usage to just half of the cores, so the user can keep the app running without experiencing a sluggish desktop due to high CPU utilization, while the app still makes decent progress. Right now we use blst from a fork with this PR included, so we can reduce the thread pool size and pin to a specific set of cores depending on CPU topology (NUMA, L3/L2 cache locality, SMT/no-SMT cores on hybrid Intel CPUs, etc.) to ensure a great user experience.

Out of probably a thousand Rust dependencies, blst is the only one that uses an internal thread pool for CPU-intensive operations, one that we have zero control over without this PR.

If the PR is not accepted, we will either have to switch back to upstream and hurt the user experience, or maintain a continually diverging fork with this change set ourselves. Neither option is ideal, which is why I opened a PR and tried not only to keep performance the same but to improve it in some cases, as well as to improve code quality.

dot-asm commented 3 months ago

Once again, if you don't assume that all calls to this library are made from the same thread (which we don't), then a spawn_broadcast call [made by this library] that would otherwise take a pair of threads and 1 ms, issued 1 ms after a 5-ms, otherwise-30-thread one [also made by this library], will finish in 4+ ms.

As for affinity, threadpool does take into consideration the process' affinity mask upon instantiation.

nazar-pc commented 3 months ago

> Once again, if you don't assume that all calls to this library are made from the same thread (which we don't), then a spawn_broadcast call [made by this library] that would otherwise take a pair of threads and 1 ms, issued 1 ms after a 5-ms, otherwise-30-thread one [also made by this library], will finish in 4+ ms.

What you're writing makes sense hypothetically, in a vacuum. I have already explained a few times in this thread why this is a hypothetical issue. If you really have designed an app where you have literally no idea what is running, how, and where (in which case I'm really sorry), you can still create an explicit thread pool and call blst under it to ensure no unexpected overlap with other operations:

let thread_pool = rayon::ThreadPoolBuilder::default().build().unwrap();
thread_pool.install(|| {
    // Do blst stuff on a dedicated thread pool here
});

Read the docs about it here: https://docs.rs/rayon/latest/rayon/struct.ThreadPool.html#method.install. I hope a very specific example helps you see that you can have a guaranteed dedicated thread pool if you really want one, and let the kernel decide what tasks to run and when.

> As for affinity, threadpool does take into consideration the process' affinity mask upon instantiation.

That is not a replacement for what I'm asking, not even close. To give you another example, in https://github.com/autonomys/subspace we also check the CPU topology and split work into multiple thread pools, each pinned to a specific NUMA node (mostly Intel Xeon) or CCX (on AMD Threadripper/Epyc), to ensure threads don't undergo expensive migrations; this results in significant performance wins. All of this is possible with the changes I'm proposing and impossible with stock blst. And for some use cases we have another collection of thread pools that does much the same but only utilizes half of the cores in each NUMA node/CCD, so depending on the use case we call .install() on one thread pool or the other.

You just can't make those decisions at the process level. You know way more about compute performance than I do; I'm not sure why I have to explain this in so much detail.
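
A sketch of that two-pool setup (pinning shown via the third-party core_affinity crate; the real subspace code is more involved):

use rayon::ThreadPoolBuilder;

fn pinned_pool(cores: Vec<core_affinity::CoreId>) -> rayon::ThreadPool {
    ThreadPoolBuilder::new()
        .num_threads(cores.len())
        // Pin worker `idx` to one core of this NUMA node/CCX so threads
        // never migrate across chiplets or sockets.
        .start_handler(move |idx| {
            core_affinity::set_for_current(cores[idx]);
        })
        .build()
        .unwrap()
}

fn main() {
    let cores = core_affinity::get_core_ids().unwrap();
    let (first, second) = cores.split_at(cores.len() / 2);
    let pool_a = pinned_pool(first.to_vec());
    let pool_b = pinned_pool(second.to_vec());
    // Depending on the use case, run blst-heavy work on one pool or the other.
    pool_a.install(|| { /* verify signatures here */ });
    pool_b.install(|| { /* or here */ });
}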

nazar-pc commented 2 months ago

Just to give you some perspective on my claims, I benchmarked this PR on an AMD 7970X (32C/64T, 4 CCDs).

Using all CPU cores (at the mercy of default thread scheduling):

verify_multi_aggregate/verify_multi_aggregate/8                                                                             
                        time:   [1.0208 ms 1.0286 ms 1.0358 ms]
verify_multi_aggregate/verify_multi_aggregate/16                                                                             
                        time:   [1.1221 ms 1.1282 ms 1.1345 ms]
verify_multi_aggregate/verify_multi_aggregate/32                                                                             
                        time:   [1.3031 ms 1.3095 ms 1.3171 ms]
verify_multi_aggregate/verify_multi_aggregate/64                                                                             
                        time:   [1.6318 ms 1.6447 ms 1.6579 ms]
verify_multi_aggregate/verify_multi_aggregate/128                                                                             
                        time:   [2.2720 ms 2.2791 ms 2.2865 ms]

Limiting to just 1 CCD (8C16T):

verify_multi_aggregate/verify_multi_aggregate/8                                                                             
                        time:   [1.2282 ms 1.2297 ms 1.2311 ms]
verify_multi_aggregate/verify_multi_aggregate/16                                                                             
                        time:   [1.3055 ms 1.3060 ms 1.3067 ms]
verify_multi_aggregate/verify_multi_aggregate/32                                                                             
                        time:   [1.9418 ms 1.9446 ms 1.9470 ms]
verify_multi_aggregate/verify_multi_aggregate/64                                                                             
                        time:   [3.1277 ms 3.1311 ms 3.1354 ms]
verify_multi_aggregate/verify_multi_aggregate/128                                                                            
                        time:   [5.7414 ms 5.7647 ms 5.7878 ms]

As you can see, depending on the benchmark there is potential to do 1.5x-3.5x more work with this PR in the same app by strategically pinning CPU cores for different thread pools, which is not possible with the current master.

Is there any chance this PR can be merged, or am I wasting my time here?