supranational / blst

Multilingual BLS12-381 signature library
Apache License 2.0

[Optim] Runtime detection of MULX and ADX support #10

Closed mratsim closed 1 year ago

mratsim commented 4 years ago

Given the significant speed increase (~20% for high-level signing/verifying) provided by BMI2 and ADX support (for the MULX, ADCX, ADOX instructions) (bench: https://github.com/status-im/nim-blst/issues/1), it would probably be worthwhile to autodetect CPU support for those instructions at runtime. This is especially interesting because such CPUs are now widespread (Broadwell, 2015, for Intel; Ryzen, 2017, for AMD).

Cloudflare BN256 does that for example https://github.com/cloudflare/bn256/blob/3cac37b6/gfp_amd64.s#L108-L129
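For reference, such a runtime probe can be done with plain CPUID. Below is a minimal C sketch (the helper name is made up and this is not part of blst's API), checking leaf 7 / sub-leaf 0, where EBX bit 8 is BMI2 and bit 19 is ADX:

```c
#include <cpuid.h>   /* GCC/Clang header providing __get_cpuid_count() */
#include <stdio.h>

/* Illustrative helper, not part of blst's API: returns non-zero if the CPU
 * supports both BMI2 (MULX) and ADX (ADCX/ADOX).  CPUID leaf 7, sub-leaf 0:
 * EBX bit 8 is BMI2, bit 19 is ADX. */
static int cpu_has_bmi2_adx(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return ((ebx >> 8) & 1) && ((ebx >> 19) & 1);
}

int main(void)
{
    printf("MULX/ADCX/ADOX code path usable: %s\n",
           cpu_has_bmi2_adx() ? "yes" : "no");
    return 0;
}
```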

dot-asm commented 4 years ago

This is especially interesting because the CPUs are now widespread

Yes, and it's argued that they are so widespread that non-ADX code is on the brink of irrelevance. At least close enough to warrant compile-time detection, especially since more and more software is distributed as source code.

Cloudflare BN256 does that for example

And? :-) :-) :-) On a serious note though, performance is not all about benchmarking a tight loop around a short subroutine. What I'm trying to say is that a conditional branch inside a short subroutine is not the optimal placement. It surely looks adequate in a controlled benchmark environment, but in real life, where the branch predictor is a limited resource, you risk misprediction penalties, and said penalties are comparable to the subroutine's execution time. It can even be affected by linking order and other link-time optimizations. A run-time switch is better placed as high in the call hierarchy as possible, say at the point-operation level. Or at the whole-library level, i.e. at compile time ;-)

dot-asm commented 4 years ago

At the end of the day, performance is not necessarily about doing things in the shortest possible time, but rather about meeting some kind of deadline [preferably with confidence]. In other words, if a user-facing or "desktop-class" application doesn't depend on 100% availability of CPU time 24 hours a day, one doesn't have to feel bad about forgoing the latest instruction extension over a not-that-big margin, as long as the application actually does the job 100%. Well, one can still make the case that faster code would consume less power, though one has to recognize that the relation is far from linear, in the sense that 20% faster code won't yield 20% savings, because a sizable portion of power is spent on merely being up... In other words, I for one am not actually convinced that an application distributed in binary form for "mass consumption" and not using ADCX would actually be inadequate...

Alternatively (and apart from the most obvious "two shared libraries to choose from at run time"), one can provide two binaries to choose from at installation time. Or install both binaries and provide a wrapper that evaluates processor capability at application startup and executes one of the two. Or one can wrap one binary into another and have the first binary extract the second one in case it's needed, a kind of self-wrapper. Things might get tricky if you run as a Windows service though, in which case installation-time evaluation would have to do. And if that is still unsuitable for some reason, let's talk about it, I'm sure I'll be able to offer a solution...
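For illustration, the "wrapper that evaluates processor capability at startup" idea could look roughly like the following C sketch; the installed binary paths are invented and error handling is minimal:

```c
/* Sketch of the "wrapper picks one of two binaries at startup" idea.
 * The installed binary paths are hypothetical. */
#include <cpuid.h>
#include <unistd.h>

static int cpu_has_bmi2_adx(void)
{
    unsigned int a, b, c, d;
    /* CPUID leaf 7, sub-leaf 0: EBX bit 8 = BMI2, bit 19 = ADX */
    return __get_cpuid_count(7, 0, &a, &b, &c, &d) &&
           ((b >> 8) & 1) && ((b >> 19) & 1);
}

int main(int argc, char **argv)
{
    const char *target = cpu_has_bmi2_adx() ? "/usr/libexec/app-adx"
                                            : "/usr/libexec/app-portable";
    (void)argc;
    argv[0] = (char *)target;   /* keep the rest of the argument list */
    execv(target, argv);
    return 127;                 /* only reached if execv() failed */
}
```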

paulhauner commented 3 years ago

I'm also in favor of auto-detection.

one can provide two binaries to choose from at installation time

This is what we do presently. Here are some pain points:

Or install both binaries and provide a wrapper that would evaluate processor capability at application startup and execute one of the two

Is it really best to do this at the layer above BLST (i.e., the application that imports BLST)? As I understand it, using the faster CPU instructions is always better if the CPU supports them and always worse if it doesn't. In my experience of using this library for several months, the only thing the application gains from the ability to choose is the risk of making a bad decision. If it turns out that some user really wants to ignore runtime detection for whatever reason, then an "ignore runtime detection" runtime/compile flag could be used, ultimately placing us in a position of being "as fast as is safe" out of the box.

Evaluating processor capability inside BLST means less code duplication across applications and keeps the "which instruction set to use" concern contained. If BLST finds benefit in some new CPU instruction, then with self-contained runtime detection it can be adopted without breaking upstream apps (i.e., most of Eth2).

Or one can wrap one binary into another and have first binary extract second one in case it's needed. Kind of self-wrapper.

As above.

mratsim commented 3 years ago

I'll try to substantiate the feature request

Context

Performance

Current clients are squashing bottlenecks in sync, IO and hashing via various techniques, including caching, in-memory DBs and, in general, avoiding work. However, there is one thing we can't avoid: verifying signatures, and in all clients pairings are still one of the major bottlenecks. Having 20% faster verification is significant for the whole application.

For example, when syncing, the bottlenecks are disk speed, network speed and cryptography; on a fast network with an SSD, only cryptography remains. 20% faster verification translates to a couple of hours' difference in syncing time today and will translate to days as the blockchain grows.

Ergonomics

Clients already have a difficult time making blockchain safe and ergonomic to use. They try to abstract away theoretical and low-level concepts from the uninitiated.

A SIGILL breaks that abstraction.

Furthermore, consumers of clients like staking pools, or packagers like DappNode, might run on cloud services where the CPU model is not stable and would be forced to use the portable build. A 20% performance difference would translate to significant cost savings. Runtime detection would save significant maintenance, documentation and debugging headaches; I expect all clients have a "What to do if you receive SIGILL" section in their documentation at the moment.

Analysis of runtime detection

Using branches

I'd argue that even if CPU runtime detection is done via a branch at the lowest possible level, the cost is negligible. Case in point: both Gurvy and my own library, Constantine, use runtime detection just before fp_mul and are faster than BLST on G1 doubling.

Source: eccbench, 2021-01-29 (https://hackmd.io/@zkteam/eccbench#G1-mixed-additiondoubling2) [benchmark results image]

Code for detection:
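(The linked snippet is not reproduced here; below is only a rough sketch of the pattern being described - a feature flag resolved once and a branch in front of the field multiplication. fp_mul_adx and fp_mul_generic are made-up names, not Constantine's, Gurvy's or blst's actual routines.)

```c
/* Rough sketch only: a feature flag resolved once, then a branch in front of
 * the field multiplication.  fp_t, fp_mul_adx() and fp_mul_generic() are
 * made-up names standing in for the two backend implementations. */
#include <cpuid.h>

typedef unsigned long long fp_t[6];   /* 384-bit field element, 6 limbs */

void fp_mul_adx(fp_t r, const fp_t a, const fp_t b);      /* MULX/ADCX/ADOX */
void fp_mul_generic(fp_t r, const fp_t a, const fp_t b);  /* portable       */

static int have_adx = -1;             /* -1 means "not probed yet" */

static int probe_adx(void)
{
    unsigned int a, b, c, d;
    return __get_cpuid_count(7, 0, &a, &b, &c, &d) &&
           ((b >> 8) & 1) && ((b >> 19) & 1);
}

void fp_mul(fp_t r, const fp_t a, const fp_t b)
{
    if (have_adx < 0)
        have_adx = probe_adx();
    if (have_adx)                      /* same outcome on every call, so the  */
        fp_mul_adx(r, a, b);           /* branch predicts correctly after the */
    else                               /* first execution                     */
        fp_mul_generic(r, a, b);
}
```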

Hardware branch predictors have improved significantly since Haswell, so much so that for interpreter dispatch (unpredictable, with up to 255 opcodes!) it is now better to use switch dispatching than a table of function pointers.

Furthermore, while it's true that hardware predictors are limited in space, there are still around 4096 branch target buffer entries (https://www.agner.org/optimize/microarchitecture.pdf, https://xania.org/201602/bpu-part-one, https://stackoverflow.com/questions/16513943/how-can-i-get-my-cpus-branch-target-bufferbtb-size). As the library is fully constant-time with almost no branches otherwise, that capacity is essentially unused. I'm not sure what the situation on ARM is, but BLST barely exercises the branch predictor due to its constant-time nature.

Also, the branch misprediction can only happen on the first call to the multiplication and would cost at most ~10 cycles of misprediction/pipeline flush (https://lemire.me/blog/2019/10/15/mispredicted-branches-can-multiply-your-running-times/); every other call would be free. An fp_mul operation takes roughly 100 cycles with ADCX/ADOX or 120 cycles without, so against the 12000+ multiplications needed for a full pairing (over 1.2 million cycles), a one-time 10-cycle misprediction amounts to well under 0.001% and doesn't even register.

Lastly, it's likely that the branch misprediction, if it happens at all, is hidden by the cache miss incurred to fetch the Fp element data from memory (which is needed whichever branch is taken).

Not using branches

Alternatively, if branches need to be avoided for other reasons, the technique used by GCC function multiversioning can be applied: an initial dispatch function is stored behind a function pointer in global memory and, on the first call, replaces itself with the proper implementation for the architecture. This is detailed in https://www.agner.org/optimize/optimizing_cpp.pdf

[CPU dispatch example from Agner Fog's optimizing_cpp manual]
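Roughly, that dispatch pattern looks like the following sketch (all names are illustrative, not blst's):

```c
/* Sketch of the self-replacing dispatch described in Agner Fog's manual:
 * a global function pointer starts out pointing at a dispatcher that probes
 * the CPU once, retargets the pointer, and forwards the first call.
 * All names are illustrative, not blst's. */
#include <cpuid.h>

typedef unsigned long long fp_t[6];

void fp_mul_adx(fp_t r, const fp_t a, const fp_t b);
void fp_mul_generic(fp_t r, const fp_t a, const fp_t b);

static void fp_mul_dispatch(fp_t r, const fp_t a, const fp_t b);

/* Callers always go through this pointer; after the first call it points
 * straight at the chosen implementation, so no conditional branch remains. */
void (*fp_mul_ptr)(fp_t, const fp_t, const fp_t) = fp_mul_dispatch;

static void fp_mul_dispatch(fp_t r, const fp_t a, const fp_t b)
{
    unsigned int eax, ebx, ecx, edx;
    int adx = __get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) &&
              ((ebx >> 8) & 1) && ((ebx >> 19) & 1);

    fp_mul_ptr = adx ? fp_mul_adx : fp_mul_generic;
    fp_mul_ptr(r, a, b);   /* forward the call that triggered the probe */
}
```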

Detecting at a high level

We can indeed do CPU detection at a higher level; however, given the previous sections, I think this is extra code that saves a couple of nanoseconds on a full pairing while significantly complicating builds.

dot-asm commented 3 years ago

I'll be replying in smaller "bites." (Unfortunately maybe not as fast as you'd wish/expect;-) Also, these won't be categorical "no"-s, but we have to recognize certain things.

Not using branches ... (*CriticalFunction)

Rendered unacceptable for cryptographic applications in January 2018 (Spectre). On a related note, here is a homework assignment: compile all the different things with clang[++] -mretpoline and re-compare the results.

As the library is fully constant-time with almost no branches otherwise, it is completely unused.

Constant-time-ness doesn't mean that there are no conditional branches, only that they don't depend on secret data.

while it's true that hardware predictors are limited in space, there are still ...

Going back to the beginning of this remark, an excessive amount of branch prediction logic is actually bad for security. It doesn't matter how well the processor can predict branches in a controlled[!] environment; as long as one can indirectly affect that logic in some way, the only viable strategy for cryptographic code is to have as few conditional branches as possible, especially on critical paths. You might say "you're being over-paranoid." Yeah, but unfortunately such statements hold only until somebody figures something out...

arnetheduck commented 2 years ago

I think most of us think of many of these functions at a higher level than the individual math operations - we sign and verify things. Thus, one possible way forward would be to have each of the specialized implementations for a "family" of instruction sets exported, and to allow both of them to be built into a single "build" by creating several "namespaces": blsSignGeneric, blsSignAdx and so on. We could then, when consuming the library, choose between these in any way we find reasonable (detection, compile-time selection, etc.).
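As a sketch of what consuming such namespaced builds might look like (only the blsSignGeneric/blsSignAdx names come from the suggestion above; the signatures are invented for illustration):

```c
/* Sketch of application-level selection between two namespaced builds.
 * Only the blsSignGeneric/blsSignAdx names come from the suggestion above;
 * the signatures are invented for illustration. */
#include <cpuid.h>
#include <stddef.h>

int blsSignAdx(unsigned char sig[96], const unsigned char *msg, size_t len,
               const unsigned char sk[32]);
int blsSignGeneric(unsigned char sig[96], const unsigned char *msg, size_t len,
                   const unsigned char sk[32]);

typedef int (*bls_sign_fn)(unsigned char *, const unsigned char *, size_t,
                           const unsigned char *);

/* Resolved once at startup, i.e. as high in the call hierarchy as possible. */
bls_sign_fn bls_sign_select(void)
{
    unsigned int a, b, c, d;
    int adx = __get_cpuid_count(7, 0, &a, &b, &c, &d) &&
              ((b >> 8) & 1) && ((b >> 19) & 1);
    return adx ? blsSignAdx : blsSignGeneric;
}
```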

karalabe commented 1 year ago

Hey hey heeey :) @dot-asm

We hit this issue a few days ago and have been thinking a lot about how to work around it. For us it's important to support pre-compiled binaries, since most people prefer Docker workflows; but even those who don't generally have different build and execution environments.

Currently the only two options we have are to either blindly disable the ADX optimisations and make it slower for everybody, or to leave the optimisation in and hope for the best. The former entails a significant hit on processing performance (measured in hours of extra processing time), whereas the latter is problematic in cloud environments where there's no direct control over the CPU allocation (or cross-cloud deployments, where what CPU we get is even more of a lottery).

Would it be possible to pick up this idea of runtime detection of ADX instruction sets and run the faster or slower algorithm based on that? If you don't have the time to dive into the implementation yourself, would you be open to us sending in a PR to do it?

Thanks, Peter

holiman commented 1 year ago

I agree that having runtime detection would be hugely beneficial for consumers of this library, and would love to see this issue get resolved, the sooner the better.

If you don't have the time to dive into the implementation yourself, would you be open to us sending in a PR to do it?

?

sauliusgrigaitis commented 1 year ago

Currently the only two options we have are to either blindly disable the ADX optimisations and make it slower for everybody, or to leave the optimisation in and hope for the best.

The third option is to load two copies of blst (one with ADX and one without), check the CPU at runtime, and call the right copy. It's ugly, but we are going to try this approach in Grandine unless a proper solution is in place by then.
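A rough sketch of that approach with dlopen (the library file names are made up; the resolved symbol is just an example). On older glibc this needs -ldl at link time:

```c
/* Sketch of the "two shared-library copies" approach: build blst twice
 * (with and without ADX), then dlopen() the matching one at startup.
 * The library file names are made up. */
#include <cpuid.h>
#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    unsigned int a, b, c, d;
    int adx = __get_cpuid_count(7, 0, &a, &b, &c, &d) &&
              ((b >> 8) & 1) && ((b >> 19) & 1);

    const char *lib = adx ? "libblst_adx.so" : "libblst_portable.so";
    void *h = dlopen(lib, RTLD_NOW | RTLD_LOCAL);
    if (!h) {
        fprintf(stderr, "dlopen(%s): %s\n", lib, dlerror());
        return 1;
    }

    /* Resolve whichever blst entry points the application needs, e.g.: */
    void *sign = dlsym(h, "blst_sign_pk_in_g1");
    printf("loaded %s, blst_sign_pk_in_g1 = %p\n", lib, sign);

    dlclose(h);
    return 0;
}
```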

mratsim commented 1 year ago

How easy is it to auto-namespace the library? Is there a compiler flag we can turn on for this or does it require a fork?

arnetheduck commented 1 year ago

Is there a compiler flag we can turn on for this or does it require a fork?

objcopy --redefine-sym works for static libraries - it's not hard, just messy - it's the kind of thing that is likely easier to maintain in a fork of blst that exists only for the purpose of renaming symbols at the compilation stage.

mratsim commented 1 year ago

Addressed on Jun 12, 2023:

And SHA256 runtime detection as well: