supranational / blst

Multilingual BLS12-381 signature library
Apache License 2.0

ARMv7 optimization #213

Open timethernom opened 2 months ago

timethernom commented 2 months ago

I would like to speed up signature verification using the ARMv7 instruction set (ARM4 or ARM33 device). Verification takes about 3 seconds with a __BLST_PORTABLE__ build. Could you hazard a guess at how much speedup I could get from an ARMv7 asm implementation?

By comparison, ECDSA signature verification takes about 75 msec with the asm from https://github.com/kmackay/micro-ecc, using about 40K of static lookup tables. Am I correct that lookup tables play a smaller role in BLS?

mratsim commented 2 months ago

Pairings (BLS signatures) are significantly slower than scalar multiplications (ECDSA). In my old BLST benches (https://github.com/status-im/nim-blst/issues/1) the ratio was about 6x, and that's before accounting for the double pairing and the hashing-to-curve that BLS verification needs.

If you look at micro-ecc, they use the ARM DSP extension's UMAAL instruction: https://github.com/kmackay/micro-ecc/blob/master/asm_arm_mult_square_umaal.inc

This paper, https://eprint.iacr.org/2018/700.pdf, has some timings using generic Montgomery-based field multiplication (like BLST) with NEON (it also covers UMAAL), and the ratio is 6.5x.
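For context, UMAAL computes a 32×32-bit multiply plus two 32-bit additions whose 64-bit result can never overflow (0xffffffff² + 2·0xffffffff = 2⁶⁴ − 1), which is exactly the inner step of schoolbook bignum multiplication. A minimal illustration in GCC/Clang inline assembly (the helper name is just for illustration):

```c
#include <stdint.h>

/* UMAAL RdLo, RdHi, Rn, Rm computes RdHi:RdLo = Rn*Rm + RdLo + RdHi;
 * the 64-bit result cannot overflow, so no carry handling is needed
 * in the inner loop of a schoolbook/Montgomery multiplication. */
static inline void umaal32(uint32_t *lo, uint32_t *hi, uint32_t a, uint32_t b)
{
    __asm__("umaal %0, %1, %2, %3"
            : "+r"(*lo), "+r"(*hi)
            : "r"(a), "r"(b));
}
```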

If you want to speed up signature verification, maybe explore batching: if you can verify 20+ signatures at once, the total verification time should come in lower than for 20+ ECDSA signatures.
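With blst's C API, batching looks roughly like the sketch below (names follow blst.h, but double-check the exact signatures against your copy of the header; production code should also weight untrusted signatures with random scalars, which blst's mul_n_aggregate variants are for):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include "blst.h"

/* Ciphersuite DST for the "basic" min-pk BLS scheme. */
static const byte DST[] = "BLS_SIG_BLS12381G2_XMD:SHA-256_SSWU_RO_NUL_";

bool verify_batch(const blst_p1_affine *pks, const blst_p2_affine *sigs,
                  const byte *const *msgs, const size_t *msg_lens, size_t n)
{
    /* blst_pairing is an opaque type; allocate it by size, 8-byte aligned. */
    uint64_t ctx_mem[(blst_pairing_sizeof() + 7) / 8];
    blst_pairing *ctx = (blst_pairing *)ctx_mem;

    blst_pairing_init(ctx, true /* hash-to-curve */, DST, sizeof(DST) - 1);

    for (size_t i = 0; i < n; i++)
        if (blst_pairing_aggregate_pk_in_g1(ctx, &pks[i], &sigs[i],
                                            msgs[i], msg_lens[i],
                                            NULL, 0) != BLST_SUCCESS)
            return false;

    blst_pairing_commit(ctx);                   /* shared Miller-loop work */
    return blst_pairing_finalverify(ctx, NULL); /* one final exponentiation */
}
```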

dot-asm commented 2 months ago

NEON

Since there is no such thing as an "ARM4 or ARM33 device," the question is more likely about Cortex-M4/M33, i.e. micro-controllers, which don't have a NEON unit. And with UMAAL being optional on the M33, it's safer to assume it's not an option either. This is not to say that assembly won't give a meaningful improvement, but it won't be as impressive as suggested.

verify

One can wonder why one would want to task a micro-controller with an operation as heavy as BLS signature verification. But either way, as pointed out by @mratsim, there is no way to get BLS anywhere near ECDSA in terms of wall-clock performance. Even signature generation is computationally heavier, let alone verification.

timethernom commented 2 months ago

Right, Cortex-M4/M33 is what I meant, and I can require UMAAL if it delivers a significant gain. The application is delivering a consensus proof to a hardware wallet. We will probably use a Schnorr threshold signature scheme like FROST, but I would like to avoid extra steps that may fail. 3 seconds is too slow; 0.5 seconds might be viable. Signing and aggregation speed is not a concern.

dot-asm commented 2 months ago

I can require UMAAL if it delivers a significant gain.

Depends on the definition of "significant," but if taking it from 3s to 0.5s is the goal, UMAAL won't get you there, no.

dot-asm commented 2 months ago

BTW, how did you compile it, exactly? Do you use, say, the Rust bindings, or make raw C calls?

timethernom commented 2 months ago

I just added server.c to a Keil project. My (out-of-date) Keil compiler threw some warnings about the inline functions calling static functions in no_asm.h, and then crashed. I got it to build by removing the static function declarations in no_asm.h.

This was just to see if I could get in the ballpark. I expected not, given the RFC showing BLS about 30x slower than ECDSA. But the blst library was such a smooth build and test on macOS that I decided it was worth checking out on one of my devices.
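For reference, the "raw C calls" path amounts to something like the minimal sketch below (min-pk variant: 48-byte compressed public key, 96-byte compressed signature; names follow blst.h, but double-check the signatures against your copy of the header):

```c
#include <stdbool.h>
#include <stddef.h>
#include "blst.h"

/* Ciphersuite DST for the "basic" min-pk BLS scheme. */
static const byte DST[] = "BLS_SIG_BLS12381G2_XMD:SHA-256_SSWU_RO_NUL_";

int verify_one(const byte pk_comp[48], const byte sig_comp[96],
               const byte *msg, size_t msg_len)
{
    blst_p1_affine pk;
    blst_p2_affine sig;

    /* Deserialize; real code should also apply its own policy on
     * infinity/invalid points. */
    if (blst_p1_uncompress(&pk, pk_comp) != BLST_SUCCESS ||
        blst_p2_uncompress(&sig, sig_comp) != BLST_SUCCESS)
        return 0;

    /* true selects hash-to-curve (vs encode-to-curve) for the message. */
    return blst_core_verify_pk_in_g1(&pk, &sig, true, msg, msg_len,
                                     DST, sizeof(DST) - 1,
                                     NULL, 0) == BLST_SUCCESS;
}
```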

dot-asm commented 2 months ago

I just added server.c to a Keil project.

The question is what the optimization level was. A possible pitfall is that assembler-assisted ECDSA won't be that dependent on the optimization level, while pure C is very dependent on it. Secondly, if your target executable format is ELF, you should be able to compile libblst.a with a newer gcc and link your project against it. The thing is that blst has no explicit dependencies on the language run-time, which should give you the freedom to mix compilers. It would be a good idea to add the -ffreestanding flag, e.g. ./build.sh CROSS_COMPILE=arm-none-eabi- -ffreestanding. What I'm trying to ensure is that the 3s baseline is actually representative.

dot-asm commented 2 months ago

./build.sh CROSS_COMPILE=arm-none-eabi- -ffreestanding

Hmm, gcc leaves references to memset (without us calling it) and __aeabi_uidivmod. ./build.sh CC=clang --target=arm-none-eabi -ffreestanding is another option. This still leaves a reference to __aeabi_uidivmod, which originates in multi_scalar.c. If it's a problem, -ffunction-sections is likely to remedy it. Try even -O3...
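If the stray memset reference bites at link time, the usual freestanding workaround is to supply a minimal one yourself; gcc is allowed to emit implicit memset calls (e.g. for large zero-initializations) even under -ffreestanding. A minimal stand-in:

```c
#include <stddef.h>

/* Byte-at-a-time memset for freestanding links without a C library.
 * Slow but sufficient to satisfy compiler-emitted references. */
void *memset(void *dst, int c, size_t n)
{
    unsigned char *p = dst;
    while (n--)
        *p++ = (unsigned char)c;
    return dst;
}
```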

dot-asm commented 2 months ago

Oh! And of course you'd need to throw in the -mcpu=cortex-m4 option.

dot-asm commented 2 months ago

Oh! And of course you'd need to throw in the -mcpu=cortex-m4 option.

BTW, clang even issues umaal instructions, so assembly won't give you an edge with respect to hardware-capability utilization. This is still not to say that it won't give a meaningful improvement, but it won't be a direct result of utilizing "magic" instructions.
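A quick way to observe this: the canonical C multiply-accumulate pattern below matches umaal's semantics exactly, and clang targeting Cortex-M4 typically lowers it to a single umaal instruction (the function name is just for illustration). Compiling with clang --target=arm-none-eabi -mcpu=cortex-m4 -O2 -S and inspecting the output shows it.

```c
#include <stdint.h>

/* (uint64_t)a*b + c + d maxes out at 2^64 - 1, i.e. it cannot overflow,
 * which is precisely UMAAL's RdHi:RdLo = Rn*Rm + RdLo + RdHi. */
uint64_t mac32(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    return (uint64_t)a * b + c + d;
}
```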

timethernom commented 2 months ago

Thanks! I will try this and see if there's any performance difference. This was with -O3 optimization.

timethernom commented 2 months ago

Easier said than done. Keil does not like the library format, so I will have to dig around to find out what is required there.

I guess the main acceleration in the kmackay ECDSA comes from the 20K lookup table. In my intended BLS application, the same public key will be used repeatedly. Is there any possibility of taking advantage of that? It's not the batch scenario, because the signatures arrive over time. But there would be room to cache some pre-computes.

dot-asm commented 2 months ago

Easier said than done. Keil does not like the library format, so will have to dig around on what is required there.

Giving it server.o might work better...

I guess the main acceleration in the kmackay ECDSA comes from the 20K lookup table. In my intended BLS application, the same public key will be used repeatedly. Is there any possibility of taking advantage of that? It's not the batch scenario, because the signatures arrive over time. But there would be room to cache some pre-computes.

You make it sound as if the lookup table makes ECDSA 6 times faster. It doesn't. It does yield a meaningful improvement, but it's not "magical." BLS is just significantly computationally heavier than ECDSA. But to the point: while there is a thing that can be pre-computed in BLS, the gain is marginal; it would hardly tip the scales.
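To make "marginal" concrete: in the min-pk instantiation (public keys in G1, signatures in G2), the verifier checks the pairing equation below. The ate Miller loop does its curve arithmetic on the G2 argument, and both G2 inputs (the signature and the hashed message) are fresh for every signature; the fixed G1 points (the generator and the public key) only serve as evaluation points, so a cached public key pins down very little of the work.

```latex
e(g_1,\ \sigma) \overset{?}{=} e(\mathrm{PK},\ H(m))
```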

dot-asm commented 2 months ago

BLS is just significantly computationally heavier than ECDSA.

For an embedded system, the correct answer is likely an integrated co-processor. One wishes OpenTitan were an accessible option, but that doesn't seem to be the case. Otherwise there is, for example, the Kinetis K8x, with usable capabilities in the Cortex-M4 space...

timethernom commented 2 months ago

The Kinetis K8x looks almost identical to what we are currently running on, an Infineon PSoC 62. We are currently running the 150 MHz M4 at 48 MHz to save power, so there's a 3x gain available there. It also has "12 programmable logic blocks, each with 8 Macrocells and an 8-bit data path (called universal digital blocks or UDBs)". Not sure if that can be of any use.

The kmackay ECDSA with uECC_OPTIMIZATION_LEVEL 0 takes 10 sec to verify one signature. The big leap comes at level 1, which drops it to 170 msec with no change in image size. That drops to 75 msec at level 3, with a 40K jump in image size. So I guess the precomputes deliver about 2x; the big win was something else. Probably level 0 is just a very simplistic reference implementation.

Using server.o revealed the need to add -fshort-enums to the build, for compatibility with our existing compile options. But then I got a bunch of link errors about relocating objects, mostly related to stdio. This is mysterious. Will investigate further at some point.

dot-asm commented 2 months ago

But then I got a bunch of link errors about relocating objects, mostly related to stdio.

As already mentioned, server.c has no dependencies on the C run-time, i.e. no references to stdio or any other std facility. But ./build.sh does compile it with -fPIC by default, and it's not impossible to imagine that this might disorient your linker and confuse it into blaming everybody for its own failure. Add -fno-pic to override (or simply don't pass -fPIC if you compile server.c yourself).

timethernom commented 2 months ago

No doubt the linker was confused; removing -fPIC avoided that. But there was no significant performance difference: clang -O2 was about 50 msec slower than the Keil -O3 build, and clang -O3 was about 50 msec faster, based on a single sample. Thanks for your help.

dot-asm commented 2 months ago

To summarize: it's the ability to keep whole vectors in the register bank that translates into performance "leaps." Most notably, one can just barely accommodate a 256-bit vector, eight 32-bit limbs, in ARM32 registers(*), but not a 384-bit one, which takes twelve. BLS operates on 384-bit inputs, which is basically why one can't expect impressively-N-times improvements no matter what you do here. This is still not to say that assembly won't provide a meaningful improvement, but it won't take you anywhere near the stated objective.

As for the Kinetis K8x: the K82 is the only one with relevant capabilities, more specifically a modular arithmetic unit. However, I don't actually have hands-on experience with it, so don't consider the mention as "this one will actually solve the problem."

(*) Which is what the referenced micro-ecc does at a high enough uECC_OPTIMIZATION_LEVEL.

dot-asm commented 2 months ago

don't consider the [Kinetis K82] mention as "this one will actually solve the problem."

Just in case: one has to recognize that the co-processor is still power-constrained, which in practice means it won't run as fast as one wishes. It surely can deliver significant improvements on "large" operations such as modular exponentiation and ECC arithmetic, but the most computationally intensive part of BLS signature verification would have to rely on individual field operations, multiplications, additions, etc. And it's not impossible to imagine that the overhead of frequent data transfers to/from the co-processor adds up and diminishes the wow factor.