supranational / blst

Multilingual BLS12-381 signature library
Apache License 2.0
471 stars 177 forks source link

Support 64-bit limbs on no-asm platforms #109

Open uweigand opened 2 years ago

uweigand commented 2 years ago

Currently, platforms without assembler support always use 32-bit limbs, but the Rust bindings always assume 64-bit limbs. This breaks on big-endian platforms like our IBM Z (s390x).

This patch enables 64-bit limbs on 64-bit platforms even if there is no hand-written assembler, by using a 128-bit integer type in the C implementation (this is an extension that is widely supported on 64-bit platforms with GCC or LLVM).

This fixes the broken Rust bindings on IBM Z, and also improves performance by a factor or 3 or more, because compiler-generated code handling __int128 already uses the 64x64->128 multiply instruction our ISA provides.

To improve performance of compiler-generated code a bit more, this also switches to the -O3 optimization level, which helps with unrolling of the Montgomery multiply core loop.

mratsim commented 2 years ago

For all platforms that don't support native 64x64->128 multiply (does MIPS or Riscv5 for example?) is the GCC/Clang fallback constant-time or does it use ternary or comparisons for carries?

uweigand commented 2 years ago

For all platforms that don't support native 64x64->128 multiply (does MIPS or Riscv5 for example?) is the GCC/Clang fallback constant-time or does it use ternary or comparisons for carries?

The 64-bit versions of MIPS and RISC-V support a native 64x64->128 multiply (either as a single instruction or as a pair of instructions).

I can find only one 64-bit target supported by LLVM that doesn't have a native implementation (sparc64), and this uses a call to the __multi3 library routine. I guess it depends on where this library is taken from, but at least the LLVM (compiler-rt) default implementation of it doesn't appear to use conditional code paths: https://github.com/llvm/llvm-project/blob/main/compiler-rt/lib/builtins/multi3.c

If this is a concern, we could make use of 64-bit limbs conditional on the architecture, with s390x (and possibly others where it makes sense) opting in?

uweigand commented 2 years ago

Ping? Any update on this?

emlowe commented 1 year ago

I know this PR is over a year old now, but I recently was porting Chia to a VisionFive2 riscv64 SBC (StarFive JH7110 with RISC-V quad-core CPU with 2 MB L2 cache and a monitor core, supporting RV64GC ISA, working up to 1.5 GHz ) and I decided to try this patch to see if there were any speed improvements.

Unfortunately, our wrapper library and test code crashes and dumps core when using this patch. (https://github.com/Chia-Network/bls-signatures)

uweigand commented 1 year ago

Unfortunately, our wrapper library and test code crashes and dumps core when using this patch. (https://github.com/Chia-Network/bls-signatures)

I was able to reproduce this on s390x. The problem is this __builtin_assume in src/no_asm.h:

static limb_t quot_rem_n(limb_t *div_rem, const limb_t *divisor,
                                          limb_t quotient, size_t n)
{
    __builtin_assume(n != 0 && n%2 == 0);

which was added by commit https://github.com/supranational/blst/commit/7cc926705fb85313c5d1837fde4665e9a8a5fdfb after I originally created and tested this PR.

This assumes that quot_rem_n is always called with an n that is a multiple of two. However, there are only two call sites:

inline limb_t quot_rem_128(limb_t *div_rem, const limb_t *divisor,
                                            limb_t quotient)
{   return quot_rem_n(div_rem, divisor, quotient, NLIMBS(128));   }

inline limb_t quot_rem_64(limb_t *div_rem, const limb_t *divisor,
                                           limb_t quotient)
{   return quot_rem_n(div_rem, divisor, quotient, NLIMBS(64));   }

so n is always equal to NLIMBS(128) or NLIMBS(64). Without this patch, the limb size in this file is always 32, so n is either 4 or 2. But with this patch, the limb size becomes 64, so n is 2 or 1, breaking the assumption.

If I simply remove the && n%2 == 0, the test suite (./build/src/runtest) passes again for me, and the benchmark suite (./build/src/runbench) shows significant improvements, up to a factor of 3 or so.

I think removing this part of the assumption (or even the whole assumption) should be fine on other platforms too, as this function will likely be inlined into its two call sites where n becomes constant anyway. But I haven't verified this.

emlowe commented 10 months ago

Took me some time to get back to this.

but YES, applying your suggestion and all tests pass in the VisionFive2 SBC and it is considerably faster at roughly 2.5 times faster for signing and verification - pretty much everything is 2.5 times faster

I would LOVE for this to get merged in - until we get someone to write some actual riscv assembler this is likely as good as it can get.

uweigand commented 10 months ago

but YES, applying your suggestion and all tests pass in the VisionFive2 SBC and it is considerably faster at roughly 2.5 times faster for signing and verification - pretty much everything is 2.5 times faster

Thanks for verifying! I've now updated this to fix the incorrect assertion.

hoffmang9 commented 8 months ago

Any chance this can get in with the next major release?