privacy-scaling-explorations / halo2

https://privacy-scaling-explorations.github.io/halo2/

[RFC] Blackboxing MSM and FFT - Hardware Accel API #216

Closed mratsim closed 2 months ago

mratsim commented 11 months ago

Goals

Following https://github.com/privacy-scaling-explorations/halo2curves/pull/86, MSM and FFT have been moved to halo2curves, per the rationale in https://github.com/privacy-scaling-explorations/halo2curves/issues/84

This allows the proof systems to evolve separately from the algebra backend, which could be:

This RFC maps the next step, blackboxing MSM and FFT in this repo so they can be swapped out by any provider of a C API.

This has the following benefits:

Flexible low-level API design

For reference, this kind of glue API is usually called a HAL, for "Hardware Abstraction Layer" or "Hardware Acceleration Layer".

The proposed API is inspired by Computer Vision and Neural Network acceleration libraries, which seem to be the domains most similar to ours in terms of constraints while also being the most mature:

Example accelerator APIs from Image Processing and Deep Learning

The salient parts of those APIs are the following:

All those APIs also return library-specific status codes; in our case we can assume that the calls cannot fail. In particular, out-of-memory will crash, which is already the case today.

Proposed API for MSM

The API is name-spaced with h2k for Halo2-KZG. We use BN254 as an example.

The C API for multi-scalar multiplication must have the following signature:

void h2k_mycurve_multi_scalar_mul(
        engine_t* engine_context,
        h2k_bn254_g1_prj* out,
        h2k_bn254_fr* scalars,
        h2k_bn254_g1_aff* points,
        int len)

ABI

For SNARK/pairing-friendly curves, all implementations use the same low-level representation of field elements and elliptic curve elements, which makes the following ABI a de facto standard.

We describe the ABI on 64-bit machines, with BN254 as an example:

#define pBits 254       // The number of bits in the curve field modulus
#define rBits 254       // The number of bits in the curve order
#define WordBitWidth 64 // The number of bits in Halo2-KZG words

#define words_required(bits) ((bits+WordBitWidth-1)/WordBitWidth)

typedef struct { uint64_t limbs[words_required(rBits)]; } bn254_fr;
typedef struct { uint64_t limbs[words_required(pBits)]; } bn254_fp;

typedef struct { bn254_fp x, y; } bn254_g1_aff;
typedef struct { bn254_fp x, y, z; } bn254_g1_prj;

Compatibility

AMD, Intel and Nvidia GPUs

As far as I am aware, all cryptographic libraries (CPU/GPU) use the same Montgomery magic constant in their representation, and their limb-endianness is little-endian.

Furthermore, we note that ALL GPU ISAs use little-endian words (Intel, AMD, Nvidia, Apple).

When words and limbs are both little-endian, whether we use 32-bit or 64-bit words will not change the binary representation of the whole data structure.

This means that there is no need for conversion between a little-endian machine (x86, ARM) and a GPU, even if words are 64-bit on one and 32-bit on the other. We can just cast/transmute between CPU and GPU pointers, which saves time and memory.

We caveat this for curves for which the number of limbs on 32-bit is not double that on 64-bit. As far as I'm aware this is only the case for P224 (7 32-bit words, 4 64-bit words), and it's not an interesting curve.

FPGA

FPGAs usually work in the canonical domain and use Barrett reduction instead of Montgomery, so a conversion will be needed; hence endianness/zero-copy doesn't matter.

Apple GPUs, MIPS, RISC-V, WASM

While Apple GPUs are also little-endian, there is one caveat: they do not support addition-with-carry, subtraction-with-borrow, or 32x32 -> 64 extended-precision multiplication, at least not officially (unlike AMD, Intel and Nvidia GPUs). Looking at the reverse-engineering effort from Asahi Linux (https://rosenzweig.io/), I have not seen those instructions either. Hence, for speed, an approach using 9x29-bit limbs (instead of 8x32-bit) would likely yield a faster implementation, so Apple GPUs would need a conversion step.

This approach is also likely necessary for accelerating MIPS, RISC-V and WASM targets.

Proposed async API

In some protocols we might want to process multiple MSMs/FFTs in parallel. For example, batch KZG verification described here: https://github.com/ethereum/consensus-specs/blob/v1.4.0-beta.2/specs/deneb/polynomial-commitments.md#verify_blob_kzg_proof_batch

does the following (with e a pairing function, and [a]₁ the scalar a multiplied by the generator of 𝔾₁):

e(∑ [rᵢ][proofᵢ]₁, [τ]₂) . e(∑[rᵢ]([commitmentᵢ]₁ - [eval_at_challengeᵢ]₁) + ∑[rᵢ][zᵢ][proofᵢ]₁, [-1]₂) = 1

We can launch the 3 MSMs in an async manner and await their readiness before doing the pairing.

We copy the CUDA API (for example cudaMemcpyAsync): we suffix the function name with _async and return an opaque future handle.

engine_future_t h2k_mycurve_multi_scalar_mul_async(
        engine_t* engine_context,
        h2k_bn254_g1_prj* out,
        h2k_bn254_fr* scalars,
        h2k_bn254_g1_aff* points,
        int len)

The engine should provide a wait function that allows blocking until the result is ready.

void engine_wait(engine_t* engine_context, engine_future_t handle)

This should be flexible enough to wrap Cuda Streams and OpenCL events, see also my high-level description of:

In Halo2, wait MUST be called once and only once. This allows memory reclamation of the handle and async task to be done before exiting wait, and it guarantees no double-free.

For flexibility in scheduling computation graphs, the future handle may escape its scope and be returned by the function that issued the async MSM call. The engine MUST support this use-case.

Init/shutdown

The engine should also provide an init and shutdown function, and may add configuration options there, like the number of threads for CPU backends or the target GPUs for GPU backends.

FFT

TBD

Naming

I propose we refer to the API as "Accel" in discussion, as we might want not only hardware accelerators but also other software backends.

Future considerations

For better acceleration, teams are considering running the whole prover on GPUs. It would be interesting to get their feedback on the bottlenecks and on whether the async API covers that.

JasonStarlight commented 11 months ago

Hi, this is great! It's a very good definition and description of the hardware API for MSM and FFT. We also have some other ideas to share with you. Below is our reference code on GitHub for the Halo2 hardware acceleration API. We access various hardware components, including some API and struct definitions, by establishing a "device manager" module, and we describe how halo2 interacts with it.

https://github.com/superscalar-io/halo2_device_sample/tree/master

We have tested the "device manager" module, and the results are correct and as expected. This project serves as an example with GPU devices and computational units for MSM and NTT. It can also support other devices and computational units.

Regarding the ABI, in my personal view, Halo2 establishes the raw data format, and hardware manufacturers may tailor it to align with their hardware characteristics. Additionally, regardless of the coordinate system used for hardware acceleration, the result is ultimately converted to Projective, in line with halo2. This approach is okay; however, it's important to be aware of hidden risks in the coordinate-system calculations and adaptations, such as the to_affine() operation, so extra consideration is needed in this area.

Welcome everyone to join the discussion, looking forward to your replies! Thanks!

einar-taiko commented 9 months ago

Great work.

Regarding the computation graph - could you elaborate on what model you had in mind here? I have a sense that the computation will need more info about the problem (e.g. the number of ops) but also about the available hardware to do any useful scheduling/ordering/distribution. Some of this info may have to be provided through the API.

Is it correct that we need to expose an async Rust wrapper, wrapping async C calls?

I would suggest size_t or ssize_t instead of int. E.g.

void h2k_mycurve_multi_scalar_mul(
        engine_t* engine_context,
        h2k_bn254_g1_prj* out,
        h2k_bn254_fr* scalars,
        h2k_bn254_g1_aff* points,
        size_t len)

See man 3 size_t.

mratsim commented 8 months ago

A tentative trait has been proposed here https://github.com/privacy-scaling-explorations/halo2curves/pull/107.

This is focused on stateless MSM, both to start the discussion and because it is the biggest bottleneck. Further primitives would be:

An implementation tutorial is available here:

adria0 commented 2 months ago

Implemented in #277