privacy-scaling-explorations / halo2

https://privacy-scaling-explorations.github.io/halo2/

[RFC] Blackboxing MSM and FFT - Hardware Accel API #216

Closed mratsim closed 2 months ago

mratsim commented 11 months ago

Goals

Following https://github.com/privacy-scaling-explorations/halo2curves/pull/86, MSM and FFT have been moved to halo2curves, per the rationale in https://github.com/privacy-scaling-explorations/halo2curves/issues/84

This allows the proof systems to evolve separately from the algebra backend, which could be:

This RFC maps the next step, blackboxing MSM and FFT in this repo so they can be swapped out by any provider of a C API.

This has the following benefits:

Flexible low-level API design

For reference, this kind of glue API is usually called a HAL, for "Hardware Abstraction Layer" or "Hardware Acceleration Layer".

The proposed API is inspired by Computer Vision and Neural Network acceleration libraries, which seem to be the domains most similar to ours in terms of constraints while also being the most mature:

Example accelerator APIs from Image Processing and Deep Learning

The salient parts of those APIs are the following:

All those APIs also return library-specific status codes; in our case we can assume that the calls cannot fail. In particular, out-of-memory will crash, which is already the case today.

Proposed API for MSM

The API is name-spaced with h2k for Halo2-KZG. We use BN254 as an example.

The C API for multi-scalar multiplication must have the following signature:

void h2k_mycurve_multi_scalar_mul(
        engine_t* engine_context,
        h2k_bn254_g1_prj* out,
        h2k_bn254_fr* scalars,
        h2k_bn254_g1_aff* points,
        int len)

ABI

For SNARK/pairing-friendly curves, all implementations use the same low-level representation of field elements and elliptic curve elements, which makes the following ABI a de facto standard.

We describe the ABI on 64-bit machines, with BN254 as an example:

#define pBits 254       // The number of bits in the curve field modulus
#define rBits 254       // The number of bits in the curve order
#define WordBitWidth 64 // The number of bits in Halo2-KZG words

#define words_required(bits) ((bits+WordBitWidth-1)/WordBitWidth)

typedef struct { uint64_t limbs[words_required(rBits)]; } bn254_fr;
typedef struct { uint64_t limbs[words_required(pBits)]; } bn254_fp;

typedef struct { bn254_fp x, y; } bn254_g1_aff;
typedef struct { bn254_fp x, y, z; } bn254_g1_prj;

Compatibility

AMD, Intel and Nvidia GPUs

As far as I am aware, all cryptographic libraries (CPU/GPU) use the same Montgomery magic constant in their representation, and their limb-endianness is little-endian.

Furthermore, we note that ALL GPU ISAs use little-endian words (Intel, AMD, Nvidia, Apple).

When words and limbs are both little-endian, whether we use 32-bit or 64-bit words will not change the binary representation of the whole data structure.

This means that there is no need for conversion between a little-endian machine (x86, ARM) and a GPU, even if words are 64-bit on one and 32-bit on the other. We can just cast/transmute between CPU and GPU pointers, which saves time and memory.

We caveat this for curves for which the number of limbs on 32-bit is not double that on 64-bit. As far as I'm aware this is only the case for P224 (7 32-bit words, 4 64-bit words), and it's not an interesting curve.

FPGA

FPGAs usually work in the canonical domain and use Barrett reduction instead of Montgomery, so a conversion will be needed; hence endianness/zero-copy doesn't matter.

Apple GPUs, MIPS, RISC-V, WASM

While Apple GPUs are also little-endian, there is one caveat: they do not support addition-with-carry, subtraction-with-borrow, or 32x32 -> 64 extended-precision multiplication, at least not officially (unlike AMD, Intel and Nvidia GPUs). Looking at the reverse-engineering effort from Asahi Linux (https://rosenzweig.io/), I have not seen those instructions either. Hence, for speed, an approach using 9x29-bit limbs (instead of 8x32-bit) would likely yield a faster implementation, so Apple GPUs would need a conversion step.

This approach is also likely necessary for accelerating MIPS, RISC-V and WASM targets.

Proposed async API

In some protocols we might want to process multiple MSMs/FFTs in parallel. For example, batch KZG verification described here: https://github.com/ethereum/consensus-specs/blob/v1.4.0-beta.2/specs/deneb/polynomial-commitments.md#verify_blob_kzg_proof_batch

does the following (with e a pairing function, and [a]₁ the scalar a multiplied by the generator of 𝔾₁):

e(∑ [rᵢ][proofᵢ]₁, [τ]₂) . e(∑[rᵢ]([commitmentᵢ]₁ - [eval_at_challengeᵢ]₁) + ∑[rᵢ][zᵢ][proofᵢ]₁, [-1]₂) = 1

We can launch the 3 MSMs in an async manner and await their readiness before doing the pairing.

We copy the CUDA API (for example cudaMemcpyAsync): we suffix the function name with _async and return an opaque future handle.

engine_future_t h2k_mycurve_multi_scalar_mul_async(
        engine_t* engine_context,
        h2k_bn254_g1_prj* out,
        h2k_bn254_fr* scalars,
        h2k_bn254_g1_aff* points,
        int len)

The engine should provide a wait function that allows blocking until the result is ready.

void engine_wait(engine_t* engine_context, engine_future_t handle)

This should be flexible enough to wrap Cuda Streams and OpenCL events, see also my high-level description of:

In Halo2, wait MUST be called once and only once. This allows memory reclamation of the handle and async task to be done before exiting wait, and it guarantees no double-free.

For flexibility in scheduling computation graphs, the future handle may escape its scope and be returned by the function that issued the async MSM call. The engine MUST support this use-case.

Init/shutdown

The engine should also provide an init and shutdown function, and may add configuration options there, like the number of threads for CPU backends or the target GPUs for GPU backends.

FFT

TBD

Naming

I propose we refer to the API as "Accel" in discussion, as we might want not only hardware accelerators but also other software backends.

Future considerations

For better acceleration, teams are considering running the whole prover on GPUs. It would be interesting to get their feedback on the bottlenecks and on whether the async API covers that.

JasonStarlight commented 11 months ago

Hi, this is great! It's a very good definition and description of the hardware API for MSM and FFT. We also have some other ideas to share with you. Below is our reference code on GitHub for the Halo2 hardware acceleration API. We access various hardware components, including some API and struct definitions, by establishing a "device manager" module, and we describe how halo2 interacts with it.

https://github.com/superscalar-io/halo2_device_sample/tree/master

We have tested the "device manager" module, and the results are correct and as expected. This project serves as an example with GPU devices and computational units for MSM and NTT. It can also support other devices and computational units.

Regarding the ABI, in my personal view, Halo2 establishes the raw data format, and hardware manufacturers may tailor it to align with their hardware characteristics. Additionally, regardless of the coordinate system used for hardware acceleration, the result is ultimately converted to Projective, in line with halo2. This approach is okay; however, it's important to be aware of hidden risks in the coordinate-system calculations and adaptations, such as the to_affine() operation, so extra consideration is needed in this area.

Welcome everyone to join the discussion, looking forward to your replies! Thanks!

einar-taiko commented 9 months ago

Great work.

Regarding the computation graph - could you elaborate on what model you had in mind here? I have a sense that the computation will need more info about the problem (e.g. the number of ops) but also about the available hardware to do any useful scheduling/ordering/distribution. Some of this info may have to be provided through the API.

Is it correct that we need to expose an async Rust wrapper, wrapping async C calls?

I would suggest size_t or ssize_t instead of int. E.g.

void h2k_mycurve_multi_scalar_mul(
        engine_t* engine_context,
        h2k_bn254_g1_prj* out,
        h2k_bn254_fr* scalars,
        h2k_bn254_g1_aff* points,
        size_t len)

See man 3 size_t.

mratsim commented 8 months ago

A tentative trait has been proposed here https://github.com/privacy-scaling-explorations/halo2curves/pull/107.

This is focused on stateless MSM, both to start the discussion and because it is the biggest bottleneck. Further primitives would be:

An implementation tutorial is available here:

adria0 commented 2 months ago

Implemented in #277