google / heir

A compiler for homomorphic encryption
https://heir.dev/
Apache License 2.0

Investigate what parts of https://github.com/heir-compiler/heir can be ported to HEIR #300

Open j2kun opened 11 months ago

j2kun commented 11 months ago

I think the main thing will be the scheme switching ideas.

Associated paper: https://eprint.iacr.org/2023/1445

j2kun commented 3 months ago

I agreed to scope the work for this.

j2kun commented 3 months ago

Summary of https://eprint.iacr.org/2023/1445.pdf

Section II

Key operations for switching between arithmetic and boolean ciphertexts:

Note (this is only stated later in the paper) that these can ONLY be applied to RLWE ciphertexts in the coefficient encoding, so the compiler must also insert encoding-switching ops.

The paper claims there are two kinds of repack implementations: most are based on decryption, but one is based on automorphisms. The difference appears to be that the decryption-based version has low bit precision (but is fast for a large number of ciphertexts), while the automorphism version has high bit precision but incompatible encodings. The paper claims to improve the high-precision repack for a small number of input LWE ciphertexts.

Section III

One challenge called out is the need to determine ciphertext levels and manage them; i.e., "automate the scheduling of ciphertext bootstrapping along with level management involving arithmetic-logic conversions."

Their middle-end is broken up into "transformation stage" (lower to FHE-friendly dialects, segment program) and "optimization stage" (lower to lwe/rlwe ops and do level management/parameter selection).

They have three dialects:

During parameter selection they do something I had not heard of before, called "multi-modulus bootstrapping", which "transforms the input ciphertext with a small ciphertext modulus into a ciphertext with a large modulus while retaining small ciphertext noises." They expand on that in Section V.

Section IV (program transformation)

They define three target types, FHEFloat, FHEVector, FHEMatrix, in the fhe dialect. Then they apply transforms like the conversions that @MeronZerihun is working on, converting ifs and fors to their branchless analogues, before doing program segmentation. They handle for with loop unrolling. They consider all program inputs as secret, and propagate secretness forward. This is already handled in HEIR by the secret dialect and usual lowerings. Then they somehow lower the ops down to fhe, though the details are not specified.
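To make the if/for-to-branchless conversion concrete, here is a plaintext toy sketch (plain Python ints, no encryption) of the two rewrites: full loop unrolling and if-to-select. The function names and the comparison-as-0/1 modeling are my own illustration, not taken from the paper's code.

```python
# Toy plaintext model of the branchless rewrites applied before
# program segmentation. Names are illustrative only.

def select(cond, a, b):
    # "if cond then a else b" as arithmetic; cond is 0 or 1.
    # In FHE this becomes a mul/add circuit with no branching.
    return cond * a + (1 - cond) * b

def min_value(xs):
    # The compiler fully unrolls the loop; each `if` becomes a select.
    m = xs[0]
    for x in xs[1:]:
        lt = 1 if x < m else 0  # in FHE, a comparison circuit producing 0/1
        m = select(lt, x, m)
    return m
```

The point of the rewrite is that control flow disappears: every iteration executes the same straight-line ops, which is what makes the program expressible as a circuit over ciphertexts.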

Program transformation: they start by mentioning that small amounts of arithmetic on few values can be done just fine in boolean FHE, and don't outweigh the cost of scheme switching.

Then they state their method: mark everything as non-RLWE, then look for regions where "scheme-switching can leverage SIMD capability." It seems like they reuse HECO's batching pass to identify such regions, and extend it to handle 2D matrices.

They now define their "mini-repack" op, which uses some "EvalTr" algorithm from https://eprint.iacr.org/2020/015 to put a single LWE ciphertext into the 0th coefficient of an RLWE ciphertext, then rotate via monomial mul, and accumulate many of these together via rlwe_add into a single ciphertext. This is known to be slow but should be fine for a small number of ciphertexts. They don't say how many.
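A plaintext toy model of the coefficient bookkeeping in mini-repack: imagine each input contributes a single value (standing in for the 0th coefficient produced by EvalTr), which is moved to coefficient i by multiplying by the monomial X^i in Z[X]/(X^n + 1), and the results are summed. This is only an illustration of the negacyclic rotation and accumulation; there is no encryption or noise modeled here, and the function names are mine.

```python
# Toy model: repack scalar values into coefficients of a single
# "polynomial" via monomial multiplication mod X^n + 1.

def monomial_mul(poly, i, n):
    # Multiply poly by X^i in Z[X]/(X^n + 1): a negacyclic rotation,
    # where coefficients that wrap past degree n-1 pick up a sign flip.
    out = [0] * n
    for j, c in enumerate(poly):
        k = j + i
        if k < n:
            out[k] += c
        else:
            out[k - n] -= c
    return out

def mini_repack(values, n):
    # Each value starts in coefficient 0 (as EvalTr would leave it),
    # gets rotated to coefficient i, and everything is accumulated.
    acc = [0] * n
    for i, v in enumerate(values):
        p = [0] * n
        p[0] = v
        acc = [a + b for a, b in zip(acc, monomial_mul(p, i, n))]
    return acc
```

In the real scheme each step is an RLWE op (monomial mul is cheap, but EvalTr is not), which is why this approach only pays off for a small number of input ciphertexts.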

Encoding optimization: since scheme switching requires encoding switching, they have a pass that decides on the best encoding throughout the program. They start with everything set to slot encoding, then identify program fragments that are better done as coefficient encoding. This looks like a combination of two greedily-applied patterns in Alg 4. Will inspect in detail in code later.

Subtlety described here: an inner product in coefficient encoding requires one ciphertext to have its coefficients reversed (because polynomial multiplication is a convolution of coefficients). Reversing the order of an RLWE ciphertext is done here by unpacking and repacking. I wonder as an aside if one could do it with an RLWE automorphism, and whether that would be faster.
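To see why the reversal is needed, here is a plaintext sketch: multiplying a(X) by b with its coefficients reversed makes the X^(n-1) coefficient of the product equal to the inner product ⟨a, b⟩, since every pair a_j·b_j lands at degree n-1 (and degree n-1 never wraps, so the negacyclic sign flip doesn't interfere). Function names are mine, not the paper's.

```python
# Plaintext demonstration of the coefficient-encoding inner product trick.

def negacyclic_mul(a, b):
    # Schoolbook multiplication in Z[X]/(X^n + 1); wrapped terms negate.
    n = len(a)
    out = [0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < n:
                out[k] += ai * bj
            else:
                out[k - n] -= ai * bj
    return out

def inner_product_coeff(a, b):
    # With b reversed, coefficient n-1 of the product collects exactly
    # sum_j a_j * b_j: the terms a_j * b_rev_{n-1-j} = a_j * b_j.
    n = len(a)
    return negacyclic_mul(a, b[::-1])[n - 1]
```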

Section V: Ciphertext level management

To deal with level management, they schedule (RLWE?) bootstrapping operations. The algorithm 2 appears to be a greedy-ish approach (i.e., they just count levels until it exceeds the max, then insert a bootstrap), though they also hard-code a maximum ciphertext level of 4 (max 4 mul ops before requiring a bootstrap) based on some benchmarks they looked at.
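The greedy schedule can be sketched as a simple walk over the op sequence: count consumed levels, and insert a bootstrap whenever the next op would exceed the budget. The MAX_LEVEL = 4 cap mirrors the paper's hard-coded choice; the per-op level costs below are illustrative guesses, not the paper's numbers.

```python
# Toy model of greedy bootstrap scheduling over a linear op sequence.

MAX_LEVEL = 4  # paper's hard-coded cap: at most 4 muls before bootstrap
LEVEL_COST = {"mul": 1, "add": 0, "encoding_switch": 2}  # illustrative

def schedule_bootstraps(ops):
    scheduled, level = [], 0
    for op in ops:
        cost = LEVEL_COST.get(op, 0)
        if level + cost > MAX_LEVEL:
            scheduled.append("bootstrap")  # bootstrap resets level to zero
            level = 0
        scheduled.append(op)
        level += cost
    return scheduled
```

A real pass would run this over a dataflow graph rather than a flat list (levels merge at joins), but the counting logic is the same.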

What's not clear to me here is how they are handling non-RLWE stuff. It looks like they assign levels to LWE ciphertexts, and have the LUT op from the RLWE dialect reset the level to zero, so I guess this implies scheme-switching does not consume levels (though they mention encoding-switching consumes 2 levels).

Their bootstrap is also something called "multi-modulus bootstrap," which is new to me. It looks very similar to standard CGGI PBS, and so it only accepts LWE as input, which leaves me unclear on how they are doing level management for RLWE ciphertexts (maybe they are forced to switch to LWE, bootstrap there, then switch back to RLWE?). It looks like this bootstrap outputs LWE ciphertexts at a given "level" k (a given modulus in the RLWE modulus chain) and requires a separate bootstrapping key for each target level k (expensive!). Then they make a special test polynomial based on the target modulus, and continue with PBS as usual.

This ends the paper and I'll go through the GH codebase next.

j2kun commented 3 months ago

Summary of https://github.com/heir-compiler/heir

Dialects

There appears to actually be only one dialect, "HEIR", which has all the ops and the separation above is logical, not dialect...ical.

The authors seem to take a few shortcuts by making all ops have AnyType as their operands, presumably because type conversions in the dialect conversion framework are annoying. They also seem to duplicate a bunch of "lut"-like ops, such as lut_half ("lut operation for add op in min value"), lut_abshalf, lut_gt, etc. Plain lut is described as "lut operation without function evaluation", which is just a bootstrap, not a lut.

Passes

Pass headers have no documentation, no tablegen.

The repo also does not contain a tools path for the source of the executables, so I cannot tell how their heir-opt combines the passes together, nor is their emitc-translate tool available.

This is a shame because it seems like there is significant work happening there. None of the program segmentation code appears to be in the repository, nor any optimization to decide whether scheme-switching is worthwhile, nor the encoding-optimization passes. FHERepackOp is never created by any pass, nor referenced by any other code (it seems they may be using the hard-coded euclidean distance and inner product patterns and then matching on them later to generate "repack" operations). If that is true, the "program segmentation" is just a copy of HECO batching, and scheme switching is performed specifically for these two operations, which happen to be the central operations tested in the benchmark.

But still, the evaluation scripts include a flag --logic-emitc which is not defined anywhere else, that appears to hide any additional logic that is missing from the repo. So there may be more.

In the end this all feels rather disappointing. Most of what I see in this project we either already have (multiplicative depth computations, HECO batching) or are in the process of building (if-to-select-like lowerings).

The only thing that seems worth porting is the tricks for computing coefficient-form operations in the slot2coeff pass. But that also seems like a low priority until we have encoding-switching operations.

AlexanderViand-Intel commented 3 months ago

Thanks for the great writeup! It seems like this isn't "real" open source and the most interesting bits are indeed missing. They have a reference to a "Docker version of HEIR" in their README, so I guess the full version only exists as a binary release :(

j2kun commented 3 months ago

Here's a possible sketch of what we could do to get what appears to be feature parity with that project:

Get test inputs

Use Polygeist to convert some of their C test inputs to MLIR.

The big one would be k-means: https://github.com/heir-compiler/HEIR/blob/main/benchmarks/e2e/kmeans.c#L30-L94

A good starter might be their "min index"/"min value" at https://github.com/heir-compiler/HEIR/blob/main/benchmarks/logic/min_index.c

Their prep steps:

```shell
cgeist ./benchmarks/logic/min_value.c \
    -function=min_value -S \
    -raise-scf-to-affine \
    --memref-fullrank -O0 \
    >> min_value.mlir
```

Once we have this, apply the existing HECO pass pipeline to see what kinds of batching naturally occur (or what gaps there are in support for, e.g., inequality comparisons).

Add new ops to LWE toolbox dialect

Insert scheme switching ops

The heirSIMDVectorizerPipelineBuilder will probably work well to start, and then we need a new lowering from secret that will insert scheme switching ops at the layers between vectorized and non-vectorized ops. Because this is parity with heir-compiler/heir, we are assuming the batching pipeline figured out the segmentation and now we just need to make it type-compatible.

So we would need something like secret-to-fhe where the RLWE scheme can be selected as an option, I suppose, and code with secret-to-bgv is refactored appropriately to be reused here. Eventually I imagine this would have bgv and ckks and bfv as "fhe" targets.

Expand openfhe lowering to support CGGI and new scheme switching ops

I think since there is no risk of type conflicts in these lowerings, we can just add a new cggi-to-openfhe and lwe-to-openfhe to handle the CGGI and scheme switching ops. If not, we could expand bgv-to-openfhe to support all schemes and the lwe toolbox dialect ops, and rename it scheme-api-to-openfhe or similar.

Expand OpenFHE codegen to support scheme switching

The scheme switching APIs (e.g., EvalCKKStoFHEW) are at a higher level than pack/unpack/encoding switch, and so we would either need to reimplement the pack/unpack ops using lower-level OpenFHE primitives (risky) or else raise the scheme switching ops to higher-level OpenFHE API calls.

Some notes:

Assemble into a new HECO-like pipeline

After all the unrolling and HECO batching, lower from secret and then apply add-client-interface and scheme-api-to-openfhe.

[optional] Implement slot2coeff tricks

In a separate pass, match and implement the slot2coeff tricks for doing inner products and euclidean distance. I'm not sure how we would necessarily codegen this unless OpenFHE supports encoding switching as part of their public API.

j2kun commented 3 months ago

The biggest part of the proposal above that I don't like is the raising from lower-level pack/unpack ops to higher-level OpenFHE API calls. I would prefer if we had a higher-level scheme_switch dialect that could represent high-level API calls to switch between named schemes, which would lower to pack/unpack as needed when targeting polynomial, but go directly to OpenFHE API calls otherwise.

So let me add:

Add a new scheme_switch dialect

Depends on all the scheme API dialects (cggi, bgv, ckks, etc.) and has dedicated ops for each (source, target) scheme switch transformation that exists.


The second thing I don't like is that we don't have principled program segmentation or optimization involved. If we want to do this "right", then we would insert scheme switching ops as the output of some optimization pass that has a cost model for the cost of scheme switching and the cost of individual ops in each scheme, and determines the right time to scheme switch. It would ideally operate at the secret level, similar to the HECO batching pass, but be able to say:

A smarter analysis would do this jointly across the entire basic block (say, via a linear program, but that is my default idea), but we could start by doing it for each op individually.
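The per-op greedy baseline could look like the sketch below: a cost table for each op in each scheme plus a flat switch charge, switching only when an op's saving exceeds the switch cost. Every number and op name here is invented for illustration; a real pass would derive costs from benchmarks and would reason jointly over the block (e.g., via the linear program mentioned above) to amortize switches across consecutive ops.

```python
# Hypothetical per-op greedy scheme assignment with a toy cost model.
# Costs are made-up illustrative numbers, not measurements.

OP_COST = {
    "add": {"rlwe": 1,   "cggi": 50},
    "mul": {"rlwe": 10,  "cggi": 100},
    "cmp": {"rlwe": 500, "cggi": 20},  # comparisons favor boolean FHE
}
SWITCH_COST = 200  # flat charge for one scheme switch

def assign_schemes(ops, start="rlwe"):
    current, plan = start, []
    for op in ops:
        costs = OP_COST[op]
        other = "cggi" if current == "rlwe" else "rlwe"
        # Greedy rule: switch only if running this one op in the other
        # scheme, plus the switch itself, beats staying put. A joint
        # analysis would instead amortize the switch over a whole region.
        if costs[other] + SWITCH_COST < costs[current]:
            current = other
        plan.append((op, current))
    return plan
```

Even this crude rule captures the qualitative behavior the paper wants: arithmetic stays in RLWE, and a comparison-heavy stretch pulls the program into CGGI (and keeps it there until switching back pays for itself).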