j2kun opened 11 months ago
I agreed to scope the work for this.
Key operations for switching between arithmetic and boolean ciphertexts:
Note (though this is only stated later) that these can ONLY be applied to RLWE ciphertexts in the coefficient encoding, so the compiler must also insert encoding-switching ops.
The paper claims there are two implementations of repack: one based on decryption and one based on automorphisms. The difference appears to be that the decryption-based version has low bit precision (but is fast for a large number of ciphertexts), while the automorphism-based version has high bit precision but incompatible encodings. The paper claims to improve the high-precision repack for a small number of input LWE ciphertexts.
One challenge called out is the need to determine ciphertext levels and manage them; i.e., "automate the scheduling of ciphertext bootstrapping along with level management involving arithmetic-logic conversions."
Their middle-end is broken up into "transformation stage" (lower to FHE-friendly dialects, segment program) and "optimization stage" (lower to lwe/rlwe ops and do level management/parameter selection).
They have three dialects:
During parameter selection they do something I had not heard of before, called "multi-modulus bootstrapping", which "transforms the input ciphertext with a small ciphertext modulus into a ciphertext with a large modulus while retaining small ciphertext noises." They expand on that in Section V.
They define three target types, FHEFloat, FHEVector, FHEMatrix, in the `fhe` dialect. Then they apply transforms like the conversions that @MeronZerihun is working on, converting `if`s and `for`s to their branchless analogues, before doing program segmentation. They handle `for` with loop unrolling.
They consider all program inputs as secret, and propagate secretness forward. This is already handled in HEIR by the `secret` dialect and usual lowerings.
Then they somehow lower the ops down to `fhe`, though the details are not specified.
Program transformation: they start by noting that small amounts of arithmetic on a few values can be done just fine in boolean FHE, where the gains from switching would not outweigh the cost of scheme switching.
Then they state their method: call everything non-RLWE, then look for regions where "scheme-switching can leverage SIMD capability." It seems like they reuse HECO's batching pass to identify such regions, and extend it to handle 2D matrices.
They now define their "mini-repack" op, which uses some "EvalTr" algorithm from https://eprint.iacr.org/2020/015 to put a single LWE ciphertext into the 0th coefficient of an RLWE ciphertext, then rotate via monomial mul, and accumulate many of these together via rlwe_add into a single ciphertext. This is known to be slow but should be fine for a small number of ciphertexts. They don't say how many.
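As a toy illustration of the mini-repack pattern (ignoring encryption entirely; all names here are my own invention), the coefficient bookkeeping is: embed each value in coefficient 0, shift it to slot i via a monomial multiplication mod X^N + 1, and accumulate:

```python
# Toy model of mini-repack at the plaintext-coefficient level. Each
# scalar plays the role of an LWE ciphertext already placed (via EvalTr)
# in the constant coefficient of an RLWE element; multiplication by the
# monomial X^i rotates it into coefficient i, and plain addition stands
# in for rlwe_add.

N = 8  # toy ring dimension

def monomial_mul(poly, k):
    """Multiply poly by X^k in Z[X]/(X^N + 1) (negacyclic shift)."""
    out = [0] * N
    for i, c in enumerate(poly):
        j = (i + k) % N
        # wrapping past degree N-1 picks up a sign flip mod X^N + 1
        out[j] += c if (i + k) < N else -c
    return out

def mini_repack(values):
    """Pack scalars into successive coefficients of one polynomial."""
    acc = [0] * N
    for i, v in enumerate(values):
        embedded = [v] + [0] * (N - 1)  # value sits in coefficient 0
        acc = [a + b for a, b in zip(acc, monomial_mul(embedded, i))]
    return acc

print(mini_repack([3, 1, 4, 1]))  # [3, 1, 4, 1, 0, 0, 0, 0]
```

This also makes the cost intuition visible: one monomial multiplication and one addition per input ciphertext, i.e., linear in the number of LWE inputs, which is why it is only sensible for a small number of them.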
Encoding optimization: since scheme switching requires encoding switching, they have a pass that decides on the best encoding throughout the program. They start with everything set to slot encoding, then identify program fragments that are better done as coefficient encoding. This looks like a combination of two greedily-applied patterns in Alg 4. Will inspect in detail in code later.
Subtlety described here: inner product in coefficient encoding requires one ciphertext have its coefficients reversed (bc poly mul = convolution of coeffs). Reversing order of an RLWE ciphertext is done here by unpacking and repacking. I wonder as an aside if one could do it with an RLWE automorphism and if that would be faster.
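A quick plaintext check of that subtlety: since polynomial multiplication convolves coefficients, multiplying a(X) by b(X) with its coefficients reversed drops the inner product into the coefficient of X^(n-1):

```python
# Why coefficient-encoded inner products need one operand reversed:
# the degree-(n-1) coefficient of a(X) * reverse(b)(X) is exactly <a, b>.

def poly_mul(a, b):
    """Schoolbook polynomial multiplication (no ring reduction)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

a = [1, 2, 3, 4]
b = [5, 6, 7, 8]
prod = poly_mul(a, list(reversed(b)))
n = len(a)
print(prod[n - 1])                       # 70
print(sum(x * y for x, y in zip(a, b)))  # 70, the inner product
```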
To deal with level management, they schedule (RLWE?) bootstrapping operations. The algorithm 2 appears to be a greedy-ish approach (i.e., they just count levels until it exceeds the max, then insert a bootstrap), though they also hard-code a maximum ciphertext level of 4 (max 4 mul ops before requiring a bootstrap) based on some benchmarks they looked at.
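A minimal sketch of that greedy scheduling idea as I read it, with made-up op names and only multiplications consuming levels:

```python
# Greedy bootstrap scheduling in the spirit of their Algorithm 2: walk
# the ops in order, charge each multiplication one level, and insert a
# bootstrap whenever the running level would exceed the hard-coded
# maximum of 4. This is my reconstruction, not their code.

MAX_LEVEL = 4

def schedule_bootstraps(ops):
    """ops: list of op names; returns list with 'bootstrap' inserted."""
    scheduled, level = [], 0
    for op in ops:
        cost = 1 if op == "mul" else 0  # only muls consume a level here
        if level + cost > MAX_LEVEL:
            scheduled.append("bootstrap")
            level = 0  # bootstrap resets the ciphertext level
        scheduled.append(op)
        level += cost
    return scheduled

ops = ["mul", "add", "mul", "mul", "mul", "mul", "add"]
print(schedule_bootstraps(ops))
# ['mul', 'add', 'mul', 'mul', 'mul', 'bootstrap', 'mul', 'add']
```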
What's not clear to me here is how they are handling non-RLWE stuff. It looks like they assign levels to LWE ciphertexts, and have the LUT op from the RLWE dialect reset the level to zero, so I guess this implies scheme switching does not consume levels (though they mention encoding switching consumes 2 levels).
Their bootstrap is also something called "Multi-modulus bootstrap," which is new to me. It looks very similar to standard CGGI PBS, and so it only accepts LWE as inputs, which makes me unclear on how they are doing level management for RLWE ciphertexts (maybe they are forced to switch to LWE, bootstrap there, then reverse back to RLWE?). It looks like this bootstrap outputs LWE ciphertexts at a given "level" k (a given modulus in the RLWE modulus chain) and requires a separate bootstrapping key for each target level k (expensive!). Then they make a special test polynomial based on the target modulus, and continue with PBS as usual.
This ends the paper and I'll go through the GH codebase next.
There appears to actually be only one dialect, "HEIR", which has all the ops and the separation above is logical, not dialect...ical.
The authors seem to take a few shortcuts by making all ops take AnyType operands, presumably because type conversion in the dialect conversion framework is annoying. They also seem to duplicate a bunch of "lut"-like ops, such as `lut_half` ("lut operation for add op in min value"), `lut_abshalf`, `lut_gt`, etc. Just plain `lut` is "lut operation without function evaluation", which is just a bootstrap, not a lut. Conditionals appear to be lowered to the branchless form `then_value * condition + (1 - condition) * else_value`.
`GenBootParams` here seems to compute the "parameters" for a TFHE bootstrap op `boot_op` by computing 1 + the max multiplicative depth of the subcircuits starting from the uses of the output of `boot_op`. I.e., for each use, they (recursively) scan forward through the IR from that use, and take the max over all such uses. That 1+max_level argument is attached to the LUT op as an attribute, presumably used later to select parameters. This is also done for the input arguments, which are presumed to be cleanly bootstrapped but need special handling since they're block args.
They use `polygeist_eliminator.py` to replace the polygeist op that loads a vector from a matrix with FHEVectorLoadOp. Pass headers have no documentation, no tablegen.
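The forward depth scan for `GenBootParams` described above can be sketched as follows; the circuit representation and all names here are hypothetical, not their IR:

```python
# Sketch of the GenBootParams idea: from each use of a bootstrap's
# output, recursively scan forward through the (DAG-shaped) circuit and
# take 1 + the max multiplicative depth seen along any path.

def mul_depth(op, circuit, memo=None):
    """Max count of 'mul' ops on any forward path starting at op."""
    if memo is None:
        memo = {}
    if op in memo:
        return memo[op]
    kind, users = circuit[op]
    below = max((mul_depth(u, circuit, memo) for u in users), default=0)
    memo[op] = below + (1 if kind == "mul" else 0)
    return memo[op]

# op -> (kind, list of ops that consume its result)
circuit = {
    "boot": ("lut", ["m1", "a1"]),
    "m1": ("mul", ["m2"]),
    "m2": ("mul", []),
    "a1": ("add", []),
}
_, uses = circuit["boot"]
boot_param = 1 + max(mul_depth(u, circuit) for u in uses)
print(boot_param)  # 3: the path boot -> m1 -> m2 has depth 2, plus 1
```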
The repo also does not contain a `tools` path for the source of the executables, so I cannot tell how their `heir-opt` combines the passes together, nor is their `emitc-translate` tool available.
This is a shame because it seems like there is significant work happening there. None of the program segmentation code appears to be in the repository, nor any optimization to decide whether scheme switching is worthwhile, nor the encoding-optimization pass. FHERepackOp is never created by any pass, nor referenced by any code (it seems they may be using the hard-coded euclidean distance and inner product patterns and then matching on them later to generate "repack" operations). If that is true, the "program segmentation" is just a copy of HECO batching, and scheme switching is performed specifically for these two operations, which happen to be the central operations tested in the benchmark.
But still, the evaluation scripts include a flag `--logic-emitc`, which is not defined anywhere else and which appears to correspond to additional logic that is missing from the repo. So there may be more.
In the end this all feels rather disappointing. Most of what I see in this project we either already have (multiplicative depth computations, HECO batching) or are in the process of building (if-to-select-like lowerings).
The only thing that seems worth porting is the tricks for computing coefficient-form operations in the slot2coeff pass. But that also seems like a low priority until we have encoding-switching operations.
Thanks for the great writeup! It seems like this isn't "real" open source and the most interesting bits are indeed missing. They have a reference to a "Docker version of HEIR" in their README, so I guess the full version only exists as a binary release :(
Here's a possible sketch of what we could do to get what appears to be feature parity with that project:
Use polygeist to convert some of their C test inputs to MLIR.
The big one would be k-means: https://github.com/heir-compiler/HEIR/blob/main/benchmarks/e2e/kmeans.c#L30-L94
A good starter might be their "min index"/"min value" at https://github.com/heir-compiler/HEIR/blob/main/benchmarks/logic/min_index.c
Their prep steps:
```
cgeist ./benchmarks/logic/min_value.c \
  -function=min_value -S \
  -raise-scf-to-affine \
  --memref-fullrank -O0 \
  >> min_value.mlir
```
Once we have this, apply the existing HECO pass pipeline to see what kinds of batching naturally occur (or what gaps there are in support for, e.g., inequality comparisons).
- `lwe.pack`: convert a vector of LWE ciphertexts into an RLWE ciphertext.
- `lwe.unpack`: take as input an RLWE ciphertext and an index, and extract the LWE ciphertext corresponding to the plaintext coefficient at that index. Note this only works with the coefficient encoding.
- `lwe.switch_encoding`: take as static attributes the source and target encoding of an RLWE ciphertext, and an SSA value for the RLWE ciphertext to convert in the source encoding, and output an RLWE ciphertext in the target encoding. Lowerings would have to match on the attributes to determine which encodings are legal, and we could also add a verifier to check at parse time.

The `heirSIMDVectorizerPipelineBuilder` will probably work well to start, and then we need a new lowering from `secret` that will insert scheme switching ops at the layers between vectorized and non-vectorized ops. Because this is parity with heir-compiler/heir, we are assuming the batching pipeline figured out the segmentation and now we just need to make it type-compatible.
So we would need something like `secret-to-fhe`, where the RLWE scheme can be selected as an option, I suppose, and code from `secret-to-bgv` is refactored appropriately to be reused here. Eventually I imagine this would have `bgv`, `ckks`, and `bfv` as "fhe" targets.
I think since there is no risk of type conflicts in these lowerings, we can just add a new `cggi-to-openfhe` and `lwe-to-openfhe` to handle the CGGI and scheme switching ops. If not, we could expand `bgv-to-openfhe` to support all schemes and the lwe toolbox dialect ops, and rename it `scheme-api-to-openfhe` or similar.
The scheme switching APIs are at a higher level than pack/unpack/encoding switch (`EvalCKKStoFHEW`), and so we would either need to reimplement the pack/unpack ops using lower-level OpenFHE primitives (risky) or else raise the scheme switching ops to higher-level OpenFHE API calls.
Some notes:
- The `min_value` example from heir-compiler/heir is included as `argmin` with four variants: https://github.com/openfheorg/openfhe-development/blob/main/src/pke/examples/scheme-switching.cpp#L60
- There is an `EvalCKKStoFHEWPrecompute` that can be separated from the evaluation.
- For `compare` and `min`/`max`, see e.g., https://github.com/openfheorg/openfhe-development/blob/13bf46f683483da1fe77f591b98035fa455740c6/src/pke/lib/scheme/ckksrns/ckksrns-schemeswitching.cpp#L1920

After all the unrolling and HECO batching, lower from secret and then apply add-client-interface and scheme-api-to-openfhe.
In a separate pass, match and implement the slot2coeff tricks for doing inner products and euclidean distance. I'm not sure how we would necessarily codegen this unless OpenFHE supports encoding switching as part of their public API.
The biggest part of the proposal above that I don't like is the raising from lower-level pack/unpack ops to higher-level OpenFHE API calls. I would prefer if we had a higher-level `scheme_switch` dialect that could represent high-level API calls to switch between named schemes, one that lowers to pack/unpack as needed, but goes directly to OpenFHE API calls when not lowering to polynomial.
So let me add:
Depends on all the scheme API dialects (`cggi`, `bgv`, `ckks`, etc.) and has dedicated ops for each (source, target) scheme switch transformation that exists.
The second thing I don't like is that we don't have principled program segmentation or optimization involved. If we want to do this "right", then we would insert scheme switching ops as the output of some optimization pass that has a cost model for the cost of scheme switching and the cost of individual ops in each scheme, and determines the right time to scheme switch. It would ideally operate at the `secret` level, similar to the HECO batching pass, but be able to decide, per op, which scheme each op should run in.
A smarter analysis would do this jointly across the entire basic block (say, via a linear program, but that is my default idea), but we could start by doing it for each op individually.
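For concreteness, a toy version of the per-op greedy variant, with entirely made-up costs and names (a real pass would use measured per-scheme op costs):

```python
# Per-op greedy scheme assignment: each op has a cost in the arithmetic
# and boolean schemes; switch schemes only when the op-level saving
# exceeds a fixed switching cost. All numbers are illustrative only.

SWITCH_COST = 10

COSTS = {            # op -> (arithmetic cost, boolean cost)
    "add": (1, 5),
    "mul": (2, 8),
    "cmp": (50, 3),  # comparisons are the classic boolean-FHE win
}

def assign_schemes(ops):
    """Return a plan of (op, scheme) entries with 'switch' ops inserted."""
    plan, current = [], "arith"
    for op in ops:
        arith_c, bool_c = COSTS[op]
        best = "arith" if arith_c <= bool_c else "bool"
        saving = abs(arith_c - bool_c)
        if best != current and saving > SWITCH_COST:
            plan.append(("switch", best))
            current = best
        plan.append((op, current))
    return plan

print(assign_schemes(["add", "cmp", "mul"]))
# the cmp justifies a switch to boolean; the trailing mul does not
# justify switching back on its own
```

The joint (e.g., linear-program) version would instead pick the scheme assignment minimizing total cost over the whole block, so a run of cheap arithmetic ops after a comparison could collectively justify a switch back.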
I think the main thing will be the scheme switching ideas.
Associated paper: https://eprint.iacr.org/2023/1445