j2kun opened 11 months ago
I agreed to scope the work for this.
Key operations for switching between arithmetic and boolean ciphertexts:
Note (though this is only stated later) that these can ONLY be applied to RLWE ciphertexts in the coefficient encoding, so the compiler must also insert encoding-switching ops.
The paper claims there are two implementations of repack: one based on decryption and one based on automorphisms. The difference appears to be that the decryption-based version has low bit precision (but is fast for a large number of ciphertexts), while the automorphism-based version has high bit precision but incompatible encodings. The paper claims to improve the high-precision repack for a small number of input LWE ciphertexts.
One challenge called out is the need to determine ciphertext levels and manage them; i.e., "automate the scheduling of ciphertext bootstrapping along with level management involving arithmetic-logic conversions."
Their middle-end is broken up into "transformation stage" (lower to FHE-friendly dialects, segment program) and "optimization stage" (lower to lwe/rlwe ops and do level management/parameter selection).
They have three dialects:
During parameter selection they do something I had not heard of before, called "multi-modulus bootstrapping", which "transforms the input ciphertext with a small ciphertext modulus into a ciphertext with a large modulus while retaining small ciphertext noises." They expand on that in Section V.
They define three target types, FHEFloat, FHEVector, FHEMatrix, in the `fhe` dialect. Then they apply transforms like the conversions that @MeronZerihun is working on, converting `if`s and `for`s to their branchless analogues, before doing program segmentation. They handle `for` with loop unrolling.
They consider all program inputs as secret, and propagate secretness forward. This is already handled in HEIR by the `secret` dialect and usual lowerings.
Then they somehow lower the ops down to `fhe`, though the details are not specified.
Program transformation: they start by noting that small amounts of arithmetic on a few values can be done just fine in boolean FHE, where the gains from switching would not outweigh the cost of scheme switching.
Then they state their method: call everything non-RLWE, then look for regions where "scheme-switching can leverage SIMD capability." It seems like they reuse HECO's batching pass to identify such regions, and extend it to handle 2D matrices.
They now define their "mini-repack" op, which uses some "EvalTr" algorithm from https://eprint.iacr.org/2020/015 to put a single LWE ciphertext into the 0th coefficient of an RLWE ciphertext, then rotate via monomial mul, and accumulate many of these together via rlwe_add into a single ciphertext. This is known to be slow but should be fine for a small number of ciphertexts. They don't say how many.
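As a toy illustration of the mini-repack pattern (ignoring encryption entirely; all names here are my own invention), the coefficient bookkeeping is: embed each value in coefficient 0, shift it to slot i via a monomial multiplication mod X^N + 1, and accumulate:

```python
# Toy model of mini-repack at the plaintext-coefficient level. Each
# scalar plays the role of an LWE ciphertext already placed (via EvalTr)
# in the constant coefficient of an RLWE element; multiplication by the
# monomial X^i rotates it into coefficient i, and plain addition stands
# in for rlwe_add.

N = 8  # toy ring dimension

def monomial_mul(poly, k):
    """Multiply poly by X^k in Z[X]/(X^N + 1) (negacyclic shift)."""
    out = [0] * N
    for i, c in enumerate(poly):
        j = (i + k) % N
        # wrapping past degree N-1 picks up a sign flip mod X^N + 1
        out[j] += c if (i + k) < N else -c
    return out

def mini_repack(values):
    """Pack scalars into successive coefficients of one polynomial."""
    acc = [0] * N
    for i, v in enumerate(values):
        embedded = [v] + [0] * (N - 1)  # value sits in coefficient 0
        acc = [a + b for a, b in zip(acc, monomial_mul(embedded, i))]
    return acc

print(mini_repack([3, 1, 4, 1]))  # [3, 1, 4, 1, 0, 0, 0, 0]
```

This also makes the cost intuition visible: one monomial multiplication and one addition per input ciphertext, i.e., linear in the number of LWE inputs, which is why it is only sensible for a small number of them.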
Encoding optimization: since scheme switching requires encoding switching, they have a pass that decides on the best encoding throughout the program. They start with everything set to slot encoding, then identify program fragments that are better done as coefficient encoding. This looks like a combination of two greedily-applied patterns in Alg 4. Will inspect in detail in code later.
Subtlety described here: inner product in coefficient encoding requires one ciphertext have its coefficients reversed (bc poly mul = convolution of coeffs). Reversing order of an RLWE ciphertext is done here by unpacking and repacking. I wonder as an aside if one could do it with an RLWE automorphism and if that would be faster.
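A quick plaintext check of that subtlety: since polynomial multiplication convolves coefficients, multiplying a(X) by b(X) with its coefficients reversed drops the inner product into the coefficient of X^(n-1):

```python
# Why coefficient-encoded inner products need one operand reversed:
# the degree-(n-1) coefficient of a(X) * reverse(b)(X) is exactly <a, b>.

def poly_mul(a, b):
    """Schoolbook polynomial multiplication (no ring reduction)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

a = [1, 2, 3, 4]
b = [5, 6, 7, 8]
prod = poly_mul(a, list(reversed(b)))
n = len(a)
print(prod[n - 1])                       # 70
print(sum(x * y for x, y in zip(a, b)))  # 70, the inner product
```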
To deal with level management, they schedule (RLWE?) bootstrapping operations. The algorithm 2 appears to be a greedy-ish approach (i.e., they just count levels until it exceeds the max, then insert a bootstrap), though they also hard-code a maximum ciphertext level of 4 (max 4 mul ops before requiring a bootstrap) based on some benchmarks they looked at.
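A minimal sketch of that greedy scheduling idea as I read it, with made-up op names and only multiplications consuming levels:

```python
# Greedy bootstrap scheduling in the spirit of their Algorithm 2: walk
# the ops in order, charge each multiplication one level, and insert a
# bootstrap whenever the running level would exceed the hard-coded
# maximum of 4. This is my reconstruction, not their code.

MAX_LEVEL = 4

def schedule_bootstraps(ops):
    """ops: list of op names; returns list with 'bootstrap' inserted."""
    scheduled, level = [], 0
    for op in ops:
        cost = 1 if op == "mul" else 0  # only muls consume a level here
        if level + cost > MAX_LEVEL:
            scheduled.append("bootstrap")
            level = 0  # bootstrap resets the ciphertext level
        scheduled.append(op)
        level += cost
    return scheduled

ops = ["mul", "add", "mul", "mul", "mul", "mul", "add"]
print(schedule_bootstraps(ops))
# ['mul', 'add', 'mul', 'mul', 'mul', 'bootstrap', 'mul', 'add']
```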
What's not clear to me here is how they are handling non-RLWE stuff. It looks like they assign levels to LWE ciphertexts, and have the LUT op from the RLWE dialect reset the level to zero, so I guess this implies scheme switching does not consume levels (though they mention encoding switching consumes 2 levels).
Their bootstrap is also something called "Multi-modulus bootstrap," which is new to me. It looks very similar to standard CGGI PBS, and so it only accepts LWE as inputs, which makes me unclear on how they are doing level management for RLWE ciphertexts (maybe they are forced to switch to LWE, bootstrap there, then reverse back to RLWE?). It looks like this bootstrap outputs LWE ciphertexts at a given "level" k (a given modulus in the RLWE modulus chain) and requires a separate bootstrapping key for each target level k (expensive!). Then they make a special test polynomial based on the target modulus, and continue with PBS as usual.
This ends the paper and I'll go through the GH codebase next.
There appears to actually be only one dialect, "HEIR", which has all the ops and the separation above is logical, not dialect...ical.
The authors seem to take a few shortcuts by making all ops take AnyType operands, presumably because type conversion in the dialect conversion framework is annoying. They also seem to duplicate a bunch of "lut"-like ops, such as `lut_half` ("lut operation for add op in min value"), `lut_abshalf`, `lut_gt`, etc. Just plain `lut` is "lut operation without function evaluation", which is just a bootstrap, not a lut. Conditionals appear to be lowered to the branchless form `then_value * condition + (1 - condition) * else_value`.
`GenBootParams` here seems to compute the "parameters" for a TFHE bootstrap op `boot_op` by computing 1 + the max multiplicative depth of the subcircuits starting from the uses of the output of `boot_op`. I.e., for each use, they (recursively) scan forward through the IR from that use, and take the max over all such uses. That 1+max_level argument is attached to the LUT op as an attribute, presumably used later to select parameters. This is also done for the input arguments, which are presumed to be cleanly bootstrapped but need special handling since they're block args.
They use `polygeist_eliminator.py` to replace the polygeist op that loads a vector from a matrix with FHEVectorLoadOp. Pass headers have no documentation, no tablegen.
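The forward depth scan for `GenBootParams` described above can be sketched as follows; the circuit representation and all names here are hypothetical, not their IR:

```python
# Sketch of the GenBootParams idea: from each use of a bootstrap's
# output, recursively scan forward through the (DAG-shaped) circuit and
# take 1 + the max multiplicative depth seen along any path.

def mul_depth(op, circuit, memo=None):
    """Max count of 'mul' ops on any forward path starting at op."""
    if memo is None:
        memo = {}
    if op in memo:
        return memo[op]
    kind, users = circuit[op]
    below = max((mul_depth(u, circuit, memo) for u in users), default=0)
    memo[op] = below + (1 if kind == "mul" else 0)
    return memo[op]

# op -> (kind, list of ops that consume its result)
circuit = {
    "boot": ("lut", ["m1", "a1"]),
    "m1": ("mul", ["m2"]),
    "m2": ("mul", []),
    "a1": ("add", []),
}
_, uses = circuit["boot"]
boot_param = 1 + max(mul_depth(u, circuit) for u in uses)
print(boot_param)  # 3: the path boot -> m1 -> m2 has depth 2, plus 1
```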
The repo also does not contain a `tools` path for the source of the executables, so I cannot tell how their `heir-opt` combines the passes together, nor is their `emitc-translate` tool available.
This is a shame because it seems like there is significant work happening there. None of the program segmentation code appears to be in the repository, nor any optimization to decide whether scheme switching is worthwhile, nor the encoding-optimization pass. FHERepackOp is never created by any pass, nor referenced by any code (it seems they may be using the hard-coded euclidean distance and inner product patterns and then matching on them later to generate "repack" operations). If that is true, the "program segmentation" is just a copy of HECO batching, and scheme switching is performed specifically for these two operations, which happen to be the central operations tested in the benchmark.
But still, the evaluation scripts include a flag `--logic-emitc`, which is not defined anywhere else and which appears to correspond to additional logic that is missing from the repo. So there may be more.
In the end this all feels rather disappointing. Most of what I see in this project we either already have (multiplicative depth computations, HECO batching) or are in the process of building (if-to-select-like lowerings).
The only thing that seems worth porting is the tricks for computing coefficient-form operations in the slot2coeff pass. But that also seems like a low priority until we have encoding-switching operations.
Thanks for the great writeup! It seems like this isn't "real" open source and the most interesting bits are indeed missing. They have a reference to a "Docker version of HEIR" in their README, so I guess the full version only exists as a binary release :(
Here's a possible sketch of what we could do to get what appears to be feature parity with that project:
Use polygeist to convert some of their C test inputs to MLIR.
The big one would be k-means: https://github.com/heir-compiler/HEIR/blob/main/benchmarks/e2e/kmeans.c#L30-L94
A good starter might be their "min index"/"min value" at https://github.com/heir-compiler/HEIR/blob/main/benchmarks/logic/min_index.c
Their prep steps:
```
cgeist ./benchmarks/logic/min_value.c \
  -function=min_value -S \
  -raise-scf-to-affine \
  --memref-fullrank -O0 \
  >> min_value.mlir
```
Once we have this, apply the existing HECO pass pipeline to see what kinds of batching naturally occur (or what gaps there are in support for, e.g., inequality comparisons).
- `lwe.pack`: convert a vector of LWE ciphertexts into an RLWE ciphertext.
- `lwe.unpack`: take as input an RLWE ciphertext and an index, and extract the LWE ciphertext corresponding to the plaintext coefficient at that index. Note this only works with the coefficient encoding.
- `lwe.switch_encoding`: take as static attributes the source and target encoding of an RLWE ciphertext, and an SSA value for the RLWE ciphertext to convert in the source encoding, and output an RLWE ciphertext in the target encoding. Lowerings would have to match on the attributes to determine which encodings are legal, and we could also add a verifier to check at parse time.

The `heirSIMDVectorizerPipelineBuilder` will probably work well to start, and then we need a new lowering from `secret` that will insert scheme switching ops at the layers between vectorized and non-vectorized ops. Because this is parity with heir-compiler/heir, we are assuming the batching pipeline figured out the segmentation and now we just need to make it type-compatible.
So we would need something like `secret-to-fhe`, where the RLWE scheme can be selected as an option, I suppose, and code from `secret-to-bgv` is refactored appropriately to be reused here. Eventually I imagine this would have `bgv`, `ckks`, and `bfv` as "fhe" targets.
I think since there is no risk of type conflicts in these lowerings, we can just add a new `cggi-to-openfhe` and `lwe-to-openfhe` to handle the CGGI and scheme switching ops. If not, we could expand `bgv-to-openfhe` to support all schemes and the lwe toolbox dialect ops, and rename it `scheme-api-to-openfhe` or similar.
The scheme switching APIs are at a higher level than pack/unpack/encoding switch (`EvalCKKStoFHEW`), and so we would either need to reimplement the pack/unpack ops using lower-level OpenFHE primitives (risky) or else raise the scheme switching ops to higher-level OpenFHE API calls.
Some notes:
- The `min_value` example from heir-compiler/heir is included as `argmin` with four variants: https://github.com/openfheorg/openfhe-development/blob/main/src/pke/examples/scheme-switching.cpp#L60
- There is an `EvalCKKStoFHEWPrecompute` that can be separated from the evaluation.
- For `compare` and `min`/`max`, see e.g., https://github.com/openfheorg/openfhe-development/blob/13bf46f683483da1fe77f591b98035fa455740c6/src/pke/lib/scheme/ckksrns/ckksrns-schemeswitching.cpp#L1920

After all the unrolling and HECO batching, lower from secret and then apply add-client-interface and scheme-api-to-openfhe.
In a separate pass, match and implement the slot2coeff tricks for doing inner products and euclidean distance. I'm not sure how we would necessarily codegen this unless OpenFHE supports encoding switching as part of their public API.
The biggest part of the proposal above that I don't like is the raising from lower-level pack/unpack ops to higher-level OpenFHE API calls. I would prefer if we had a higher-level `scheme_switch` dialect that could represent high-level API calls to switch between named schemes, one that lowers to pack/unpack as needed, but goes directly to OpenFHE API calls when not lowering to polynomial.
So let me add:
Depends on all the scheme API dialects (`cggi`, `bgv`, `ckks`, etc.) and has dedicated ops for each (source, target) scheme switch transformation that exists.
The second thing I don't like is that we don't have principled program segmentation or optimization involved. If we want to do this "right", then we would insert scheme switching ops as the output of some optimization pass that has a cost model for the cost of scheme switching and the cost of individual ops in each scheme, and determines the right time to scheme switch. It would ideally operate at the `secret` level, similar to the HECO batching pass, but be able to decide, per op, which scheme each op should run in.
A smarter analysis would do this jointly across the entire basic block (say, via a linear program, but that is my default idea), but we could start by doing it for each op individually.
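For concreteness, a toy version of the per-op greedy variant, with entirely made-up costs and names (a real pass would use measured per-scheme op costs):

```python
# Per-op greedy scheme assignment: each op has a cost in the arithmetic
# and boolean schemes; switch schemes only when the op-level saving
# exceeds a fixed switching cost. All numbers are illustrative only.

SWITCH_COST = 10

COSTS = {            # op -> (arithmetic cost, boolean cost)
    "add": (1, 5),
    "mul": (2, 8),
    "cmp": (50, 3),  # comparisons are the classic boolean-FHE win
}

def assign_schemes(ops):
    """Return a plan of (op, scheme) entries with 'switch' ops inserted."""
    plan, current = [], "arith"
    for op in ops:
        arith_c, bool_c = COSTS[op]
        best = "arith" if arith_c <= bool_c else "bool"
        saving = abs(arith_c - bool_c)
        if best != current and saving > SWITCH_COST:
            plan.append(("switch", best))
            current = best
        plan.append((op, current))
    return plan

print(assign_schemes(["add", "cmp", "mul"]))
# the cmp justifies a switch to boolean; the trailing mul does not
# justify switching back on its own
```

The joint (e.g., linear-program) version would instead pick the scheme assignment minimizing total cost over the whole block, so a run of cheap arithmetic ops after a comparison could collectively justify a switch back.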
I think the main thing will be the scheme switching ideas.
Associated paper: https://eprint.iacr.org/2023/1445