WoutLegiest opened 4 months ago
Thank you for this!
This is definitely on my immediate reading list.
Notes as I read:
They have this expressive language for the packing layout options available to them. The part I'm puzzling over now is: how do they "perform many operations by simply changing the tensor layout (the permutation of index bits), without changing the underlying ciphertexts." It makes it sound like changes in packing are free, when of course they are not.
[edit]: section 5.3 makes it clearer that these are not entirely free operations. Their "stride" operation, for example, lowers to masking to drop the omitted values (which adds multiplicative depth). They still claim some of their ops are free (e.g., drop_dim: discard a dimension of size 1), but it's not fully clear to me how.
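My current guess at why drop_dim can be free: a layout maps each tensor-index bit to a slot-index bit, and a dimension of size 1 needs zero index bits, so removing it changes only the layout metadata, not where any value lives. A toy sketch of that reading (this is my model, not the paper's actual IR):

```python
import itertools

def index_bits(idx, shape):
    """Concatenate the binary digits of each coordinate; a dimension of
    size 1 contributes zero index bits."""
    bits = []
    for coord, dim in zip(idx, shape):
        nbits = (dim - 1).bit_length()   # 0 bits when dim == 1
        bits += [(coord >> b) & 1 for b in range(nbits)]
    return tuple(bits)

shape_a = (4, 1, 4)   # 16 logical elements, middle dim has size 1
shape_b = (4, 4)      # the same tensor after drop_dim

slots_a = {idx: index_bits(idx, shape_a)
           for idx in itertools.product(*map(range, shape_a))}
slots_b = {idx: index_bits(idx, shape_b)
           for idx in itertools.product(*map(range, shape_b))}

# Dropping the size-1 dimension leaves every element's slot bits unchanged,
# so no ciphertext operation is needed:
for (i, _, k), bits in slots_a.items():
    assert bits == slots_b[(i, k)]
```

By the same logic, stride is *not* free: it genuinely discards values at certain slot positions, which is why it lowers to a mask (and costs multiplicative depth).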
In sec 5.4 they describe their layout assignment pass, which does a forward dataflow analysis.
Then, in a second backward pass (or maybe during the same pass), it greedily hoists the layout conversion ops as early as possible, so long as doing so reduces the cost of layout conversion. They apparently have a cost model for how expensive layout conversions are, based on the number of "rotate groups" that arise when a layout conversion is implemented as a permutation on the underlying data.
To that point, another big part of the paper is actually implementing layout conversions (5.5), which does something I haven't fully grokked yet: decomposing a permutation into stages of groups of rotations, with some heuristics to decide how many stages to use. They cite section 3 of "Algorithms in HElib" ("Permutations and Shift-Networks"), which seems to give a general algorithm for computing the optimal decomposition of a permutation into shifts. This would be another good thing for us to implement.
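Here's my understanding of the single-stage version of this, as a toy simulation (hypothetical sketch, not their code): to apply a permutation to a packed vector using only rotations and plaintext masks, group elements by how far they need to rotate; each distinct rotation amount is one "rotate group" costing one mask plus one rotation.

```python
from collections import defaultdict

def rotate(vec, r):
    """Cyclic rotation: slot i receives vec[(i + r) % n]."""
    n = len(vec)
    return [vec[(i + r) % n] for i in range(n)]

def apply_permutation(vec, perm):
    """perm[i] = source slot whose value should land in slot i.
    Returns the permuted vector and the number of rotate groups used."""
    n = len(vec)
    groups = defaultdict(list)            # rotation amount -> destination slots
    for dst, src in enumerate(perm):
        groups[(src - dst) % n].append(dst)
    out = [0] * n
    for amt, dsts in groups.items():      # one masked rotation per group
        rotated = rotate(vec, amt)
        for d in dsts:                    # stands in for multiply-by-0/1 mask
            out[d] = rotated[d]
    return out, len(groups)
```

The group count here is the cost metric I think their layout-conversion cost model is built on. A worst-case permutation needs up to n groups in one stage; my reading is that the HElib-style shift networks cut this down by decomposing the permutation into several stages, each with few rotation amounts.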
In 6.2 they define a bootstrap-placement algorithm based on dynamic programming: partition the dataflow graph by multiplicative depth, and then insert bootstrap operations for either all nodes at a given level or no nodes at that level. This requires a cost model of the individual operations (not just bootstrap); they get theirs from CraterLake's spec.
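A minimal sketch of how I read the level-at-a-time DP (toy uniform cost model, not CraterLake's numbers; `budget` stands in for how many multiplicative levels a ciphertext survives between bootstraps):

```python
def place_bootstraps(nodes_per_level, budget, boot_cost):
    """nodes_per_level[d]: ciphertexts alive at multiplicative depth d.
    budget: max levels a ciphertext may traverse between bootstraps (>= 1).
    Bootstrapping every node at level d costs nodes_per_level[d] * boot_cost.
    Returns (min total cost, sorted list of levels to bootstrap at)."""
    L = len(nodes_per_level)
    # dp maps "level of most recent bootstrap" -> (min cost, previous bootstrap)
    dp = {-1: (0.0, None)}                 # level -1: inputs start fresh
    for d in range(L):
        dp[d] = min((dp[p][0] + nodes_per_level[d] * boot_cost, p)
                    for p in list(dp) if 0 < d - p <= budget)
    # the final bootstrap must leave enough budget to reach the outputs
    _, last = min((dp[p][0], p) for p in dp if (L - 1) - p <= budget)
    cost = dp[last][0] if last >= 0 else 0.0
    plan = []
    while last is not None and last >= 0:
        plan.append(last)
        last = dp[last][1]
    return cost, sorted(plan)
```

For example, with 5 depth levels holding [4, 2, 6, 1, 3] live ciphertexts and a budget of 2 levels, the cheapest plan bootstraps at the sparse levels 1 and 3. The all-or-nothing choice per level is what keeps the DP state small (one entry per level rather than per node).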
It's clear that there is a lot of good stuff in this paper to implement. CC @lawrencekhlim who is thinking about Halevi-Shoup packing. It might be too much for the remainder of his internship to tackle this entire paper (especially since the code is not public), but it might be worth figuring out how we could implement Halevi-Shoup in a way that is future-proof for these techniques.
https://people.csail.mit.edu/devadas/pubs/pldi24_fhelipe.pdf
The paper builds further on top of HECO for optimal data layout, and improves on FHE-Booster for finding optimal bootstrap placement in CKKS contexts.
Abstract: Fully Homomorphic Encryption (FHE) enables computing on encrypted data, letting clients securely offload computation to untrusted servers. While enticing, FHE has two key challenges that limit its applicability: it has high performance overheads (10,000× over unencrypted computation) and it is extremely hard to program. Recent hardware accelerators and algorithmic improvements have reduced FHE’s overheads and enabled large applications to run under FHE. These large applications exacerbate FHE’s programmability challenges. Writing FHE programs directly is hard because FHE schemes expose a restrictive, low-level interface that prevents abstraction and composition. Specifically, FHE requires packing encrypted data into large vectors (tens of thousands of elements long), FHE provides limited operations on these vectors, and values have noise that grows with each operation, which creates unintuitive performance tradeoffs. As a result, translating large applications, like neural networks, into efficient FHE circuits takes substantial tedious work. We address FHE’s programmability challenges with the Fhelipe FHE compiler. Fhelipe exposes a simple, numpy-style tensor programming interface, and compiles high-level tensor programs into efficient FHE circuits. Fhelipe’s key contribution is automatic data packing, which chooses data layouts for tensors and packs them into ciphertexts to maximize performance. Our novel framework considers a wide range of layouts and optimizes them analytically. This lets Fhelipe compile large FHE programs efficiently, unlike prior FHE compilers, which either use inefficient layouts or do not scale beyond tiny programs. We evaluate Fhelipe on both a state-of-the-art FHE accelerator and a CPU. Fhelipe is the first compiler that matches or exceeds the performance of large hand-optimized FHE applications, like deep neural networks, and outperforms a state-of-the-art FHE compiler by gmean 18.5×. 
At the same time, Fhelipe dramatically simplifies programming, reducing code size by 10×–48×.