WoutLegiest opened 4 months ago
Thank you for this!
This is definitely on my immediate reading list.
Notes as I read:
They have this expressive language for the packing layout options available to them. The part I'm puzzling over now is: how do they "perform many operations by simply changing the tensor layout (the permutation of index bits), without changing the underlying ciphertexts." It makes it sound like changes in packing are free, when of course they are not.
[edit]: section 5.3 makes it clearer that these are not entirely free operations. Their "stride" operation, for example, lowers to masking to drop the omitted values (which adds multiplicative depth). They still claim some of their ops are free (e.g., drop_dim: discard a dimension of size 1), but it's not fully clear to me how.
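My current guess at why drop_dim can be free: a layout maps each tensor-index bit to a slot-index bit, and a dimension of size 1 needs zero index bits, so removing it changes only the layout metadata, not where any value lives. A toy sketch of that reading (this is my model, not the paper's actual IR):

```python
import itertools

def index_bits(idx, shape):
    """Concatenate the binary digits of each coordinate; a dimension of
    size 1 contributes zero index bits."""
    bits = []
    for coord, dim in zip(idx, shape):
        nbits = (dim - 1).bit_length()   # 0 bits when dim == 1
        bits += [(coord >> b) & 1 for b in range(nbits)]
    return tuple(bits)

shape_a = (4, 1, 4)   # 16 logical elements, middle dim has size 1
shape_b = (4, 4)      # the same tensor after drop_dim

slots_a = {idx: index_bits(idx, shape_a)
           for idx in itertools.product(*map(range, shape_a))}
slots_b = {idx: index_bits(idx, shape_b)
           for idx in itertools.product(*map(range, shape_b))}

# Dropping the size-1 dimension leaves every element's slot bits unchanged,
# so no ciphertext operation is needed:
for (i, _, k), bits in slots_a.items():
    assert bits == slots_b[(i, k)]
```

By the same logic, stride is *not* free: it genuinely discards values at certain slot positions, which is why it lowers to a mask (and costs multiplicative depth).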
In sec 5.4 they describe their layout assignment pass, which does a forward dataflow analysis.
Then, in a second backward pass (or maybe during the same pass), it greedily hoists the layout conversion ops as early as possible, so long as doing so reduces the cost of layout conversion. They apparently have a cost model for how expensive layout conversions are, based on the number of "rotate groups" that arise when a layout conversion is implemented as a permutation on the underlying data.
To that point, another big part of the paper is actually implementing layout conversions (5.5), which does something I haven't fully grokked yet: decomposing a permutation into stages of groups of rotations, with some heuristics to decide how many stages to use. They cite section 3 of "Algorithms in HElib" ("Permutations and Shift-Networks"), which seems to give a general algorithm for computing the optimal decomposition of a permutation into shifts. This would be another good thing for us to implement.
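Here's my understanding of the single-stage version of this, as a toy simulation (hypothetical sketch, not their code): to apply a permutation to a packed vector using only rotations and plaintext masks, group elements by how far they need to rotate; each distinct rotation amount is one "rotate group" costing one mask plus one rotation.

```python
from collections import defaultdict

def rotate(vec, r):
    """Cyclic rotation: slot i receives vec[(i + r) % n]."""
    n = len(vec)
    return [vec[(i + r) % n] for i in range(n)]

def apply_permutation(vec, perm):
    """perm[i] = source slot whose value should land in slot i.
    Returns the permuted vector and the number of rotate groups used."""
    n = len(vec)
    groups = defaultdict(list)            # rotation amount -> destination slots
    for dst, src in enumerate(perm):
        groups[(src - dst) % n].append(dst)
    out = [0] * n
    for amt, dsts in groups.items():      # one masked rotation per group
        rotated = rotate(vec, amt)
        for d in dsts:                    # stands in for multiply-by-0/1 mask
            out[d] = rotated[d]
    return out, len(groups)
```

The group count here is the cost metric I think their layout-conversion cost model is built on. A worst-case permutation needs up to n groups in one stage; my reading is that the HElib-style shift networks cut this down by decomposing the permutation into several stages, each with few rotation amounts.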
In 6.2 they define a bootstrap-placement algorithm based on dynamic programming: partition the dataflow graph by multiplicative depth, and then insert bootstrap operations for either all nodes at a given level or no nodes at that level. This requires a cost model of the individual operations (not just bootstrap); they get theirs from CraterLake's spec.
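A minimal sketch of how I read the level-at-a-time DP (toy uniform cost model, not CraterLake's numbers; `budget` stands in for how many multiplicative levels a ciphertext survives between bootstraps):

```python
def place_bootstraps(nodes_per_level, budget, boot_cost):
    """nodes_per_level[d]: ciphertexts alive at multiplicative depth d.
    budget: max levels a ciphertext may traverse between bootstraps (>= 1).
    Bootstrapping every node at level d costs nodes_per_level[d] * boot_cost.
    Returns (min total cost, sorted list of levels to bootstrap at)."""
    L = len(nodes_per_level)
    # dp maps "level of most recent bootstrap" -> (min cost, previous bootstrap)
    dp = {-1: (0.0, None)}                 # level -1: inputs start fresh
    for d in range(L):
        dp[d] = min((dp[p][0] + nodes_per_level[d] * boot_cost, p)
                    for p in list(dp) if 0 < d - p <= budget)
    # the final bootstrap must leave enough budget to reach the outputs
    _, last = min((dp[p][0], p) for p in dp if (L - 1) - p <= budget)
    cost = dp[last][0] if last >= 0 else 0.0
    plan = []
    while last is not None and last >= 0:
        plan.append(last)
        last = dp[last][1]
    return cost, sorted(plan)
```

For example, with 5 depth levels holding [4, 2, 6, 1, 3] live ciphertexts and a budget of 2 levels, the cheapest plan bootstraps at the sparse levels 1 and 3. The all-or-nothing choice per level is what keeps the DP state small (one entry per level rather than per node).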
It's clear that there is a lot of good stuff in this paper to implement. CC @lawrencekhlim who is thinking about Halevi-Shoup packing. It might be too much for the remainder of his internship to tackle this entire paper (especially since the code is not public), but it might be worth figuring out how we could implement Halevi-Shoup in a way that is future-proof for these techniques.
https://people.csail.mit.edu/devadas/pubs/pldi24_fhelipe.pdf
The paper builds further on top of HECO for optimal data layout, and improves on FHE-Booster for finding optimal bootstrap placement in CKKS contexts.
Abstract: Fully Homomorphic Encryption (FHE) enables computing on encrypted data, letting clients securely offload computation to untrusted servers. While enticing, FHE has two key challenges that limit its applicability: it has high performance overheads (10,000× over unencrypted computation) and it is extremely hard to program. Recent hardware accelerators and algorithmic improvements have reduced FHE’s overheads and enabled large applications to run under FHE. These large applications exacerbate FHE’s programmability challenges. Writing FHE programs directly is hard because FHE schemes expose a restrictive, low-level interface that prevents abstraction and composition. Specifically, FHE requires packing encrypted data into large vectors (tens of thousands of elements long), FHE provides limited operations on these vectors, and values have noise that grows with each operation, which creates unintuitive performance tradeoffs. As a result, translating large applications, like neural networks, into efficient FHE circuits takes substantial tedious work. We address FHE’s programmability challenges with the Fhelipe FHE compiler. Fhelipe exposes a simple, numpy-style tensor programming interface, and compiles high-level tensor programs into efficient FHE circuits. Fhelipe’s key contribution is automatic data packing, which chooses data layouts for tensors and packs them into ciphertexts to maximize performance. Our novel framework considers a wide range of layouts and optimizes them analytically. This lets Fhelipe compile large FHE programs efficiently, unlike prior FHE compilers, which either use inefficient layouts or do not scale beyond tiny programs. We evaluate Fhelipe on both a state-of-the-art FHE accelerator and a CPU. Fhelipe is the first compiler that matches or exceeds the performance of large hand-optimized FHE applications, like deep neural networks, and outperforms a state-of-the-art FHE compiler by gmean 18.5×. 
At the same time, Fhelipe dramatically simplifies programming, reducing code size by 10×–48×.