microsoft / SEAL

Microsoft SEAL is an easy-to-use and powerful homomorphic encryption library.
https://www.microsoft.com/en-us/research/group/cryptography-research/
MIT License

Increasing the maximum number of levels #107

Closed jon-chuang closed 4 years ago

jon-chuang commented 4 years ago

As a continuation of the question about 128-bit words, I want to get some feedback on the possibility of increasing N beyond what SEAL supports for now.

For instance, in the context of CKKS at least, N=2^17 or N=2^18 could be accessible. At 16 bytes per 128-bit integer, N=2^18 with n=2^17 vector slots would yield plaintext vectors of size 2 MB. This may be useful, for instance, when processing a very large dataset or a very large image. From what I understand, with the default choice of coeff_modulus at least, the largest N for 128-bit security right now is N=2^16. I understand there is non-linear scaling in the complexity of the operations performed (for instance, the keyswitch key sizes grow as something like O(N^3)), so what I propose may not be suitable.
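As a quick sanity check on the sizes above, here is a back-of-the-envelope sketch (not SEAL code, just the arithmetic from this paragraph, assuming n = N/2 CKKS slots and 16 bytes per 128-bit slot value):

```python
# Back-of-the-envelope CKKS plaintext vector sizes.
# Assumes n = N/2 slots per plaintext and 16 bytes per 128-bit slot.
def plaintext_vector_bytes(log2_N: int, bytes_per_slot: int = 16) -> int:
    n_slots = 2 ** (log2_N - 1)  # CKKS packs N/2 slots per plaintext
    return n_slots * bytes_per_slot

print(plaintext_vector_bytes(18))  # N=2^18 -> 2^17 slots * 16 B = 2 MiB
```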

Say we have k = key_modulus.size() = 4*15 = 60, resulting in q < 2^3200 (the coeff_modulus size seems to scale approximately linearly with n), so we have 60 levels, much larger than the 8 or 15 used now with N=2^14 or N=2^16. From the formula given in the HEAX paper, with n=2^17 slots, 59*60 = 3540 such vectors at 16 bytes per slot give a key size of approximately 7.4 GB. This is in contrast to the 14*15*2^15*16 B ≈ 110 MB keyswitch key sizes for n=2^15. This could possibly be mitigated either by pipelining over the keyswitch outer loop from an intermediate fast memory, say Intel Optane, or by spreading the computation across multiple devices that have a high-speed interconnect between them. Can someone who worked with the HEAX team comment on how much of a bottleneck reading from DDR memory to the chip was for n=2^14? I am also unsure how many keyswitch keys need to be generated for a given N or k. For every 2x increase in the max circuit depth, we get an 8x increase in the key sizes.
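The keyswitch-key estimate above can be reproduced with a small sketch (again just the counting from this paragraph, (k-1)*k vectors of n slots, not a statement about SEAL's actual key layout):

```python
# Rough keyswitch-key size: (k-1)*k polynomial vectors of n slots,
# each slot `word_bytes` wide (16 B for 128-bit words).
def kskey_bytes(k: int, log2_n_slots: int, word_bytes: int = 16) -> int:
    return (k - 1) * k * (2 ** log2_n_slots) * word_bytes

print(kskey_bytes(60, 17) / 1e9)  # k=60, n=2^17 -> ~7.4 GB
print(kskey_bytes(15, 15) / 1e6)  # k=15, n=2^15 -> ~110 MB
```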

Something exciting about this is that with k=60, one could possibly evaluate neural networks of depth 60 with some significant memory overhead but relatively insignificant computational overhead, at the cost of needing to batch 4x as much data into a single SIMD vector. For instance, Darknet-53, the backbone of the latest YOLOv3 model, could probably fit within this budget.

For N=2^19, we get keyswitch key sizes of 60 GB, which can still fit into a large set of DRAMs, and we can evaluate circuits of something like 120 levels, which is about as high as a deep ResNet would go. Furthermore, evaluating something like an LSTM may also become feasible, for instance for private speech recognition (or something more sensitive).

If we consider the NTT to be the key bottleneck for performance, this scales as n log_2 n, so we get 17/15 ≈ 1.13x computational overhead per slot. This is basically nothing compared to bootstrapping, which I understand is several orders of magnitude more expensive than keyswitching (can someone give an estimate? From what I understand it's about 1000x the clock cycles?). Keyswitch overhead goes up as k^2, so we get 59*60/(14*15) ≈ 16.9x overhead. As the rest of the operations are pointwise, folding in the ≈1.13x inner-loop NTT factor gives a total keyswitch overhead of roughly 19x, which is still significantly less than bootstrapping.
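The two overhead ratios in this paragraph are easy to check directly (this just replays the arithmetic, under the stated assumptions that NTT cost per slot scales with log2(n) and keyswitch cost scales as (k-1)*k):

```python
# Per-slot NTT overhead going from n=2^15 to n=2^17:
# n*log2(n) work over n slots leaves a log2(n) factor per slot.
ntt_overhead = 17 / 15  # ~1.13x

# Keyswitch work scaling as (k-1)*k: k=60 vs k=15.
ks_overhead = (59 * 60) / (14 * 15)  # ~16.9x

print(round(ntt_overhead, 2), round(ks_overhead, 1))
```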

I am not sure how accurate the calculations I have performed are. Are there any references to understand the security level and noise growth? I suppose these can be gleaned from the original papers on CKKS/RNS-CKKS?

Returning to EVA, I am wondering if it can help optimise bootstrapping as well, provided it is ensured (or it determines via microbenchmarking) that plaintexts and outputs lie within a certain range.

I am actually quite interested in pursuing the evaluation of depth ~60 circuits as a research project working with the SEAL, HEAX and CHET/EVA teams; it is primarily a systems challenge. We could probably start with deep CNNs. Then we could move on to LSTMs, approximating the sigmoid gating with polynomials under input range restrictions (using the analysis performed by EVA) and doing a rescaling analysis, and similarly for Transformers and softmax. I am currently figuring out how to integrate GPU support into SEAL. With multi-GPU support using something like a combination of RDMA and MPI/GPUDirect (for multi-node), and with the right hardware setup with very large DRAMs, this seems within reach.

jon-chuang commented 4 years ago

Actually, I made some mistakes - the maximum N we use now is N=2^15, and if we were to use 64-bit RNS primes and words, then we only need N=2^17, which means KSK sizes of about 500MB. The overhead calculations are still approximately correct. Basically, my point still stands.

WeiDaiWD commented 4 years ago

Keep in mind that HEAX does not work for 2^15 yet. We do not have a feasible FPGA accelerator for 2^15 or larger in the field yet.

jon-chuang commented 4 years ago

Is the reason the size of the board? I suppose that with more twiddle factors and a larger keyswitch circuit required to process more keys in parallel, there is not enough memory and silicon. What is the possibility of connecting multiple FPGA boards together to overcome this? I consider 2^17 the gold standard for achieving practical homomorphically-encrypted deep learning.

Actually, I have a question about HEAX: how is the type conversion to and from 54-bit integers done?

WeiDaiWD commented 4 years ago

Memory limit. Multiple boards are possible. DSPs only support 54-bit.

jon-chuang commented 4 years ago

By memory limit, do you mean DRAM or BRAM/SRAM?

WeiDaiWD commented 4 years ago

BRAM, in fact. The amount of data in BRAM is related to the polynomial ring degree.

jon-chuang commented 4 years ago

I see. I was surprised to see that the very large Stratix GX 2800 chip only has 229 Mb of SRAM. I am hoping to use a Xilinx Alveo U280, which has 344 Mb of SRAM. Maybe this is enough to fit 2^15 slots for both the coefficients and the twiddle factors in the NTT modules.

Further, it has 8 GB of HBM, although the latency of HBM is similar to DRAM. Could you comment on whether any staging was used to load keyswitch keys from DRAM in HEAX? It seems one could lose more than a few cycles without it, though staging would expend some BRAM. (I recall quadruple buffering was used? Or was that solely for the input polynomial?)

Actually, can I ask if only a single keyswitch module was synthesized for the largest parameter set on the GX 2800?

WeiDaiWD commented 4 years ago

I believe more than one copy of the twiddle factors exists in BRAM.

Keyswitching keys are stored in DRAM except for the smallest parameter set. And staging is used to load keyswitching keys from DRAM.

For the largest parameter set, I think only one keyswitch module was synthesized but I'm not entirely certain. For questions regarding HEAX, Sadegh Riazi (author) can answer your questions better.