microsoft / SEAL

Microsoft SEAL is an easy-to-use and powerful homomorphic encryption library.
https://www.microsoft.com/en-us/research/group/cryptography-research/
MIT License

Very high computation cost of ciphertext-ciphertext multiplication for BFV #501

Open mhmughees opened 2 years ago

mhmughees commented 2 years ago

I was doing some SEAL testing on my MacBook Pro (i7) and got the following results.

Coefficient modulus = 130 bits ({30, 40, 60}) and polynomial degree = 8192

Running in batch mode

I want to understand why ct-ct multiplication is so much more expensive; I was expecting it to be only 4-10x more expensive due to the tensor multiplication.
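
For reference, here is a minimal sketch of how parameters like these might be set up in SEAL (assuming the SEAL 3.6+ API and a 20-bit batching-compatible plain modulus; this is not necessarily the poster's exact configuration):

```cpp
#include "seal/seal.h"
#include <iostream>

using namespace seal;

int main()
{
    // Assumed setup: N = 8192, coeff_modulus of {30, 40, 60}-bit primes (130 bits total),
    // and a plain modulus chosen to support batching. If relinearization or rotation keys
    // are needed, one more prime would typically be added to serve as the special prime.
    EncryptionParameters parms(scheme_type::bfv);
    size_t poly_modulus_degree = 8192;
    parms.set_poly_modulus_degree(poly_modulus_degree);
    parms.set_coeff_modulus(CoeffModulus::Create(poly_modulus_degree, { 30, 40, 60 }));
    parms.set_plain_modulus(PlainModulus::Batching(poly_modulus_degree, 20));

    SEALContext context(parms);
    std::cout << "Total coeff_modulus bit count: "
              << context.first_context_data()->total_coeff_modulus_bit_count() << std::endl;
    return 0;
}
```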

mhmughees commented 2 years ago

I added more detail to my question above for clarity.

Pro7ech commented 2 years ago

Multiplication in BFV requires more than tensoring (unlike CKKS). Because of how the plaintext is encoded (in the higher bits), the tensoring must be done without modular reduction and is followed by a large rescaling (Q/t). To do the tensoring without modular reduction, the RNS basis is increased to QQ', where Q' ~= QN. It is then followed by an RNS quantization by Q/t. The basis extension and RNS quantization both scale quadratically with the number of moduli (like key-switching). Additionally, the tensoring is done in an RNS basis that is twice the size of the original basis, so it is also by itself more expensive.
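
In symbols, the pipeline described above is roughly (a hedged summary using the textbook BFV multiplication formula; the exact RNS details depend on the implementation):

$$
d_i = \Big\lfloor \tfrac{t}{Q}\,(c \otimes c')_i \Big\rceil \bmod Q,
\qquad
c \otimes c' = (c_0 c'_0,\; c_0 c'_1 + c_1 c'_0,\; c_1 c'_1)
\ \text{computed over } \mathbb{Z}_{QQ'}[X]/(X^N+1),\ \ Q' \approx Q \cdot N,
$$

i.e. the two ciphertexts are first extended from the RNS basis of Q to that of QQ', tensored there without overflow, and only then divided by Q/t and reduced back to the basis of Q.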

mhmughees commented 2 years ago

Okay, thanks for the explanation. So does around 100x more time for a ciphertext-ciphertext mul sound right in the ballpark to you?

Pro7ech commented 2 years ago

Vs. plaintext, sure, because to multiply with a plaintext you do not need to scale by Q/t, so it reduces to two polynomial multiplications in Z_{Q}[X]/(X^{N}+1) (2 * (NTT -> MulVec -> iNTT)), if you assume that the plaintext is pre-encoded in the NTT domain.
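
For concreteness, a minimal timing sketch along these lines in SEAL (the parameters, the 20-bit plain modulus, and the single-run timing are assumptions for illustration, not the exact setup discussed in this thread):

```cpp
#include "seal/seal.h"
#include <chrono>
#include <iostream>
#include <vector>

using namespace seal;
using namespace std::chrono;

int main()
{
    // Assumed parameters: N = 8192 with SEAL's default BFV coefficient modulus.
    EncryptionParameters parms(scheme_type::bfv);
    size_t n = 8192;
    parms.set_poly_modulus_degree(n);
    parms.set_coeff_modulus(CoeffModulus::BFVDefault(n));
    parms.set_plain_modulus(PlainModulus::Batching(n, 20));
    SEALContext context(parms);

    KeyGenerator keygen(context);
    PublicKey pk;
    keygen.create_public_key(pk);
    Encryptor encryptor(context, pk);
    Evaluator evaluator(context);
    BatchEncoder encoder(context);

    // Encrypt two batched plaintexts.
    std::vector<uint64_t> values(encoder.slot_count(), 3ULL);
    Plaintext pt;
    encoder.encode(values, pt);
    Ciphertext ct1, ct2;
    encryptor.encrypt(pt, ct1);
    encryptor.encrypt(pt, ct2);

    // ct-ct multiply: basis extension + tensoring + division by Q/t (relinearization not included).
    Ciphertext prod_ct;
    auto t0 = high_resolution_clock::now();
    evaluator.multiply(ct1, ct2, prod_ct);
    auto t1 = high_resolution_clock::now();

    // ct-pt multiply with plaintext and ciphertext pre-transformed to the NTT domain:
    // the multiplication itself is then just coefficient-wise products mod Q.
    Plaintext pt_ntt = pt;
    evaluator.transform_to_ntt_inplace(pt_ntt, ct1.parms_id());
    Ciphertext ct1_ntt = ct1;
    evaluator.transform_to_ntt_inplace(ct1_ntt);
    Ciphertext prod_pt;
    auto t2 = high_resolution_clock::now();
    evaluator.multiply_plain(ct1_ntt, pt_ntt, prod_pt);
    auto t3 = high_resolution_clock::now();

    std::cout << "ct-ct multiply:       " << duration_cast<microseconds>(t1 - t0).count() << " us\n";
    std::cout << "ct-pt multiply_plain: " << duration_cast<microseconds>(t3 - t2).count() << " us\n";
    return 0;
}
```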

Out of curiosity, I tested it in Lattigo using the parameters you say you were using (N=8192, #Q=4, #P=1; I assume #P=1 since in SEAL the last modulus is the special prime P):

goos: windows
goarch: amd64
pkg: github.com/tuneinsight/lattigo/v3/bfv
cpu: 12th Gen Intel(R) Core(TM) i9-12900K
MulScalar/LogN=13/#Q=4/#P=1/lvl=3                                  39598 ns/op
Mul/op1=Ciphertext/op2=Ciphertext/LogN=13/#Q=4/#P=1/lvl=3        4722513 ns/op
Mul/op1=Ciphertext/op2=Plaintext//LogN=13/#Q=4/#P=1/lvl=3        3371792 ns/op
Mul/op1=Ciphertext/op2=PlaintextRingT/LogN=13/#Q=4/#P=1/lvl=3    1278644 ns/op
Mul/op1=Ciphertext/op2=PlaintextMul/LogN=13/#Q=4/#P=1/lvl=3       869158 ns/op
Square/LogN=13/#Q=4/#P=1/lvl=3                                   3660549 ns/op
Relin/LogN=13/#Q=4/#P=1/lvl=3                                    1969189 ns/op
RotateRows/LogN=13/#Q=4/#P=1/lvl=3                               2008407 ns/op
RotateCols/LogN=13/#Q=4/#P=1/lvl=3                               2020818 ns/op

But I don't get a factor of 100x; even with larger parameters I only get up to a factor of ~10, so I'm puzzled o7

mhmughees commented 2 years ago

Very interesting, but I didn't get your parameters. I actually updated my first comment to add the details of the coefficient modulus; I state its value without the last special prime. With the updated parameters and on a better AWS machine, each ciphertext-ciphertext mul is 18 milliseconds. I should also mention that the plaintext-ciphertext mul was on NTT'd entries, so that might have improved its value a bit. I see your ciphertext-ciphertext mul is around 4 milliseconds... which is better than SEAL?

Pro7ech commented 2 years ago

If you have both operands already in the NTT domain for the ct-pt mul, then a 100x difference makes sense (I would actually expect more than 100x). But usually BFV ciphertexts are kept outside of the NTT domain if ct-ct multiplications will be involved, because the first and last steps of the tensoring require the ciphertexts to be outside of the NTT domain.
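
As an illustration of that workflow (a hedged sketch reusing the objects from the SEAL snippet above; `evaluator`, `ct1`, `ct2`, and `pt_ntt` are assumed to exist), the cheap plaintext multiplications are done in NTT form and the ciphertext is switched back before any ct-ct multiplication:

```cpp
// Do the cheap ct-pt multiplications in the NTT domain...
evaluator.transform_to_ntt_inplace(ct1);        // coefficient form -> NTT form
evaluator.multiply_plain_inplace(ct1, pt_ntt);  // pointwise products only
evaluator.multiply_plain_inplace(ct1, pt_ntt);  // repeat as needed

// ...then return to coefficient form, which SEAL's BFV ct-ct multiply expects.
evaluator.transform_from_ntt_inplace(ct1);
evaluator.multiply_inplace(ct1, ct2);           // ct-ct multiply (tensoring + scaling)
```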

The parameters I used are N=8192, 4 primes for Q and 1 prime for P (the size of the primes is irrelevant for timings).

There are two reasons why the timings I get are faster. The most likely one is the CPU I used, and the second is that Lattigo takes a different approach to the quantization than SEAL does. SEAL uses a BEHZ-style approach (https://eprint.iacr.org/2016/510.pdf), while Lattigo uses the ModUp and ModDown algorithms from CHKKS (https://eprint.iacr.org/2018/931.pdf).

mhmughees commented 2 years ago

Yeah, I think the processor difference may be playing a role here as well. Also, as you are a developer of Lattigo, can you confirm whether Lattigo uses a 32-bit word size for smaller primes? I can ask this question on the Lattigo GitHub if you prefer.

Pro7ech commented 2 years ago

No it doesn't, at the moment everything uses 64-bit modular reduction.

mhmughees commented 2 years ago

Okay, I just reran the experiment on an AWS c5.2xlarge with AVX-512 enabled.

Coefficient moduli = {58, 60, 40} without the special modulus. N = 8192

and now

CT*CT ~ 9 ms

which is still higher than your value of ~4 ms

Pro7ech commented 2 years ago

I ran a few experiments:

cpu: i9-12900K

SEAL: clang_cl_x64
Lattigo: go 1.18

Scheme: BFV
Params: N=8192, 4 primes Q, 1 prime P

  OP   | SEAL (ms) | Lattigo (ms) |
 MulPt |   1.12    |    0.82      | -> difference probably due to the Montgomery reduction
 MulCt |   7.17    |    4.59      | -> expected, due to the different implementation
 Relin |   1.89    |    1.86      | -> expected, and shows that similar implementations lead to similar timings

I get better timings for SEAL even with more moduli and without HEXL, so I conclude that most of the difference must be attributed to the CPU.