microsoft / EVA

Compiler for the SEAL homomorphic encryption library
MIT License

Multi-core compiler evaluation #8

Status: Open. comidan opened this issue 3 years ago

comidan commented 3 years ago

Hello! Really great work with EVA!

I was trying to use it, but I was unable to make it scale efficiently in time. I checked via htop on Ubuntu 20 that only one core was being used. I am on a VM, so if that is the problem it's okay, but otherwise would it be possible to give the compiled code multi-core support?

I tried to build EVA with the cmake -DUSE_GALOIS=ON . command, but nothing changed.

Thank you for your time!

olsaarik commented 3 years ago

Hi! -DUSE_GALOIS=ON is indeed the right way to enable the multi-core support. I was able to reproduce the behavior on my end (also did not see multiple cores used in a program with ample concurrency), so it seems we have a regression. I'll investigate, thanks for filing the issue!

olsaarik commented 3 years ago

Turns out Galois was defaulting to 1 thread and we were missing a call to galois::setActiveThreads.

This issue should be fixed in main with EVA defaulting to as many threads as there are cores available to the process. I've also exposed a new function set_num_threads in the Python API that lets you control the amount of parallelism.
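
For example, something along these lines should work from Python (a rough sketch; I'm assuming here that set_num_threads is re-exported from the top-level eva package, adjust the import if it lives elsewhere):

import eva

# Cap EVA's Galois-based parallelism at 8 worker threads.
# By default EVA now uses as many threads as there are cores available to the process.
eva.set_num_threads(8)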

comidan commented 3 years ago

Thank you very much for the support @olsaarik !

I just tried cloning EVA and installing it with -DUSE_GALOIS=ON, but when I run my code I don't see any parallelism when looking at htop:

[htop screenshot: only one core is busy]

My code is this:

from eva import EvaProgram, Input, Output, evaluate
from eva.ckks import CKKSCompiler
from eva.seal import generate_keys
from eva.metric import valuation_mse
from common import *
from random import uniform
import numpy as np
import unittest
import math

batch = 2
dim = 32
dim1 = 32

def linear(x, y):
    return np.matmul(x, y) # + bias

def attention(x):
    temp = np.array([i for i in range(dim*dim1)]).reshape(dim, dim1) / 1000
    query = linear(x, temp)
    key = linear(x, temp)
    value = linear(x, temp)
    return query, key, value

def self_attention(query, key, value):
    temp = linear(query, key)

    temp = linear(temp, np.arange(dim*dim).reshape(dim,dim))

    temp = linear(temp, value)
    return temp

def norm(x):
    return linear(x, np.arange(dim*dim).reshape(dim,dim))

def ff(x):
    return linear(linear(x, np.arange(dim*dim).reshape(dim, dim)/1000), np.arange(dim*dim).reshape(dim, dim)/1000)

def classifier(x):
    v = np.arange(dim*2).reshape(dim,2)/1000
    w = np.arange(2*2).reshape(2,2)/1000
    return np.dot(np.dot(x, v), w)

transformer = EvaProgram('transformer', vec_size=dim*dim1)
with transformer:
    x = np.array([Input(f'x{i}') for i in range(batch*dim*dim1)]).reshape(batch, dim, dim1)
    print(x.shape)
    query, key, value = attention(x)
    attn = self_attention(query, key, value)
    out = norm(attn)
    out = ff(out)
    out = out ** 2
    out = norm(out)
    out = out ** 2
    out = classifier(out)
    print(out.shape)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                Output(f'y{i}_{j}_{k}', out[i][j][k])  # use a unique name per output element

transformer.set_input_scales(32)
transformer.set_output_ranges(32)

if __name__ == "__main__":
    compiler = CKKSCompiler(config={'security_level':'128', 'warn_vec_size':'false'})
    compiled, params, signature = compiler.compile(transformer)
    print("poly degree: ", params.poly_modulus_degree)
    print("prime bits: ", params.prime_bits)
    print("rotations: ", params.rotations)

I also tried encapsulating the logic inside the EvaProgram in a loop, as suggested in #6, but only one core was being used. Is there anything wrong? Note that I'm only trying to compile the EvaProgram in order to obtain the CKKS parameters; for now I'm not running any evaluation. I don't know whether the multithreading only applies to evaluating the error of a computation under given parameters, or whether it is hopefully also used when determining the parameters.

Do you have any working multi-core example? It would be very helpful to have an example of using EVA on multiple cores to obtain the required CKKS parameters.

olsaarik commented 3 years ago

Thanks for clarifying your use case. Multi-core evaluation is currently only used for encrypting inputs and the actual evaluation of the program, i.e., public_ctx.encrypt and public_ctx.execute. Using multiple threads for the compilation is not fully straightforward, as most of the work in the compiler is done inside transformation passes that make modifications to the program.
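
For reference, in the usual EVA flow the multi-threaded parts are the encrypt and execute calls, roughly like this (a small sketch along the lines of the README example, not your program):

from eva import EvaProgram, Input, Output
from eva.ckks import CKKSCompiler
from eva.seal import generate_keys

# Tiny example program, just to show where the parallel steps are.
prog = EvaProgram('demo', vec_size=8)
with prog:
    x = Input('x')
    Output('y', 3*x**2 + 5*x - 2)

prog.set_input_scales(30)
prog.set_output_ranges(20)

compiler = CKKSCompiler(config={'security_level': '128'})
compiled, params, signature = compiler.compile(prog)    # single-threaded

public_ctx, secret_ctx = generate_keys(params)
inputs = {'x': [float(i) for i in range(8)]}

enc_inputs = public_ctx.encrypt(inputs, signature)      # multi-threaded
enc_outputs = public_ctx.execute(compiled, enc_inputs)  # multi-threaded
outputs = secret_ctx.decrypt(enc_outputs, signature)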

I took a look at your program (nice re-use of Numpy by the way, I didn't know you could do that!) and it is larger than the programs we've used with EVA before. I exported it into DOT by writing out transformer.to_DOT() and it seems the program results in ~1.8 million instructions with some 663k of those being multiplications. I don't know the parameters required for this, but assuming a single multiplication takes at least 40 ms (which might be true for small parameters), this will take some 7ish hours to evaluate. Which is still totally reasonable :)
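
In case it's useful, this is roughly how I dumped the graph (assuming to_DOT() returns the DOT text as a string):

# Write the program graph out for inspection with Graphviz or similar.
with open('transformer.dot', 'w') as f:
    f.write(transformer.to_DOT())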

I tried playing with the dimensions and on my own laptop with 8 GB of RAM dim and dim1 set to 4 was still reasonably fast, but 8 used enough memory to start swapping and the process slowed down a lot. The Term class in EVA that represents these instructions is reasonably small, so this memory usage seems suspicious.

I'll dig into this more to figure out what is slowing down the compilation for this program. Thanks for the very interesting example!

comidan commented 3 years ago

Good morning/evening @olsaarik , thank you so much for the detailed explanation!

Yes, indeed I tried running the code with smaller sizes like 4 or 8 and it was fast, and I got the parameters, but of course they are not suitable for much larger sizes like 32x32 or more.

Unfortunately I don't have much experience with how CKKS parameters are chosen, and I didn't find much documentation on it. Compared to BFV, the other HE scheme provided in SEAL, I find it not very straightforward to search for, or apply some criterion to choose, the right prime coefficient bit sizes (and how many of them) given a certain polynomial modulus degree. That's why I was leveraging EVA, which is really great!

Ok that's great, I thank you very much!

benpbenp commented 3 years ago

I am also hitting performance limits at the compilation step. With a multiplicative depth of only 1, compile time seems to grow quadratically with the number of operations in the program. Since compilation is just a constant number of traversals through the program, that is unexpected, right?

EDIT: To clarify, the multiplications are all plaintext multiplications. There is also a similar number of ciphertext additions, with an additive depth of 2. (A sketch for reproducing this kind of measurement follows the table below.)

multiplications    compile time (s)
125                2.86
500                15.19
1125               47.56
2000               116.65
3125               246.83
4500               478.75
6125               803
8000               1344.45
10125              2078
12500              3845.89
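
The sketch mentioned above: something along these lines can be used to reproduce the trend (not my exact program, and the scales and vec_size are placeholders):

import time
from eva import EvaProgram, Input, Output
from eva.ckks import CKKSCompiler

def compile_seconds(n_mults, vec_size=1024):
    # One plaintext multiplication and one addition per output,
    # so the multiplicative depth stays at 1.
    prog = EvaProgram(f'bench_{n_mults}', vec_size=vec_size)
    with prog:
        x = Input('x')
        for i in range(n_mults):
            Output(f'y{i}', x * 1.001 + x)
    prog.set_input_scales(30)
    prog.set_output_ranges(30)
    compiler = CKKSCompiler(config={'security_level': '128'})
    start = time.time()
    compiler.compile(prog)
    return time.time() - start

for n in (125, 500, 1125, 2000):
    print(n, round(compile_seconds(n), 2))
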
benpbenp commented 3 years ago

It seems I am able to work around this issue by converting all Python-land constants into EVA Inputs with is_encrypted=False. For example, if x is an encrypted Input, instead of computing 2*x I compute p*x with p = Input("p", is_encrypted=False), and then set p to 2 at evaluation time. @comidan perhaps it is worth trying this in your case? I will try to create a small test case for this and open a new issue.
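
Roughly like this (a minimal sketch; the program name, vec_size, and scales are placeholders):

from eva import EvaProgram, Input, Output

prog = EvaProgram('plain_constant_demo', vec_size=8)
with prog:
    x = Input('x')                      # encrypted input
    p = Input('p', is_encrypted=False)  # plaintext input standing in for the constant 2
    Output('y', p * x)                  # instead of 2 * x

prog.set_input_scales(30)
prog.set_output_ranges(30)

# At evaluation time, p is supplied like any other input, e.g.
# inputs = {'x': [...], 'p': [2.0] * 8}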