microsoft / EVA

Compiler for the SEAL homomorphic encryption library
MIT License

Perform inference on multiple data in SIMD fashion #6

Open Hjeljeli opened 3 years ago

Hjeljeli commented 3 years ago

Hi Olli, I have a beginner's question. I managed to successfully compile and execute code that runs inference on some data 'x' (a 1D vector) with a simple linear regression model (composed of a 1D vector 'weight' and a single value 'bias'). Here is the code:

from eva import *
from eva.std.numeric import horizontal_sum
inference = EvaProgram('Inference', vec_size=len(weight_list))
with inference:
    x = Input('x', True)
    weight = Input('weight', True)
    bias = Input('bias', True)
    Output('out', horizontal_sum(x*weight)+bias)

inference.set_output_ranges(60)
inference.set_input_scales(60)

from eva.ckks import *
compiler = CKKSCompiler()
compiled_inference, params, signature = compiler.compile(inference)

How can I perform inference on a batch of data 'x' (x becomes a 2D vector) in SIMD fashion? Is this approach more efficient than calling the previous program multiple times?

olsaarik commented 3 years ago

Hi! This is actually an interesting question with a few different answers depending on how many weights there are and how large you can make the batch size. Let's assume the encryption parameter selection for this program says at least 4096 slots are required for security. That means that any vec_size up to 4096 will have the same cost per homomorphic operation.

Now, if you can make your batch size 4096 (or close to it), you could use a separate ciphertext for each feature and place the inputs/weights for the ith sample of the batch into the ith slot of each ciphertext. The program will then look like a "normal" program where you have a separate input for each feature and weight, and you just loop over a list of these to perform the multiplications and additions. This will give you the highest throughput, but will have high latency if you have many weights, and you need to have enough samples available.
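
For concreteness, here is a rough sketch of what this could look like for your linear regression example (num_features and batch_size are illustrative names, not part of EVA):

from eva import EvaProgram, Input, Output

num_features = 8     # illustrative: number of weights/features
batch_size = 4096    # one sample per slot

inference = EvaProgram('BatchedInference', vec_size=batch_size)
with inference:
    # One ciphertext per feature: slot i of xs[f] holds feature f of sample i
    xs = [Input(f'x{f}') for f in range(num_features)]
    ws = [Input(f'weight{f}') for f in range(num_features)]
    bias = Input('bias')
    # Slot-wise multiply-accumulate: slot i ends up holding the prediction for sample i
    acc = xs[0] * ws[0]
    for f in range(1, num_features):
        acc = acc + xs[f] * ws[f]
    Output('out', acc + bias)

Note that this version needs no rotations at all, since each sample stays in its own slot throughout.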

On the other hand, if you are interested in having low latency but would still like to fill out the 4096 available slots for efficiency, you can pack multiple x and weight vectors into a single ciphertext (like you suggested). To implement this you should place the different data vectors into x one after the other, possibly padding them with zeros so that their length is a power of two. For weight you should place copies of the weights at the same offsets as the data. Then x and weight will be multiplied pointwise, as in your original program. For the horizontal_sum part you will have to modify the logic, as now you only want to sum up these "subvectors". The current implementation of it is:

def horizontal_sum(x):
    """ Sum together all elements of a vector. The result is replicated in all
        elements of the returned vector.

        Parameters
        ----------
        x : an EVA compatible type (see eva.py_to_eva)
            The vector to sum together
        """

    x = py_to_eva(x)
    i = 1
    while i < x.program.vec_size:
        y = x << i
        x = x + y
        i <<= 1
    return x

This rotates and sums the vector with offsets 1,2,4,8,... until the whole vector has been summed together. You can use the same logic, but instead of summing up to vec_size sum up to whatever is the length of your "subvector". Note that instead of producing the sum of the subvector into all slots, only one slot will hold the correct result, while the other will be garbage. This approach is likely to give you the best latency, while still benefitting from SIMD.
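
For example, the modification could look roughly like this (subvector_sum is an illustrative name, not part of EVA; this assumes subvec_len is a power of two):

from eva import py_to_eva

def subvector_sum(x, subvec_len):
    """ Sum each contiguous subvector of length subvec_len independently.
        Only the first slot of each subvector holds the correct sum; the
        remaining slots hold garbage. (A sketch, not part of EVA.) """
    x = py_to_eva(x)
    i = 1
    while i < subvec_len:   # stop at the subvector length, not vec_size
        y = x << i
        x = x + y
        i <<= 1
    return x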

Also note that the second approach is only applicable if you have space to make copies of the data/weights. EVA will tell you if you do by emitting the following warning while compiling your non-batched program:

Program specifies vector size %i while at least %i slots are required for security.
This does not affect correctness, as the smaller vector size will be transparently emulated.
However, using a vector size up to %i would come at no additional cost.

These two approaches represent two ends of a design space. For example, if you only have 256 samples for a batch and want the best throughput, you might consider a variant of the first approach, but packing 16 values from each sample into each vector (256 samples × 16 values = 4096 slots).

Finally, a variation on the approach of just calling the previous program multiple times is to add a loop around the logic inside your EvaProgram (plus do some numbering of inputs/outputs). While this does not benefit from SIMD, it does allow you to use EVA's multicore runtime to parallelize the execution. For this EVA would have to be configured with:

cmake -DUSE_GALOIS=ON .
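
With that in place, here is a rough sketch of what the looped version of your program could look like, with numbered inputs and outputs (num_copies is an illustrative name; weight_list is from your original program):

from eva import EvaProgram, Input, Output
from eva.std.numeric import horizontal_sum

num_copies = 8  # illustrative: how many independent inferences to pack into one program

inference = EvaProgram('Inference', vec_size=len(weight_list))
with inference:
    weight = Input('weight')
    bias = Input('bias')
    for i in range(num_copies):
        x = Input(f'x{i}')  # one numbered input per sample
        Output(f'out{i}', horizontal_sum(x * weight) + bias)  # one numbered output per sample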

Let me know if you have any questions!

comidan commented 3 years ago

Hello @olsaarik, I have a very similar problem. I built EVA as mentioned here, using the cmake -DUSE_GALOIS=ON . command, but even so it does not seem to scale to multiple cores.

Here is the code I'm trying to scale with higher input sizes.

from eva import EvaProgram, Input, Output, evaluate
from eva.ckks import CKKSCompiler
from eva.seal import generate_keys
from eva.metric import valuation_mse
from common import *
from random import uniform
import numpy as np
import unittest
import math

batch = 2
dim = 4
dim1 = 4

def linear(x, y):
    return np.matmul(x, y) # + bias

def attention(x):
    temp = np.array([i for i in range(dim*dim1)]).reshape(dim, dim1) / 1000
    query = linear(x, temp)
    key = linear(x, temp)
    value = linear(x, temp)
    return query, key, value

def self_attention(query, key, value):
    temp = linear(query, key)

    temp = linear(temp, np.arange(dim*dim).reshape(dim,dim))

    temp = linear(temp, value)
    return temp

def norm(x):
    return linear(x, np.arange(dim*dim).reshape(dim,dim))

def ff(x):
    return linear(linear(x, np.arange(dim*dim).reshape(dim, dim)/1000), np.arange(dim*dim).reshape(dim, dim)/1000)

def classifier(x):
    v = np.arange(dim*2).reshape(dim,2)/1000
    w = np.arange(2*2).reshape(2,2)/1000
    return np.dot(np.dot(x, v), w)

transformer = EvaProgram('transformer', vec_size=dim*dim1)
with transformer:
    x = np.array([Input(f'x{i}') for i in range(batch*dim*dim1)]).reshape(batch, dim, dim1)
    print(x.shape)
    query, key, value = attention(x)
    attn = self_attention(query, key, value)
    out = norm(attn)
    out = ff(out)
    out = out ** 2
    out = norm(out)
    out = out ** 2
    out = classifier(out)
    print(out.shape)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                Output(f'y{i}_{j}_{k}', out[i][j][k])  # one uniquely named output per element

transformer.set_input_scales(32)
transformer.set_output_ranges(32)

if __name__ == "__main__":
    compiler = CKKSCompiler(config={'security_level':'128', 'warn_vec_size':'false'})
    compiled, params, signature = compiler.compile(transformer)
    print("poly degree: ", params.poly_modulus_degree)
    print("prime bits: ", params.prime_bits)
    print("rotations: ", params.rotations)

But when I run it and monitor it via htop, I don't see any multi-core usage, just one core as before.

What am I doing wrong?

I read your previous comment about adding loops inside the code. As an example, I tried looping the logic 3 times, fetching the input inside the loop as x = z[i] where z = [x, x, x] (if this is what you meant by numbering the inputs).

It would be very helpful to have a working example of multi-core usage for EVA, like the very good ones already provided for general usage.

I hope you get a chance to read this. Thank you for your precious time!