nucypher / nufhe

NuCypher fully homomorphic encryption (NuFHE) library implemented in Python
https://nufhe.readthedocs.io/en/latest/
GNU General Public License v3.0

Using multi-GPU with nuFHE #6

Open Luidy opened 5 years ago

Luidy commented 5 years ago

Hello. I have 2 questions about nuFHE.

  1. I want to connect several GPUs to increase the speed of nuFHE. Is this possible?
  2. I want to compute with 1 ciphertext and 1 encoded text before generating the ciphertext. Is this possible?

I'll wait for your reply. Thanks.

fjarri commented 5 years ago

I want to connect several GPUs to increase the speed of nuFHE. Is this possible?

Technically, yes, although currently you will have to handle the data transfer yourself. The GPU on which the work happens is determined by the Thread object you pass to nufhe functions, so you will need to create several reikna Thread objects, one for each target GPU. It may look like this:

from reikna.cluda import cuda_api

api = cuda_api()
# For CUDA there is always exactly one platform; for OpenCL there can be several.
platform = api.get_platforms()[0]
devices = platform.get_devices()
thr1 = api.Thread.create(devices[0])
thr2 = api.Thread.create(devices[1])

(see the reikna docs for details). Now there are several problems.

First, Thread objects (and the underlying GPU contexts) have separate memory pools. (For CUDA it may be possible to use unified memory, and for OpenCL you can manually create a single context and then create Thread objects for that one context but with different CommandQueues. Either variant will require direct interaction with PyCUDA or PyOpenCL, respectively, and I haven't investigated them in detail, so I don't know how well they would work.) This means that a ciphertext created with thr1 can only be used together with other thr1-based ciphertexts in a gate. If you want to pass data between GPUs, you will have to do it through the CPU.
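
For illustration, a minimal sketch of moving data between two GPUs through the host, using plain reikna arrays (a nufhe ciphertext wraps device arrays like these internally, so the same round trip applies):

import numpy
from reikna.cluda import cuda_api

api = cuda_api()
platform = api.get_platforms()[0]
devices = platform.get_devices()
thr1 = api.Thread.create(devices[0])
thr2 = api.Thread.create(devices[1])

# Allocate and fill an array on the first GPU
a1 = thr1.to_device(numpy.arange(16, dtype=numpy.int32))

# Move it to the second GPU: download to the host, then upload again
a2 = thr2.to_device(a1.get())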

Second, if you just create two Thread objects in a single OS process, it won't give you much of a speedup: gates are not completely asynchronous and will block for the majority of their execution time. So you will need to run multi-process code and exchange the data between processes, for example as sketched below.
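
A rough illustration of that process-level exchange, with plain numpy arrays standing in for serialized ciphertext data (the structure is a placeholder, not nufhe API):

import multiprocessing
import numpy

def producer(queue):
    # In a real setup this would be ciphertext data downloaded from GPU 0
    queue.put(numpy.arange(16, dtype=numpy.int32))

def consumer(queue):
    data = queue.get()
    # ...here `data` would be uploaded to GPU 1 to continue the computation
    print('received:', data)

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    p1 = multiprocessing.Process(target=producer, args=(queue,))
    p2 = multiprocessing.Process(target=consumer, args=(queue,))
    p1.start()
    p2.start()
    p1.join()
    p2.join()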

All in all, it is not straightforward at the moment. I will leave this issue open and try to figure out what kind of interface should be exposed to make multi-GPU usage convenient.

I want to compute with 1 ciphertext and 1 encoded text before generating the ciphertext. Is this possible?

I am not sure I understand you. Could you explain in more detail? And, perhaps, open a separate issue - this one will be reserved for multi-GPU.

hargathor commented 5 years ago

I'm also really interested in the evolution of this ticket.

hargathor commented 5 years ago

Has anyone ever tested this on a VMware vGPU rig?

fjarri commented 5 years ago

I am not sure at the moment what level of abstraction would be best. The minimal version would be something like this (the example is already working on my machine; I just need to polish some things in the implementation). Essentially, this means one nufhe.Context per thread/process, and the user can choose whatever parallel execution model they want - be it threading, multiprocessing, or MPI. Will that be fine for your purposes (as a start, at least)?
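
A hedged sketch of that one-Context-per-process model. This is not the final interface: nufhe.find_devices and the device_id argument of Context are assumptions here, so treat examples/multi_gpu.py (mentioned below) as the authoritative version. The rest uses the documented high-level Context API.

import random
import multiprocessing
import nufhe

def worker(device_id, size):
    # One Context per process; selecting the GPU via device_id is an
    # assumed interface, see examples/multi_gpu.py for the real one
    ctx = nufhe.Context(device_id=device_id)
    secret_key, cloud_key = ctx.make_key_pair()
    vm = ctx.make_virtual_machine(cloud_key)

    bits1 = [random.choice([False, True]) for _ in range(size)]
    bits2 = [random.choice([False, True]) for _ in range(size)]
    ct1 = ctx.encrypt(secret_key, bits1)
    ct2 = ctx.encrypt(secret_key, bits2)

    result_bits = ctx.decrypt(secret_key, vm.gate_nand(ct1, ct2))
    reference = [not (b1 and b2) for b1, b2 in zip(bits1, bits2)]
    print(device_id, 'correct:', list(result_bits) == reference)

if __name__ == '__main__':
    # Assumed helper returning one identifier per available GPU
    device_ids = nufhe.find_devices()
    processes = [
        multiprocessing.Process(target=worker, args=(device_id, 32))
        for device_id in device_ids]
    for p in processes:
        p.start()
    for p in processes:
        p.join()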

It may be possible to do single-thread multi-GPU, but there are several problems to solve. CUDA and OpenCL use different models for that, and I need to check if PyCUDA and PyOpenCL actually expose the corresponding API (and reikna will require an update as well, since it uses the simple multi-GPU model above). Since most computations are batched over ciphertext bits, it probably won't be too hard to split them between GPUs automatically.

Edit: there may be problems if single-kernel bootstrap is not available, which means we'll have 500 kernel calls instead of several, so some internal thread/process pool will be necessary to parallelize that.

fjarri commented 5 years ago

I've added a multi-GPU example (examples/multi_gpu.py, commit 9539b6563d2e4897c869a68f1bdacc8c163b9059) and some supporting internals.