secretflow / spu

SPU (Secure Processing Unit) aims to be a provable, measurable secure computation device, which provides computation ability while keeping your private data protected.
https://www.secretflow.org.cn/docs/spu/en/
Apache License 2.0

[Feature]: CUDA acceleration for Cheetah protocol #503

Open BeStrongok opened 7 months ago

BeStrongok commented 7 months ago

Feature Request Type

Performance

Have you searched existing issues?

No

Is your feature request related to a problem?

No.

Describe features you want to add to SPU

Hi, SPU team: The Cheetah protocol is a high-performance two-party inference protocol that currently runs on CPUs. I'm wondering whether there is a way to apply CUDA acceleration to this protocol, such as the matrix encoding process, or the computation that produces the result for each modulus. Do you have any corresponding development plans? For concreteness, a minimal sketch of the kind of per-modulus kernel I have in mind is below.
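(An illustrative sketch only, not SPU code: pointwise multiplication of two polynomials in RNS form, one thread per coefficient, looping over the residue rings. The kernel name, data layout, and small moduli are assumptions; real HE primes are up to ~60 bits and would need Barrett/Montgomery reduction or 128-bit products instead of a plain `%`.)

```cuda
#include <cstdint>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Pointwise product of two RNS-form polynomials: one thread per
// coefficient, looping over the residue rings. Assumes every
// modulus q < 2^31 so that a*b fits in uint64_t without overflow.
__global__ void rns_pointwise_mul(const uint64_t* a, const uint64_t* b,
                                  uint64_t* out, const uint64_t* moduli,
                                  int num_moduli, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // coefficient index
  if (i >= n) return;
  for (int m = 0; m < num_moduli; ++m) {
    size_t off = (size_t)m * n;
    out[off + i] = (a[off + i] * b[off + i]) % moduli[m];
  }
}

int main() {
  const int n = 4096, L = 3;  // ring degree, number of RNS moduli
  std::vector<uint64_t> ha((size_t)L * n, 3), hb((size_t)L * n, 5);
  std::vector<uint64_t> hq = {65537, 786433, 12289};  // small illustrative primes

  uint64_t *da, *db, *dout, *dq;
  size_t bytes = (size_t)L * n * sizeof(uint64_t);
  cudaMalloc(&da, bytes);
  cudaMalloc(&db, bytes);
  cudaMalloc(&dout, bytes);
  cudaMalloc(&dq, L * sizeof(uint64_t));
  cudaMemcpy(da, ha.data(), bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(db, hb.data(), bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(dq, hq.data(), L * sizeof(uint64_t), cudaMemcpyHostToDevice);

  rns_pointwise_mul<<<(n + 255) / 256, 256>>>(da, db, dout, dq, L, n);
  cudaDeviceSynchronize();

  uint64_t first = 0;
  cudaMemcpy(&first, dout, sizeof(uint64_t), cudaMemcpyDeviceToHost);
  printf("out[0] = %llu (expect 15)\n", (unsigned long long)first);
  cudaFree(da); cudaFree(db); cudaFree(dout); cudaFree(dq);
  return 0;
}
```

A pointwise stage like this is cheap on its own; most of the HE time is spent in NTTs and key switching, which is where a GPU offload would have to land to matter.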

fionser commented 7 months ago

Try playing with https://github.com/privateLLM001/ which already integrates CUDA into the SEAL lib to some extent.

anakinxc commented 7 months ago

Hi @BeStrongok

Based on our experience accelerating the ABY3 matmul with CUDA, the improvement might be marginal.

Consider that MPC protocols usually have tasks the GPU cannot handle, like sending/receiving data over the network, so there are many data movements between GPU and CPU, and I/O becomes a huge bottleneck. From preliminary data collected from the ABY3 GPT-2 inference example, copying data to/from the GPU can take ~95% of the matmul time.
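(A self-contained way to measure that split on your own hardware; the stand-in kernel and sizes below are illustrative, only the `cudaEvent` timing pattern is the point.)

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Stand-in kernel: any elementwise work on shares would do here.
__global__ void dummy_work(long long* d, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) d[i] = d[i] * 3 + 1;
}

// Times H2D copy, kernel, and D2H copy separately with CUDA events,
// to see how much of the round trip is pure data movement.
int main() {
  const int n = 1 << 24;  // ~16M int64 shares (~128 MB)
  std::vector<long long> h(n, 42);
  long long* d = nullptr;
  cudaMalloc(&d, n * sizeof(long long));

  cudaEvent_t t0, t1, t2, t3;
  cudaEventCreate(&t0); cudaEventCreate(&t1);
  cudaEventCreate(&t2); cudaEventCreate(&t3);

  cudaEventRecord(t0);
  cudaMemcpy(d, h.data(), n * sizeof(long long), cudaMemcpyHostToDevice);
  cudaEventRecord(t1);
  dummy_work<<<(n + 255) / 256, 256>>>(d, n);
  cudaEventRecord(t2);
  cudaMemcpy(h.data(), d, n * sizeof(long long), cudaMemcpyDeviceToHost);
  cudaEventRecord(t3);
  cudaEventSynchronize(t3);

  float h2d, kern, d2h;
  cudaEventElapsedTime(&h2d, t0, t1);
  cudaEventElapsedTime(&kern, t1, t2);
  cudaEventElapsedTime(&d2h, t2, t3);
  printf("H2D %.2f ms, kernel %.2f ms, D2H %.2f ms\n", h2d, kern, d2h);
  cudaFree(d);
  return 0;
}
```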

Another common issue is that MPC protocols usually work on integers like int64/int128; these types are not optimized for computation on either CPU or GPU, and they lack support from libraries like cuBLAS.
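(For example, since cuBLAS has no int64 GEMM, a share matmul over Z_2^64 ends up hand-written; a naive sketch, with names and sizes made up for illustration. Note that uint64 overflow wraps mod 2^64, which happens to be exactly the ring semantics shares over Z_2^64 need.)

```cuda
#include <cstdint>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Naive uint64 GEMM: one thread per output element.
// Accumulation wraps mod 2^64, matching arithmetic over Z_{2^64}.
__global__ void matmul_u64(const uint64_t* A, const uint64_t* B,
                           uint64_t* C, int M, int K, int N) {
  int r = blockIdx.y * blockDim.y + threadIdx.y;
  int c = blockIdx.x * blockDim.x + threadIdx.x;
  if (r >= M || c >= N) return;
  uint64_t acc = 0;
  for (int k = 0; k < K; ++k)
    acc += A[r * K + k] * B[k * N + c];  // wraps mod 2^64
  C[r * N + c] = acc;
}

int main() {
  const int M = 64, K = 64, N = 64;
  std::vector<uint64_t> hA(M * K, 2), hB(K * N, 3), hC(M * N);
  uint64_t *dA, *dB, *dC;
  cudaMalloc(&dA, hA.size() * sizeof(uint64_t));
  cudaMalloc(&dB, hB.size() * sizeof(uint64_t));
  cudaMalloc(&dC, hC.size() * sizeof(uint64_t));
  cudaMemcpy(dA, hA.data(), hA.size() * sizeof(uint64_t), cudaMemcpyHostToDevice);
  cudaMemcpy(dB, hB.data(), hB.size() * sizeof(uint64_t), cudaMemcpyHostToDevice);

  dim3 blk(16, 16), grd((N + 15) / 16, (M + 15) / 16);
  matmul_u64<<<grd, blk>>>(dA, dB, dC, M, K, N);
  cudaMemcpy(hC.data(), dC, hC.size() * sizeof(uint64_t), cudaMemcpyDeviceToHost);
  printf("C[0][0] = %llu (expect %d)\n", (unsigned long long)hC[0], 2 * 3 * K);
  cudaFree(dA); cudaFree(dB); cudaFree(dC);
  return 0;
}
```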

But feel free to give it a shot :P

BeStrongok commented 7 months ago

Try playing with https://github.com/privateLLM001/ which already integrates CUDA into the SEAL lib to some extent.

Thanks for pointing out this repo :) I'm also trying to run some experiments applying the CUDA version of SEAL to Cheetah.

BeStrongok commented 7 months ago

Hi @BeStrongok

Based on our experience accelerating the ABY3 matmul with CUDA, the improvement might be marginal.

Consider that MPC protocols usually have tasks the GPU cannot handle, like sending/receiving data over the network, so there are many data movements between GPU and CPU, and I/O becomes a huge bottleneck. From preliminary data collected from the ABY3 GPT-2 inference example, copying data to/from the GPU can take ~95% of the matmul time.

Another common issue is that MPC protocols usually work on integers like int64/int128; these types are not optimized for computation on either CPU or GPU, and they lack support from libraries like cuBLAS.

But feel free to give it a shot :P

Thank you for providing this useful information. :) Yes, there may be a performance bottleneck when accelerating the secret-sharing protocols due to the frequent I/O, but accelerating the homomorphic encryption used in Cheetah may still be useful.

fionser commented 7 months ago

Try playing with https://github.com/privateLLM001/ which already integrates CUDA into the SEAL lib to some extent.

Thanks for pointing out this repo :) I'm also trying to run some experiments applying the CUDA version of SEAL to Cheetah.

If you put the whole key-switching logic onto the GPU, I would expect key switching to be ~60x faster than a single-core CPU implementation. However, it might take "a little bit" of work to do so.

The GPU code in privateLLM001 is pretty simple, and thus the acceleration from their code will be very marginal.

BeStrongok commented 7 months ago

Thanks for your guidance. I just ran the benchmark comparing seal-cuda with SEAL on the BFV scheme, and the acceleration is obvious.
Original version: [benchmark screenshot]
CUDA version: [benchmark screenshot]
But I haven't implemented it in Cheetah yet; this work is indeed non-trivial and requires familiarity with both SEAL and CUDA. I hope I can produce some useful results. :)
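(For an apples-to-apples single-core baseline, the same op can be timed in stock SEAL on the CPU; a minimal sketch assuming the SEAL 3.6+ API, with illustrative parameters that should be matched to whatever seal-cuda was run with.)

```cpp
#include <chrono>
#include <cstdio>
#include <vector>
#include "seal/seal.h"

using namespace seal;

// Times BFV rotate_rows on the CPU as a single-core baseline.
int main() {
  EncryptionParameters parms(scheme_type::bfv);
  size_t n = 8192;  // illustrative; match the seal-cuda benchmark
  parms.set_poly_modulus_degree(n);
  parms.set_coeff_modulus(CoeffModulus::BFVDefault(n));
  parms.set_plain_modulus(PlainModulus::Batching(n, 20));
  SEALContext context(parms);

  KeyGenerator keygen(context);
  PublicKey pk;  keygen.create_public_key(pk);
  GaloisKeys gk; keygen.create_galois_keys(gk);
  Encryptor enc(context, pk);
  Evaluator eval(context);
  BatchEncoder encoder(context);

  std::vector<uint64_t> slots(encoder.slot_count(), 1);
  Plaintext pt;  encoder.encode(slots, pt);
  Ciphertext ct; enc.encrypt(pt, ct);

  const int iters = 100;
  Ciphertext out;
  auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; ++i)
    eval.rotate_rows(ct, 1, gk, out);  // dominated by key switching
  auto t1 = std::chrono::steady_clock::now();
  double us = std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
  printf("rotate_rows: %.1f us/op\n", us);
  return 0;
}
```

Since rotate_rows is dominated by key switching, it is a reasonable proxy for the ~60x key-switching question raised above.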

fionser commented 7 months ago

Less than 10x on RotateRows is less impressive to me, since a 10-core CPU is much easier to get hold of than an x100 NV card.