Paper 2024/1881
THOR: Secure Transformer Inference with Homomorphic Encryption
Jungho Moon, Hanyang University
Dongwoo Yoo, Yonsei University
Xiaoqian Jiang, University of Texas, Health Science Center at Houston
Miran Kim, Hanyang University
Abstract
As language models are increasingly deployed in cloud environments, privacy concerns have become a significant issue. To address this, we design THOR, a secure inference framework for transformer models on encrypted data. Specifically, we first propose new fast matrix multiplication algorithms based on diagonal-major order encoding and extend them to parallel matrix computation through the compact ciphertext packing technique. Second, we design efficient protocols for secure computations of four non-linear functions, namely softmax, LayerNorm, GELU, and Tanh, by integrating advanced underlying approximation methods with tailored optimizations. Our matrix multiplication algorithms reduce the number of key-switching operations in the linear layers of the attention block in the BERT-base model by up to 14.5x, compared to the state-of-the-art HE-based secure inference protocol (Park et al., Preprint). Combined with cryptographic optimizations, our experimental results demonstrate that THOR provides secure inference for the BERT-base model with a latency of 10.43 minutes on a single GPU, while maintaining comparable inference accuracy on the MRPC dataset.
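The abstract does not spell out the encoding, but "diagonal-major order" packing in HE matrix multiplication usually means laying a matrix out by its wrapped diagonals so that a matrix-vector product needs only rotations, element-wise multiplications, and additions, which are the operations CKKS supports cheaply. Below is a minimal plaintext sketch of that general idea only; the function names are mine, plain NumPy arrays stand in for ciphertexts, and THOR's actual algorithms (and their extension to batched matrix-matrix products via compact packing) are certainly more involved.

```python
import numpy as np

def diagonal_encode(A):
    """Pack a square matrix by its wrapped diagonals.

    diags[k][i] = A[i, (i + k) % n], i.e. the k-th generalized diagonal.
    Under CKKS each diagonal would occupy one plaintext's slots; here it
    is just a NumPy row so the arithmetic can be checked in the clear.
    """
    n = A.shape[0]
    return np.stack([np.array([A[i, (i + k) % n] for i in range(n)])
                     for k in range(n)])

def rotate(v, k):
    """Cyclic left rotation, the plaintext analogue of a slot rotation."""
    return np.roll(v, -k)

def matvec_diagonal(diags, x):
    """Matrix-vector product using only rotations, element-wise
    multiplications, and additions:  A @ x = sum_k diag_k * rot(x, k).
    """
    n = len(x)
    acc = np.zeros(n)
    for k in range(n):
        acc += diags[k] * rotate(x, k)
    return acc

# Sanity check against ordinary matrix multiplication.
A = np.random.randn(4, 4)
x = np.random.randn(4)
assert np.allclose(matvec_diagonal(diagonal_encode(A), x), A @ x)
```

The appeal of this layout is that the number of rotations (and hence key-switching operations, the cost the paper targets) grows with the number of non-empty diagonals rather than with a naive row-by-row evaluation.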
The two things that stand out in the abstract are the tweaks to matrix packing and the potentially more efficient ways of evaluating some of the non-linear functions. However, it is not clear to me how closely these are tailored to the particular BERT architecture the authors target.
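On the non-linear side, the abstract does not say which "advanced approximation methods" are used, but evaluating GELU or Tanh under CKKS generally means replacing the function with a polynomial that is accurate on a bounded input range, since the scheme only offers additions and multiplications. The sketch below is my own rough illustration of the degree-versus-accuracy trade-off using a least-squares Chebyshev fit on an assumed interval [-8, 8]; it is not the paper's method, and the chosen degrees and interval are arbitrary.

```python
import numpy as np
from math import erf, sqrt
from numpy.polynomial import chebyshev as C

def gelu(x):
    """Exact GELU, used only as the reference function to approximate."""
    return np.array([0.5 * t * (1.0 + erf(t / sqrt(2.0))) for t in x])

# HE schemes like CKKS cannot evaluate erf or exp directly, so non-linear
# layers are replaced by polynomials accurate on a bounded input range.
xs = np.linspace(-8.0, 8.0, 4001)
for degree in (7, 15, 31):
    # Higher degree means more ciphertext multiplications (and more
    # multiplicative depth) in exchange for a smaller approximation error.
    coeffs = C.chebfit(xs, gelu(xs), degree)
    err = np.max(np.abs(C.chebval(xs, coeffs) - gelu(xs)))
    print(f"degree {degree:2d}: max |error| on [-8, 8] = {err:.2e}")
```

Whether those approximations are generic or tuned to the activation ranges that actually occur inside BERT-base is exactly the kind of detail the abstract leaves open.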
https://eprint.iacr.org/2024/1881