Improve `Block` with SIMD

sinui0 commented 1 year ago

Right now the backing type of Block is [u8; 16] as it was simple to start with. However, core arrays can not always take advantage of auto-vectorization/SIMD for important operations.

For example, on my machine, it currently takes ~3ns to compute BitXor between two blocks. If using std::arch::x86_64::__mm128i this would be <1ns. We can of course just convert the [u8; 16] to/from __mm128i for the operation, but the overhead of doing so erases any performance gains.

Block would ideally use an u8x16 SIMD backing type depending on target architecture. Unfortunately std::simd is not stable yet so we will have to provide implementations for the various architectures.

To briefly motivate doing this, after some quick profiling and napkin math it seems that garbling a circuit is currently bottlenecked on BitXor, not the CCR hashing!

For example, the AES circuit has 6500 AND and 30,163 XOR gates. Garbling this circuit currently takes ~350 microseconds on my machine. Garbling an AND gate requires up to 9 XOR ops. In total that is 88,663 XOR ops to garble AES, which adds up to 266 microseconds or roughly 75% of CPU time.

xiangxiecrypto commented 1 year ago

you can take the implementation from emp-rust https://github.com/pado-labs/emp-rust/blob/main/emp-tool/src/block.rs

sinui0 commented 1 year ago

@xiangxiecrypto thanks will definitely borrow from there.

I quickly hacked together something locally for x86 and found only an 8% improvement for garbling AES. So I suspect my napkin math/profiling above was incorrect. Nonetheless this feature should be pursued.

privacy-scaling-explorations / mpz

Improve `Block` with SIMD #84