markkurossi / mpc

Secure Multi-Party Computation (MPC) with Go. This project implements secure two-party computation with the garbled circuit protocol.
https://www.markkurossi.com/mpcl/index.html
MIT License

Add batched I/O for CO OT #13

Closed rozag closed 1 year ago

rozag commented 1 year ago

Backstory

I've been working on a server that acts as a garbler for several quite large circuits. Its performance was unacceptably poor. The profiler made it clear that most of the time was spent waiting on connection reads/writes. After digging some more, I noticed that ot.CO makes a lot of very small flushes. In my case, ~20 MiB of data is transferred over the TCP connection, which would be fine as a handful of huge batches, but ot.CO Send and Receive were writing 40-72 bytes, flushing, and waiting for the "response" to arrive before sending the next part. The ot/co.go header mentions that the implementation is based on emp-toolkit, and I noticed that emp-toolkit separates the stages: write everything first, then flush, then wait for the responses. CO instead runs a loop with a send-flush-read body, which makes p2p.Conn flush these tiny batches of data.
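To make the shape of the problem concrete, here is a minimal sketch of that send-flush-read loop. This is not the actual ot/co.go code: connIO and its methods are hypothetical stand-ins for the buffered p2p connection, and the message contents are elided.

```go
package sketch

// connIO is a hypothetical stand-in for the buffered p2p connection API.
type connIO interface {
	SendData(b []byte) error      // buffer one outgoing OT message
	ReceiveData() ([]byte, error) // blocking read of the peer's reply
	Flush() error                 // push the buffer onto the TCP connection
}

// perItemRoundTrips flushes and waits after every message, so each
// 40-72 byte OT message costs a full network round trip.
func perItemRoundTrips(conn connIO, msgs [][]byte) error {
	for _, m := range msgs {
		if err := conn.SendData(m); err != nil {
			return err
		}
		if err := conn.Flush(); err != nil { // tiny flush per message
			return err
		}
		if _, err := conn.ReceiveData(); err != nil { // wait for the reply
			return err
		}
	}
	return nil
}
```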

Fix

I propose to use the same stage separation approach: here, we only flush once after everything is written. The downside is that we loop over the data twice, but the overall cost is still O(n). I've already tested this fix with my server: execution time dropped from 4-5 minutes to 12-14 seconds, which is a huge difference.
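For contrast, the same loop with the stages separated, again only a sketch against the hypothetical connIO interface from the snippet above, not the actual patch:

```go
// batchedRoundTrip buffers all writes, flushes once, and only then reads
// the replies: two passes over the data, a single network round trip.
func batchedRoundTrip(conn connIO, msgs [][]byte) error {
	// Stage 1: buffer every outgoing message without flushing.
	for _, m := range msgs {
		if err := conn.SendData(m); err != nil {
			return err
		}
	}
	// One flush for the whole batch.
	if err := conn.Flush(); err != nil {
		return err
	}
	// Stage 2: read all replies; the second pass keeps the cost linear.
	for range msgs {
		if _, err := conn.ReceiveData(); err != nil {
			return err
		}
	}
	return nil
}
```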

Tests

$ cd ot && go test -v --count 1
Output

```
=== RUN   TestCO
--- PASS: TestCO (0.00s)
=== RUN   TestLabel
--- PASS: TestLabel (0.00s)
=== RUN   TestOTCO
--- PASS: TestOTCO (0.01s)
=== RUN   TestOTRSA
--- PASS: TestOTRSA (0.95s)
=== RUN   TestPipe
--- PASS: TestPipe (0.00s)
PASS
ok      github.com/markkurossi/mpc/ot   1.328s
```

Benchmarks

$ cd ot && go test --bench .

This change doesn't seem to influence the current benchmarks much:

Before

```
goos: darwin
goarch: amd64
pkg: github.com/markkurossi/mpc/ot
cpu: Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
BenchmarkCO-8                    3817     277631 ns/op
BenchmarkLabelMul2-8        542378509      2.217 ns/op
BenchmarkLabelMul4-8        522979458      2.205 ns/op
BenchmarkLabelXor-8        1000000000     0.3096 ns/op
BenchmarkOTCO_1-8                4683     238019 ns/op
BenchmarkOTCO_8-8                1036    1133589 ns/op
BenchmarkOTCO_16-8                537    2197241 ns/op
BenchmarkOTCO_32-8                278    4344838 ns/op
BenchmarkOTCO_64-8                139    8550312 ns/op
BenchmarkOTRSA_2048_1-8            97   12104427 ns/op
BenchmarkOTRSA_2048_8-8            12   97815929 ns/op
BenchmarkOTRSA_2048_64-8            2  779272778 ns/op
BenchmarkRSA512-8                3300     349835 ns/op
BenchmarkRSA1024-8                619    1904981 ns/op
BenchmarkRSA2048-8                 96   12051810 ns/op
PASS
ok      github.com/markkurossi/mpc/ot   23.257s
```
After

```
goos: darwin
goarch: amd64
pkg: github.com/markkurossi/mpc/ot
cpu: Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
BenchmarkCO-8                    4022     278031 ns/op
BenchmarkLabelMul2-8        541403588      2.209 ns/op
BenchmarkLabelMul4-8        547191450      2.188 ns/op
BenchmarkLabelXor-8        1000000000     0.3042 ns/op
BenchmarkOTCO_1-8                4851     243142 ns/op
BenchmarkOTCO_8-8                 739    1596737 ns/op
BenchmarkOTCO_16-8                370    3189234 ns/op
BenchmarkOTCO_32-8                182    6561498 ns/op
BenchmarkOTCO_64-8                 93   12940329 ns/op
BenchmarkOTRSA_2048_1-8            98   12186653 ns/op
BenchmarkOTRSA_2048_8-8            12   96631495 ns/op
BenchmarkOTRSA_2048_64-8            2  772423033 ns/op
BenchmarkRSA512-8                3055     353650 ns/op
BenchmarkRSA1024-8                634    1876732 ns/op
BenchmarkRSA2048-8                 99   12032335 ns/op
PASS
ok      github.com/markkurossi/mpc/ot   23.513s
```

At first I wanted to extend ot.Pipe with something like NewLatencyPipe(time.Duration) and time.Sleep on e.g. each Write, to better represent my case of frequent flushes over a real TCP connection. The problem with Pipe here is that it doesn't really use Flush; it simply reads/writes on demand. A p2p.Conn over something pipe-like with this latency emulation would probably be a more representative benchmark, but io.Pipe won't do in this case because we'd deadlock. djherbis/nio has a nice pipe that would probably work, but adding a dependency just for a benchmark seems like overkill. So I decided to skip benchmark changes for now, but I'd love to hear your opinion on this. If you find the proposed approach reasonable, I can do it.
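For reference, here is a rough standalone sketch of the latency-emulation idea, written as a plain io.Writer wrapper; the names are made up and this is not a change to ot.Pipe or p2p.Conn, just an illustration of the per-Write delay.

```go
package sketch

import (
	"io"
	"time"
)

// latencyWriter delays every Write by a fixed duration, so code that
// flushes often accumulates proportionally more simulated latency.
type latencyWriter struct {
	w     io.Writer
	delay time.Duration
}

func (lw *latencyWriter) Write(p []byte) (int, error) {
	time.Sleep(lw.delay)
	return lw.w.Write(p)
}

// newLatencyWriter wraps w with a per-Write delay; a hypothetical helper,
// not part of the mpc packages.
func newLatencyWriter(w io.Writer, d time.Duration) io.Writer {
	return &latencyWriter{w: w, delay: d}
}
```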

markkurossi commented 1 year ago

Thanks for the pull request, it makes sense. My initial idea was to optimize memory allocations by doing each bit's OT separately, but of course this write-sync-read loop kills performance.

No need to simulate the latency in the benchmark; I would rather avoid the extra dependency.