Closed · rozag closed this pull request 1 year ago
Thanks for the pull request, makes sense. My initial idea was to optimize the memory allocations by doing each bit OT separately, but of course this write-sync-read loop kills performance.
No need to simulate the latency on the benchmark, I would rather avoid the extra dependency.
Backstory
I've been working on a server that acts as a garbler for several quite large circuits. Its performance was unacceptably poor. A profiler made it clear that most of the time is spent waiting on connection reads and writes. After digging some more, I noticed that `ot.CO` makes a lot of very small flushes. In my case, ~20 MiB of data is transferred over the TCP connection, which would be fine as several huge batches, but `ot.CO`'s `Send` and `Receive` were writing 40-72 bytes, flushing, and waiting for the "response" to arrive before sending the next part. The `ot/co.go` header mentions that the implementation is based on emp-toolkit, and I noticed that emp-toolkit separates the stages: write everything first, then flush, then wait for the response. `CO`, on the other hand, runs a loop with send-flush-read inside, which makes `p2p.Conn` flush these small batches of data.

Fix
I propose to use the same stage-separation approach: basically, we flush only once, after everything is written. The downside is that we loop twice, but it's still O(n).
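As a minimal, runnable sketch of the proposed shape (the byte-slice messages, the `sendBatched` helper, and the echo peer are hypothetical; the real `ot.CO` exchanges curve points over `p2p.Conn`, so this illustrates the pattern, not the actual diff):

```
package main

import (
	"bufio"
	"fmt"
	"io"
	"net"
)

// sendBatched illustrates the stage separation: write all messages,
// flush once, and only then read the responses, instead of paying a
// write-flush-read round trip per OT instance.
func sendBatched(w *bufio.Writer, r io.Reader, msgs, resps [][]byte) error {
	for _, m := range msgs {
		if _, err := w.Write(m); err != nil {
			return err
		}
	}
	// Single flush for the whole batch; the old loop flushed (and then
	// blocked on a read) after every 40-72 byte message.
	if err := w.Flush(); err != nil {
		return err
	}
	for _, buf := range resps {
		if _, err := io.ReadFull(r, buf); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	c1, c2 := net.Pipe()
	go io.Copy(c2, c2) // echo peer, so the synchronous pipe doesn't block forever

	msgs := [][]byte{[]byte("msg0"), []byte("msg1")}
	resps := [][]byte{make([]byte, 4), make([]byte, 4)}
	if err := sendBatched(bufio.NewWriter(c1), c1, msgs, resps); err != nil {
		panic(err)
	}
	fmt.Printf("%s %s\n", resps[0], resps[1])
}
```

The point is simply that the connection sees one large write per batch instead of one tiny flush plus a blocking read per OT.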
I've already tested this fix with my server: execution took 4-5 minutes before and 12-14 seconds after, which is a huge difference.

Tests
Output
```
=== RUN   TestCO
--- PASS: TestCO (0.00s)
=== RUN   TestLabel
--- PASS: TestLabel (0.00s)
=== RUN   TestOTCO
--- PASS: TestOTCO (0.01s)
=== RUN   TestOTRSA
--- PASS: TestOTRSA (0.95s)
=== RUN   TestPipe
--- PASS: TestPipe (0.00s)
PASS
ok      github.com/markkurossi/mpc/ot   1.328s
```

Benchmarks
This change doesn't seem to influence the current benchmarks much:
Before
```
goos: darwin
goarch: amd64
pkg: github.com/markkurossi/mpc/ot
cpu: Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
BenchmarkCO-8                    3817      277631 ns/op
BenchmarkLabelMul2-8        542378509       2.217 ns/op
BenchmarkLabelMul4-8        522979458       2.205 ns/op
BenchmarkLabelXor-8        1000000000      0.3096 ns/op
BenchmarkOTCO_1-8                4683      238019 ns/op
BenchmarkOTCO_8-8                1036     1133589 ns/op
BenchmarkOTCO_16-8                537     2197241 ns/op
BenchmarkOTCO_32-8                278     4344838 ns/op
BenchmarkOTCO_64-8                139     8550312 ns/op
BenchmarkOTRSA_2048_1-8            97    12104427 ns/op
BenchmarkOTRSA_2048_8-8            12    97815929 ns/op
BenchmarkOTRSA_2048_64-8            2   779272778 ns/op
BenchmarkRSA512-8                3300      349835 ns/op
BenchmarkRSA1024-8                619     1904981 ns/op
BenchmarkRSA2048-8                 96    12051810 ns/op
PASS
ok      github.com/markkurossi/mpc/ot   23.257s
```

After
```
goos: darwin
goarch: amd64
pkg: github.com/markkurossi/mpc/ot
cpu: Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
BenchmarkCO-8                    4022      278031 ns/op
BenchmarkLabelMul2-8        541403588       2.209 ns/op
BenchmarkLabelMul4-8        547191450       2.188 ns/op
BenchmarkLabelXor-8        1000000000      0.3042 ns/op
BenchmarkOTCO_1-8                4851      243142 ns/op
BenchmarkOTCO_8-8                 739     1596737 ns/op
BenchmarkOTCO_16-8                370     3189234 ns/op
BenchmarkOTCO_32-8                182     6561498 ns/op
BenchmarkOTCO_64-8                 93    12940329 ns/op
BenchmarkOTRSA_2048_1-8            98    12186653 ns/op
BenchmarkOTRSA_2048_8-8            12    96631495 ns/op
BenchmarkOTRSA_2048_64-8            2   772423033 ns/op
BenchmarkRSA512-8                3055      353650 ns/op
BenchmarkRSA1024-8                634     1876732 ns/op
BenchmarkRSA2048-8                 99    12032335 ns/op
PASS
ok      github.com/markkurossi/mpc/ot   23.513s
```

At first I wanted to update `ot.Pipe` with something like `NewLatencyPipe(time.Duration)` and to perform a `time.Sleep` on e.g. each `Write`, to better represent my case of flushing often on a real TCP connection. The problem with `Pipe` here is that it doesn't really use `Flush`; it simply reads and writes on demand. `p2p.Conn` over something pipe-like with this latency emulation would probably be a more representative benchmark, but `io.Pipe` won't do in this case: we'd deadlock. djherbis/nio has a nice pipe that would probably do, but adding a dependency just for a benchmark seems like overkill. So I decided to avoid benchmark changes for now, but I would love to hear your opinion on this. If you find the proposed approach reasonable, I can do it.
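For what it's worth, here is a minimal sketch of the latency-emulation idea, assuming a hypothetical `latencyRW` wrapper and some buffered pipe underneath (as noted above, `io.Pipe` would deadlock because its writes block until the peer reads):

```
package ot_test

import (
	"io"
	"time"
)

// latencyRW is a hypothetical benchmark helper in the spirit of the
// NewLatencyPipe(time.Duration) idea: every Write pays a fixed delay,
// approximating one network round trip on a real TCP connection, so a
// benchmark that flushes often pays the latency often. The wrapped
// ReadWriter must be a buffered pipe (e.g. djherbis/nio).
type latencyRW struct {
	io.ReadWriter
	delay time.Duration
}

func (l *latencyRW) Write(p []byte) (int, error) {
	time.Sleep(l.delay) // simulate per-write network latency
	return l.ReadWriter.Write(p)
}
```

A `p2p.Conn` built over two such endpoints should then make the difference in flush counts show up directly in the OTCO benchmarks.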