Closed th4s closed 3 weeks ago
PR #558 is used for investigating.
This deadlock is hard to fix. Things I tried so far:
static
variables, i.e. shared state between unit tests -> nothing foundmpz-garble/deap
-> nothing found so far. .lock().unwrap()
with try_lock().unwrap()
-> deadlock vanishes: https://github.com/th4s/tlsn/pull/5set_default
in that it omits some logging statements. Couldn't make any sense out of it, will probably create an issue for this. However sometimes I was able to capture a little bit of output. This can be found in the failed runs in these PRsset_global_default
the deadlock vanishes.println!
statements the deadlock vanishes -> https://github.com/th4s/tlsn/pull/4Saying the deadlock vanishes means that I tried to run it several times (4-5) and didn't encounter it.
Making some progress by carefully adding println!
statements. This run is interesting: https://github.com/th4s/tlsn/actions/runs/10506339441/job/29106915720
It hangs in decrypt_private -> verify_tag -> compute_tag_share -> share_keystream_block -> compute -> execute
It does not seem to hang in setup_assigned_values
.
New hanging run with more logging: https://github.com/th4s/tlsn/actions/runs/10507218244/job/29108495136
It hangs inside generate
(also in evaluate
but this is probably just a consequence of waiting for ).generate
My current bet is that it has to do with feed, flush
and the limited backpressure of the TestSTExecutor
.
New hanging run with more fine-grained logging: https://github.com/th4s/tlsn/actions/runs/10509969553/job/29118455831
Seems to be hanging here: https://github.com/privacy-scaling-explorations/mpz/blob/7783045420272defb0caf52c0b7ceeab03badb55/crates/mpz-garble/src/generator/mod.rs#L309-L318
This run https://github.com/th4s/tlsn/actions/runs/10523156760 provides evidence that it hangs on the evaluator side here: https://github.com/privacy-scaling-explorations/mpz/blob/8a6ae620fc67176111935a8b1ac8b10c63d8e253/crates/mpz-garble/src/evaluator/mod.rs#L393-L470
When running tests from the workspace root, the feature rayon
although not specified, is activated by tlsn-verifier
and tlsn-prover
. This means the aead tests are run with feature rayon
enabled.
What happens is the following:
1) When running our test-suite from the workspace root, tlsn-verifier/tlsn-prover
activate rayon
for mpz-common
. This means that although not specified, tlsn-aead
unit tests are run with this feature enabled.
2) When running unit tests, Rust uses 4 threads to parallelize test execution. But in each aead test the leader AND and the follower make use of rayon::spawn
.
3) Rayon uses std::thread::available_parallelism
to estimate the amount of parallelism (when the rayon threadpool is left unconfigured as we do), i.e. how many tasks are run in parallel in the rayon threadpool. I determined this to be 4
for our github CI.
4) All the 4 rayon-spawned evaluator tasks use expect_next
in evaluate
to get garbling batches from the generator. This means that when all 4 evaluator tasks start before a single generator can be run, the rayon threadpool is saturated. The generator can only queue their rayon tasks to send garbling batches but they are never executed because the evaluator tasks will never finish (since they are waiting for garbling batches) -> deadlock
:melting_face:
good work. Short term fix is to increase thread count, or decrease parallelism?
Yeah I think I am leaning towards increasing the thread count for the aead unit tests.
Fixed in #575
PR #547 comments the aead tests, which are currently timing out. This needs further investigation as it is not locally reproducible.