Inspect timing-out aead tests

th4s commented 2 months ago

PR #547 comments the aead tests, which are currently timing out. This needs further investigation as it is not locally reproducible.

th4s commented 1 month ago

PR #558 is used for investigating.

th4s commented 1 month ago

This deadlock is hard to fix. Things I tried so far:

I could not reproduce this bug on my local machine, even when using act and running it in docker.
Deadlock seems to vanish when executing a single AEAD test. It is strange that either all 4 of the started aead tests hang or all pass. I never observed some test passing and another hanging.
Looking for static variables, i.e. shared state between unit tests -> nothing found
Manual code inspection, especially of mpz-garble/deap-> nothing found so far.
Replacing all .lock().unwrap() with try_lock().unwrap() -> deadlock vanishes: https://github.com/th4s/tlsn/pull/5
Adding tracing.
- There seems to be a problem with set_default in that it omits some logging statements. Couldn't make any sense out of it, will probably create an issue for this. However sometimes I was able to capture a little bit of output. This can be found in the failed runs in these PRs
- When trying to add proper tracing via set_global_default the deadlock vanishes.
Removing tracing and using println! statements the deadlock vanishes -> https://github.com/th4s/tlsn/pull/4
Trying to use Tokio's task dump feature -> does not work because the future running does not support it. You get an empty log.
Using shuttle testing locally to trigger the deadlock. -> all tests pass

Saying the deadlock vanishes means that I tried to run it several times (4-5) and didn't encounter it.

th4s commented 1 month ago

Making some progress by carefully adding println! statements. This run is interesting: https://github.com/th4s/tlsn/actions/runs/10506339441/job/29106915720 It hangs in decrypt_private -> verify_tag -> compute_tag_share -> share_keystream_block -> compute -> execute

It does not seem to hang in setup_assigned_values.

th4s commented 1 month ago

New hanging run with more logging: https://github.com/th4s/tlsn/actions/runs/10507218244/job/29108495136 It hangs inside generate (also in evaluate ~~but this is probably just a consequence of waiting for generate~~).

~~My current bet is that it has to do with feed, flush and the limited backpressure of the TestSTExecutor.~~

th4s commented 1 month ago

New hanging run with more fine-grained logging: https://github.com/th4s/tlsn/actions/runs/10509969553/job/29118455831

Seems to be hanging here: https://github.com/privacy-scaling-explorations/mpz/blob/7783045420272defb0caf52c0b7ceeab03badb55/crates/mpz-garble/src/generator/mod.rs#L309-L318

th4s commented 1 month ago

This run https://github.com/th4s/tlsn/actions/runs/10523156760 provides evidence that it hangs on the evaluator side here: https://github.com/privacy-scaling-explorations/mpz/blob/8a6ae620fc67176111935a8b1ac8b10c63d8e253/crates/mpz-garble/src/evaluator/mod.rs#L393-L470

th4s commented 1 month ago

When running tests from the workspace root, the feature rayon although not specified, is activated by tlsn-verifier and tlsn-prover. This means the aead tests are run with feature rayon enabled.

th4s commented 1 month ago

What happens is the following: 1) When running our test-suite from the workspace root, tlsn-verifier/tlsn-prover activate rayon for mpz-common. This means that although not specified, tlsn-aead unit tests are run with this feature enabled. 2) When running unit tests, Rust uses 4 threads to parallelize test execution. But in each aead test the leader AND and the follower make use of rayon::spawn. 3) Rayon uses std::thread::available_parallelism to estimate the amount of parallelism (when the rayon threadpool is left unconfigured as we do), i.e. how many tasks are run in parallel in the rayon threadpool. I determined this to be 4 for our github CI. 4) All the 4 rayon-spawned evaluator tasks use expect_next in evaluate to get garbling batches from the generator. This means that when all 4 evaluator tasks start before a single generator can be run, the rayon threadpool is saturated. The generator can only queue their rayon tasks to send garbling batches but they are never executed because the evaluator tasks will never finish (since they are waiting for garbling batches) -> deadlock

sinui0 commented 1 month ago

:melting_face:

good work. Short term fix is to increase thread count, or decrease parallelism?

th4s commented 1 month ago

Yeah I think I am leaning towards increasing the thread count for the aead unit tests.

heeckhau commented 3 weeks ago

Fixed in #575

tlsnotary / tlsn

Inspect timing-out aead tests #548