tlsnotary / tlsn

Rust implementation of the TLSNotary protocol
https://tlsnotary.org
263 stars 69 forks source link

Inspect timing-out aead tests #548

Closed th4s closed 3 weeks ago

th4s commented 2 months ago

PR #547 comments the aead tests, which are currently timing out. This needs further investigation as it is not locally reproducible.

th4s commented 1 month ago

PR #558 is used for investigating.

th4s commented 1 month ago

This deadlock is hard to fix. Things I tried so far:

Saying the deadlock vanishes means that I tried to run it several times (4-5) and didn't encounter it.

th4s commented 1 month ago

Making some progress by carefully adding println! statements. This run is interesting: https://github.com/th4s/tlsn/actions/runs/10506339441/job/29106915720 It hangs in decrypt_private -> verify_tag -> compute_tag_share -> share_keystream_block -> compute -> execute

It does not seem to hang in setup_assigned_values.

th4s commented 1 month ago

New hanging run with more logging: https://github.com/th4s/tlsn/actions/runs/10507218244/job/29108495136 It hangs inside generate (also in evaluate but this is probably just a consequence of waiting for generate).

My current bet is that it has to do with feed, flush and the limited backpressure of the TestSTExecutor.

th4s commented 1 month ago

New hanging run with more fine-grained logging: https://github.com/th4s/tlsn/actions/runs/10509969553/job/29118455831

Seems to be hanging here: https://github.com/privacy-scaling-explorations/mpz/blob/7783045420272defb0caf52c0b7ceeab03badb55/crates/mpz-garble/src/generator/mod.rs#L309-L318

th4s commented 1 month ago

This run https://github.com/th4s/tlsn/actions/runs/10523156760 provides evidence that it hangs on the evaluator side here: https://github.com/privacy-scaling-explorations/mpz/blob/8a6ae620fc67176111935a8b1ac8b10c63d8e253/crates/mpz-garble/src/evaluator/mod.rs#L393-L470

th4s commented 1 month ago

When running tests from the workspace root, the feature rayon although not specified, is activated by tlsn-verifier and tlsn-prover. This means the aead tests are run with feature rayon enabled.

th4s commented 1 month ago

What happens is the following: 1) When running our test-suite from the workspace root, tlsn-verifier/tlsn-prover activate rayon for mpz-common. This means that although not specified, tlsn-aead unit tests are run with this feature enabled. 2) When running unit tests, Rust uses 4 threads to parallelize test execution. But in each aead test the leader AND and the follower make use of rayon::spawn. 3) Rayon uses std::thread::available_parallelism to estimate the amount of parallelism (when the rayon threadpool is left unconfigured as we do), i.e. how many tasks are run in parallel in the rayon threadpool. I determined this to be 4 for our github CI. 4) All the 4 rayon-spawned evaluator tasks use expect_next in evaluate to get garbling batches from the generator. This means that when all 4 evaluator tasks start before a single generator can be run, the rayon threadpool is saturated. The generator can only queue their rayon tasks to send garbling batches but they are never executed because the evaluator tasks will never finish (since they are waiting for garbling batches) -> deadlock

sinui0 commented 1 month ago

:melting_face:

good work. Short term fix is to increase thread count, or decrease parallelism?

th4s commented 1 month ago

Yeah I think I am leaning towards increasing the thread count for the aead unit tests.

heeckhau commented 3 weeks ago

Fixed in #575