akoshelev opened 1 week ago
It is still happening, even after #1325: https://github.com/private-attribution/ipa/actions/runs/11136370331/job/30948030558?pr=1327
The current failure symptom is slightly different from the original one. I'm also not clear on the relation between this issue and the stack overflow issue.
Is it possible that the iteration reduction in #1314 was not necessary, and if that's the case, do we want to revert it?
If some of the stack overflows are manifesting as hangs, should that be tracked as a separate issue from the overflows themselves?
> If some of the stack overflows are manifesting as hangs, should that be tracked as a separate issue from the overflows themselves?
Yeah, we probably want @cberkhoff here, as he was investigating that issue. I don't have a good understanding of whether these hangs were caused by stack overflows or not.
> Is it possible that the iteration reduction in https://github.com/private-attribution/ipa/pull/1314 was not necessary, and if that's the case, do we want to revert it?
Potentially. I am still suspicious about the time it takes to finish this test:
```
2024-09-24T17:09:50.271047Z INFO breakdown_reveal_aggregation{total=1474}:apply_dp_padding:shuffle_attribution_outputs: ipa_core::protocol::ipa_prf::shuffle: new
2024-09-24T17:11:11.848035Z INFO breakdown_reveal_aggregation{total=1474}:shuffle_attribution_outputs: ipa_core::protocol::ipa_prf::shuffle: close time.busy=29.2s time.idle=52.3s
...
2024-09-24T17:11:11.928115Z INFO breakdown_reveal_aggregation{total=1474}:reveal_breakdowns{total=2622}: ipa_core::protocol::ipa_prf::aggregation::breakdown_reveal: new
2024-09-24T17:12:05.641235Z INFO breakdown_reveal_aggregation{total=1474}:aggregate_values{num_rows=118}: ipa_core::protocol::ipa_prf::aggregation: new
```
These are very long waits, so it could be Shuttle running in circles somewhere.
The stack overflows were causing the test to fail, not hang.
Hmm, that's weird. Do you have an example of such a failure on CI?
Some further observations about `cargo test -p ipa-core --release --features "shuttle multi-threading" -- protocol::ipa_prf::tests::malicious`:
- The test runs up against shuttle's `max_steps` limit.
- Most of the scheduler steps are spent in `compute_and_add_tags`. (Caveat: I determined this by watching the `shuttle::current::context_switches()` counter; see the sketch after this list. I am not certain this methodology is sound.) I filed #1330, but I didn't work on it right now, because we need to get CI healthy ASAP, and even with an improved `compute_and_add_tags`, we might still need less padding to make the test fast enough.
- I am still not sure which of the following is/are true of the failures in CI:
  - the test hits shuttle's `max_steps` limit, or
  - a panic fails to propagate to all tasks (either a bug in shuttle or in how we tear down seq/parallel join on error), which allows the test to continue running.
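(For reference, a minimal sketch of that counter-watching methodology; the helper and the measured region are illustrative, not actual ipa-core code, and this is only meaningful inside a shuttle execution.)

```rust
use shuttle::current;

// Run `f` and report how many scheduler steps it consumed, by sampling
// shuttle's context-switch counter before and after. Only valid inside
// a shuttle execution (e.g. under `shuttle::check_random`).
fn count_steps<T>(label: &str, f: impl FnOnce() -> T) -> T {
    let before = current::context_switches();
    let result = f();
    let consumed = current::context_switches() - before;
    println!("{label}: {consumed} scheduler steps");
    result
}
```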
One thing I noticed when working with Shuttle is that it does not like the standard rng. It vends its own random generator, which should be used instead: https://docs.rs/shuttle/latest/shuttle/rand/index.html. I think we export that as `rng`.

If DP padding uses an RNG to generate dummies (which I believe it does), it could be worth replacing that with `crate::rng`; a sketch of what the swap could look like is below. This does not explain why malicious shuffle breaks things, though.
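For illustration, a minimal sketch of such a cfg-gated re-export, assuming a `shuttle` feature flag like the one the test command above enables; the module layout is hypothetical, not the actual ipa-core code.

```rust
// Hypothetical `rng` module: under shuttle, re-export shuttle's mock rng so
// the scheduler controls (and can replay) the randomness; otherwise use rand.
#[cfg(feature = "shuttle")]
pub use shuttle::rand::{thread_rng, Rng};
#[cfg(not(feature = "shuttle"))]
pub use rand::{thread_rng, Rng};
```

A DP-padding call site would then draw its dummies through the re-export (`crate::rng::thread_rng()`) rather than calling `rand::thread_rng()` directly.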
We can try to reproduce it with this seed: `4039690703696284216` (https://github.com/private-attribution/ipa/actions/runs/11018449181/job/30598750956?pr=1307).
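For what it's worth, a rough sketch of pinning a shuttle run to that seed; this assumes shuttle's `RandomScheduler::new_from_seed` and `Runner` APIs (both assumptions on my part), with the test body as a stand-in.

```rust
use shuttle::scheduler::RandomScheduler;
use shuttle::{Config, Runner};

#[test]
fn replay_failing_seed() {
    // One iteration from the reported seed, so the explored interleaving
    // should match the failing CI run.
    let scheduler = RandomScheduler::new_from_seed(4039690703696284216, 1);
    Runner::new(scheduler, Config::new()).run(|| {
        // failing test body goes here
    });
}
```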