private-attribution / ipa

A raw implementation of Interoperable Private Attribution

IPA stalls when running OPRF protocol over 50,000 records #1085

Open · andyleiserson opened this issue 1 month ago

andyleiserson commented 1 month ago

Steps to reproduce:

  1. Change the INPUT_SIZE definition in tests/common/mod.rs to 50000 (see the sketch below).
  2. Run cargo test --release --no-default-features --features="cli test-fixture real-world-infra web-app disable-metrics multi-threading stall-detection compact-gate" -- https_semi_honest --nocapture.
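For step 1, the change is just bumping a constant. A minimal sketch, assuming INPUT_SIZE is a plain const in tests/common/mod.rs (the actual definition may differ slightly):

```rust
// Sketch of step 1; the actual definition in tests/common/mod.rs may differ.
const INPUT_SIZE: usize = 50_000;
```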

Sample stall-detector output (this is from a run in @eriktaubeneck's infrastructure):

Helper 3: {
Helper 3: "step=protocol/convert_input_rows_to_prf/eval_prf/reveal_r", from=H2. Waiting to receive records ["[0..765]"].
Helper 3: }
Helper 3: 2024-05-16T01:11:27.337816Z WARN stall_detector{role=H3}: ipa_core::helpers::gateway::stall_detection::gateway: Helper is stalled sn=1130548 state=
Helper 2: {
Helper 2: "step=protocol/sort_by_timestamp/quicksort_pass1/compare/bit0", from=H1. Waiting to receive records ["[0..156]"].
Helper 2: }
Helper 2: 2024-05-16T01:11:27.339636Z WARN stall_detector{role=H2}: ipa_core::helpers::gateway::stall_detection::gateway: Helper is stalled sn=1181394 state=
Helper 1: {
Helper 1: "step=protocol/convert_input_rows_to_prf/eval_prf/revealz", from=H3. Waiting to receive records ["[0..765]"].
Helper 1: }
Helper 1: 2024-05-16T01:11:27.337394Z WARN stall_detector{role=H1}: ipa_core::helpers::gateway::stall_detection::gateway: Helper is stalled sn=1230176 state=
Helper 2: {
Helper 2: "step=protocol/sort_by_timestamp/quicksort_pass1/compare/bit0", from=H1. Waiting to receive records ["[0..156]"].
Helper 2: }
Helper 2: 2024-05-16T01:11:57.342580Z WARN stall_detector{role=H2}: ipa_core::helpers::gateway::stall_detection::gateway: Helper is stalled sn=1181394 state=
Helper 1: {
Helper 1: "step=protocol/convert_input_rows_to_prf/eval_prf/revealz", from=H3. Waiting to receive records ["[0..765]"].
Helper 1: }
Helper 1: 2024-05-16T01:11:57.340488Z WARN stall_detector{role=H1}: ipa_core::helpers::gateway::stall_detection::gateway: Helper is stalled sn=1230176 state=
Helper 3: {
Helper 3: "step=protocol/convert_input_rows_to_prf/eval_prf/reveal_r", from=H2. Waiting to receive records ["[0..765]"].
Helper 3: }

Test command for running the oneshot bench with in-memory infra, which does not hang: cargo bench --bench oneshot_ipa --features="enable-benches disable-metrics multi-threading compact-gate" -- -n 50000 -c 8 -u 10 -a 1024 -j 8.

andyleiserson commented 1 month ago

Immediate workaround: #1073
Alternate workaround: #1087

Open items:

akoshelev commented 1 month ago

I am able to reproduce this issue (or a similar one, since I am seeing the protocol get stuck at slightly different steps), but it takes more than one run to get a stall. The logs grow big and I am only interested in the last failed run, so I am using the following command to trim them:

for IDENTITY in {1..3}; do tail -r /tmp/h$IDENTITY.log | awk '{print} /ipa_core::query::runner::oprf_ipa: new/{exit}' | tail -r > /tmp/hl$IDENTITY.log; done

Basically this reverses the log file, prints lines until it hits the marker of a new query (oprf_ipa: new), then reverses the output again, leaving only the last run.

Note that on macOS the default awk is too old; I am using the one installed via Homebrew.
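If the BSD vs. GNU tail/awk differences get annoying, here is a small standalone Rust equivalent (purely illustrative, not part of the repo) that keeps everything from the last occurrence of the new-query marker to the end of each helper log:

```rust
// Illustrative stand-in for the shell pipeline above (not part of the repo):
// for each helper log, keep only the lines from the last occurrence of the
// new-query marker to the end of the file.
use std::fs;

fn main() -> std::io::Result<()> {
    for identity in 1..=3 {
        let contents = fs::read_to_string(format!("/tmp/h{identity}.log"))?;
        // Position of the last marker, then back up to the start of that line.
        let start = contents
            .rfind("ipa_core::query::runner::oprf_ipa: new")
            .map(|pos| contents[..pos].rfind('\n').map_or(0, |nl| nl + 1))
            .unwrap_or(0);
        fs::write(format!("/tmp/hl{identity}.log"), &contents[start..])?;
    }
    Ok(())
}
```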

akoshelev commented 1 month ago

I suspect a flow control issue rather than a protocol bug. Here is some evidence from the most recent stall I observed that backs this claim.

H3 is stuck waiting for data from H2

2024-05-27T17:52:07.525291Z  WARN :stall_detector{role=H3}: ipa_core::helpers::gateway::stall_detection::gateway: Helper is stalled sn=1152528 state=
{
"gate=/ipa_prf/eval_prf/mult_mask_with_p_r_f_input", from=H2. Waiting to receive records ["[0..781]"].
}

The application layer on H2 registered a send event

2024-05-27T17:47:59.392227Z TRACE send_stream{to=H3 gate=gate=/ipa_prf/eval_prf/mult_mask_with_p_r_f_input}: ipa_core::helpers::gateway::send: new
2024-05-27T17:47:59.392642Z TRACE send_stream{to=H3 gate=gate=/ipa_prf/eval_prf/mult_mask_with_p_r_f_input}: ipa_core::helpers::buffers::ordering_sender: Sending next 1601536 bytes. next = 783. stream closed = true, alloc = 0x3a853a400
2024-05-27T17:47:59.392645Z TRACE send_stream{to=H3 gate=gate=/ipa_prf/eval_prf/mult_mask_with_p_r_f_input}: ipa_core::helpers::gateway::send: close time.busy=409µs time.idle=8.62µ

The size of this send is large and matches @andyleiserson's estimate in #1073: 1601536 bytes (~1.6 MB). H2 did everything right.

Right after this send there was another one of the same size:

2024-05-27T17:47:59.582779Z TRACE send_stream{to=H3 gate=gate=/ipa_prf/eval_prf/revealz}: ipa_core::helpers::buffers::ordering_sender: Sending next 1601536 bytes. next = 783. stream closed = true, alloc = 0x3a9b95500
2024-05-27T17:47:59.582781Z TRACE send_stream{to=H3 gate=gate=/ipa_prf/eval_prf/revealz}: ipa_core::helpers::gateway::send: close time.busy=514µs time.idle=1.46µs

On H3, the UnorderedReceiver is actually polling the underlying stream, but it gets back very small chunks:

2024-05-27T17:47:59.393886Z TRACE oprf_ipa_query{sz=50000}:compute_prf_for_inputs:receive{i=0 from=H2 gate="/ipa_prf/eval_prf/mult_mask_with_p_r_f_input"}: ipa_core::helpers::buffers::unordered_receiver: received next chunk: 1
2024-05-27T17:47:59.393886Z TRACE oprf_ipa_query{sz=50000}:compute_prf_for_inputs:receive{i=0 from=H2 gate="/ipa_prf/eval_prf/mult_mask_with_p_r_f_input"}: ipa_core::helpers::buffers::unordered_receiver: received next chunk: 1
2024-05-27T17:47:59.394255Z TRACE oprf_ipa_query{sz=50000}:compute_prf_for_inputs:receive{i=0 from=H2 gate="/ipa_prf/eval_prf/mult_mask_with_p_r_f_input"}: ipa_core::helpers::buffers::unordered_receiver: received next chunk: 1
2024-05-27T17:47:59.394255Z TRACE oprf_ipa_query{sz=50000}:compute_prf_for_inputs:receive{i=0 from=H2 gate="/ipa_prf/eval_prf/mult_mask_with_p_r_f_input"}: ipa_core::helpers::buffers::unordered_receiver: received next chunk: 1

Some other things that I noticed

Next, I want to look at the HTTP/2 frames to confirm this. I suspect we are observing another variation of the seq_join parallelism problem.

akoshelev commented 1 month ago

More evidence supporting the theory above: I added some logs to the h2 crate, and they indicate that it ran out of connection capacity. (In HTTP/2, every stream on a connection shares a single connection-level flow-control window in addition to its per-stream window, and that shared window is only replenished when the peer reads data and sends WINDOW_UPDATE frames.)

2024-05-27T21:35:25.869005Z TRACE send_stream{to=H1 gate=gate=/ipa_prf/eval_prf/mult_mask_with_p_r_f_input}: ipa_core::helpers::buffers::ordering_sender: Sending next 1601536 bytes. next = 783. stream closed = true, alloc = 0x3b9f3a140
2024-05-27T21:35:25.869012Z TRACE send_stream{to=H1 gate=gate=/ipa_prf/eval_prf/mult_mask_with_p_r_f_input}: ipa_core::helpers::gateway::send: close time.busy=584µs time.idle=4.75µs
2024-05-27T21:35:25.869017Z  WARN h2::proto::streams::prioritize: ipafix: stream requires extra capacity 1601536, but connection window does not have enough 823360
2024-05-27T21:35:25.869021Z  WARN h2::proto::streams::prioritize: ipafix: stream requires extra capacity 1601537, but connection window does not have enough 823360

I am also not seeing any activity from Hyper on the sender (client) after that; I suspect backpressure is the culprit.

I am trying to figure out the window size at the moment the protocol stalls in order to confirm this theory.

akoshelev commented 1 month ago

I can confirm that the root cause of this issue is connection-level flow control in HTTP/2. I ran a test with it disabled and left it running overnight: no failures reported.
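For context, here is a minimal sketch of the kind of knob involved, assuming a bare hyper 0.14 server rather than the actual IPA helper setup (which configures its HTTP stack elsewhere and may differ). The window the sender exhausts is the connection-level window advertised by the receiving peer, so raising it to the HTTP/2 maximum on the receiver effectively disables connection-level flow control; hyper's client builder exposes matching http2_initial_* settings for response bodies.

```rust
// Sketch only (assumes hyper 0.14 with the "server", "http2" and "tcp"
// features); not the actual IPA configuration. The receiver advertises these
// windows to the sending peer, so raising the connection-level window to the
// HTTP/2 maximum (2^31 - 1) removes the shared per-connection bottleneck.
use std::convert::Infallible;

use hyper::{
    service::{make_service_fn, service_fn},
    Body, Request, Response, Server,
};

async fn run_receiver() -> Result<(), hyper::Error> {
    let make_svc = make_service_fn(|_conn| async {
        Ok::<_, Infallible>(service_fn(|_req: Request<Body>| async {
            Ok::<_, Infallible>(Response::new(Body::empty()))
        }))
    });

    Server::bind(&([127, 0, 0, 1], 3000).into())
        .http2_only(true)
        // Per-stream window: big enough for a single ~1.6 MB send.
        .http2_initial_stream_window_size(4 * 1024 * 1024)
        // Connection-level window shared by all streams; this is the budget
        // the h2 log above shows being exhausted (823360 available vs
        // 1601536 requested).
        .http2_initial_connection_window_size((1u32 << 31) - 1)
        .serve(make_svc)
        .await
}

#[tokio::main]
async fn main() -> Result<(), hyper::Error> {
    run_receiver().await
}
```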