private-attribution / ipa

A raw implementation of Interoperable Private Attribution
MIT License
41 stars 23 forks source link

Cross Cloud Stall #1243

Open eriktaubeneck opened 2 weeks ago

eriktaubeneck commented 2 weeks ago

As we've begun testing with external helper parties, we've been able to run tests across AWS, GCloud, and Azure. In running these tests, we've run into a stall in certain circumstances.

To further complicate the matter, I set up a free-tier Azure account to test with, and even with those limited resources, I get a successful run all the way up to 1M input rows. It seems the issue has something to do with the Helper Party's network settings, and I'm working with them to troubleshoot what setting that might be.

However, it still seems odd to me that with whatever settings they have, we successfully run small queries (1k, 10k, 100k) but stall reliably at 1M. Here are the logs from two such stalls:

Test 1

Helper 1 (AWS)

2024-08-28T21:02:45.226772Z  INFO oprf_ipa_query{sz=1000000}: ipa_core::query::runner::oprf_ipa: new
2024-08-28T21:02:45.226789Z  INFO oprf_ipa_query{sz=1000000}: ipa_core::query::runner::oprf_ipa: New query: IpaQueryConfig { per_user_credit_cap: 16, max_breakdown_key: 32, attribution_window_seconds: None, num_multi_bits: 3, with_dp: 1, epsilon: 5.0, plaintext_match_keys: true }
2024-08-28T21:02:52.779813Z  INFO oprf_ipa_query{sz=1000000}:shuffle_inputs: ipa_core::protocol::ipa_prf::shuffle: new
2024-08-28T21:03:08.147686Z  INFO oprf_ipa_query{sz=1000000}:shuffle_inputs: ipa_core::protocol::ipa_prf::shuffle: close time.busy=12.3s time.idle=3.12s
2024-08-28T21:03:08.147713Z  INFO oprf_ipa_query{sz=1000000}:compute_prf_for_inputs: ipa_core::protocol::ipa_prf: new
2024-08-28T21:04:19.714682Z  INFO oprf_ipa_query{sz=1000000}:compute_prf_for_inputs: ipa_core::protocol::ipa_prf: close time.busy=47.8s time.idle=23.8s
2024-08-28T21:04:19.797130Z  INFO oprf_ipa_query{sz=1000000}:histograms_ranges_sortkeys: ipa_core::protocol::ipa_prf::prf_sharding: new
2024-08-28T21:04:19.960382Z  INFO oprf_ipa_query{sz=1000000}:histograms_ranges_sortkeys: ipa_core::protocol::ipa_prf::prf_sharding: close time.busy=163ms time.idle=23.4µs
2024-08-28T21:04:30.875851Z  INFO oprf_ipa_query{sz=1000000}:attribute_cap_aggregate: ipa_core::protocol::ipa_prf::prf_sharding: new
2024-08-28T21:04:30.924675Z  INFO oprf_ipa_query{sz=1000000}:attribute_cap_aggregate:attribute_cap{unique_match_keys=163507}: ipa_core::protocol::ipa_prf::prf_sharding: new
2024-08-28T21:05:15.220022Z  WARN stall_detector{role=H1}: ipa_core::helpers::gateway::stall_detection::gateway: Helper is stalled sn=16084680 state=
{
    "gate=/ipa_prf/attribution/binary_validator/row01/ever_encountered_source_event\", from=H2. Waiting to receive records [\"[32768..65535]\"].
    "gate=/ipa_prf/attribution/binary_validator/row01/attributed_breakdown_key\", from=H2. Waiting to receive records [\"[32768..65535]\"].
}
2024-08-28T21:05:45.281283Z  WARN stall_detector{role=H1}: ipa_core::helpers::gateway::stall_detection::gateway: Helper is stalled sn=16084680 state=
{
    "gate=/ipa_prf/attribution/binary_validator/row01/ever_encountered_source_event\", from=H2. Waiting to receive records [\"[32768..65535]\"].
    "gate=/ipa_prf/attribution/binary_validator/row01/attributed_breakdown_key\", from=H2. Waiting to receive records [\"[32768..65535]\"].
}
2024-08-28T21:06:15.342024Z  WARN stall_detector{role=H1}: ipa_core::helpers::gateway::stall_detection::gateway: Helper is stalled sn=16084680 state=
{
    "gate=/ipa_prf/attribution/binary_validator/row01/ever_encountered_source_event\", from=H2. Waiting to receive records [\"[32768..65535]\"].
    "gate=/ipa_prf/attribution/binary_validator/row01/attributed_breakdown_key\", from=H2. Waiting to receive records [\"[32768..65535]\"].
}

Helper 2 (Azure)

2024-08-28T21:02:45.213661Z  INFO oprf_ipa_query{sz=1000000}: ipa_core::query::runner::oprf_ipa: New query: IpaQueryConfig { per_user_credit_cap: 16, max_breakdown_key: 32, attribution_window_seconds: None, num_multi_bits: 3, with_dp: 1, epsilon: 5.0, plaintext_match_keys: true }
2024-08-28T21:02:52.262716Z  INFO oprf_ipa_query{sz=1000000}:shuffle_inputs: ipa_core::protocol::ipa_prf::shuffle: new
2024-08-28T21:03:13.558012Z  INFO oprf_ipa_query{sz=1000000}:shuffle_inputs: ipa_core::protocol::ipa_prf::shuffle: close time.busy=19.9s time.idle=1.37s
2024-08-28T21:03:13.558072Z  INFO oprf_ipa_query{sz=1000000}:compute_prf_for_inputs: ipa_core::protocol::ipa_prf: new
2024-08-28T21:04:29.303312Z  INFO oprf_ipa_query{sz=1000000}:compute_prf_for_inputs: ipa_core::protocol::ipa_prf: close time.busy=61.2s time.idle=14.5s
2024-08-28T21:04:29.425847Z  INFO oprf_ipa_query{sz=1000000}:histograms_ranges_sortkeys: ipa_core::protocol::ipa_prf::prf_sharding: new
2024-08-28T21:04:29.683701Z  INFO oprf_ipa_query{sz=1000000}:histograms_ranges_sortkeys: ipa_core::protocol::ipa_prf::prf_sharding: close time.busy=258ms time.idle=48.4µs
2024-08-28T21:04:30.873737Z  INFO oprf_ipa_query{sz=1000000}:attribute_cap_aggregate: ipa_core::protocol::ipa_prf::prf_sharding: new
2024-08-28T21:04:30.946410Z  INFO oprf_ipa_query{sz=1000000}:attribute_cap_aggregate:attribute_cap{unique_match_keys=163507}: ipa_core::protocol::ipa_prf::prf_sharding: new
2024-08-28T21:04:30.946450Z  INFO oprf_ipa_query{sz=1000000}:attribute_cap_aggregate:attribute_cap{unique_match_keys=163507}: ipa_core::protocol::ipa_prf::prf_sharding: close time.busy=3.60µs time.idle=38.0µs
2024-08-28T21:04:30.946484Z  INFO oprf_ipa_query{sz=1000000}:attribute_cap_aggregate:aggregate{streams=818287}: ipa_core::protocol::ipa_prf::aggregation: new
2024-08-28T21:05:15.302007Z  WARN stall_detector{role=H2}: ipa_core::helpers::gateway::stall_detection::gateway: Helper is stalled sn=17953546 state=
{
    "gate=/ipa_prf/attribution/binary_validator/row01/ever_encountered_source_event", from=H3. Waiting to receive records ["[32768..32795]"].
    "gate=/ipa_prf/attribution/binary_validator/row01/attributed_breakdown_key", from=H3. Waiting to receive records ["[32768..32795]"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b001/bit0", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b001/bit1", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b001/bit2", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b002/bit0", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b002/bit1", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b002/bit2", from=H3. Waiting to receive records ["0"].
}
2024-08-28T21:05:45.388936Z  WARN stall_detector{role=H2}: ipa_core::helpers::gateway::stall_detection::gateway: Helper is stalled sn=17953546 state=
{
    "gate=/ipa_prf/attribution/binary_validator/row01/ever_encountered_source_event", from=H3. Waiting to receive records ["[32768..32795]"].
    "gate=/ipa_prf/attribution/binary_validator/row01/attributed_breakdown_key", from=H3. Waiting to receive records ["[32768..32795]"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b001/bit0", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b001/bit1", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b001/bit2", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b002/bit0", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b002/bit1", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b002/bit2", from=H3. Waiting to receive records ["0"].
}
2024-08-28T21:06:15.474471Z  WARN stall_detector{role=H2}: ipa_core::helpers::gateway::stall_detection::gateway: Helper is stalled sn=17953546 state=
{
    "gate=/ipa_prf/attribution/binary_validator/row01/ever_encountered_source_event", from=H3. Waiting to receive records ["[32768..32795]"].
    "gate=/ipa_prf/attribution/binary_validator/row01/attributed_breakdown_key", from=H3. Waiting to receive records ["[32768..32795]"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b001/bit0", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b001/bit1", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b001/bit2", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b002/bit0", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b002/bit1", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b002/bit2", from=H3. Waiting to receive records ["0"].
}
2024-08-28T21:06:45.562160Z  WARN stall_detector{role=H2}: ipa_core::helpers::gateway::stall_detection::gateway: Helper is stalled sn=17953546 state=
{
    "gate=/ipa_prf/attribution/binary_validator/row01/ever_encountered_source_event", from=H3. Waiting to receive records ["[32768..32795]"].
    "gate=/ipa_prf/attribution/binary_validator/row01/attributed_breakdown_key", from=H3. Waiting to receive records ["[32768..32795]"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b001/bit0", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b001/bit1", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b001/bit2", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b002/bit0", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b002/bit1", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b002/bit2", from=H3. Waiting to receive records ["0"].
}
2024-08-28T21:07:15.647331Z  WARN stall_detector{role=H2}: ipa_core::helpers::gateway::stall_detection::gateway: Helper is stalled sn=17953546 state=
{
    "gate=/ipa_prf/attribution/binary_validator/row01/ever_encountered_source_event", from=H3. Waiting to receive records ["[32768..32795]"].
    "gate=/ipa_prf/attribution/binary_validator/row01/attributed_breakdown_key", from=H3. Waiting to receive records ["[32768..32795]"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b001/bit0", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b001/bit1", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b001/bit2", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b002/bit0", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b002/bit1", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b002/bit2", from=H3. Waiting to receive records ["0"].
}
2024-08-28T21:07:45.733160Z  WARN stall_detector{role=H2}: ipa_core::helpers::gateway::stall_detection::gateway: Helper is stalled sn=17953546 state=
{
    "gate=/ipa_prf/attribution/binary_validator/row01/ever_encountered_source_event", from=H3. Waiting to receive records ["[32768..32795]"].
    "gate=/ipa_prf/attribution/binary_validator/row01/attributed_breakdown_key", from=H3. Waiting to receive records ["[32768..32795]"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b001/bit0", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b001/bit1", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b001/bit2", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b002/bit0", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b002/bit1", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b002/bit2", from=H3. Waiting to receive records ["0"].
}
2024-08-28T21:08:15.818360Z  WARN stall_detector{role=H2}: ipa_core::helpers::gateway::stall_detection::gateway: Helper is stalled sn=17953546 state=
{
    "gate=/ipa_prf/attribution/binary_validator/row01/ever_encountered_source_event", from=H3. Waiting to receive records ["[32768..32795]"].
    "gate=/ipa_prf/attribution/binary_validator/row01/attributed_breakdown_key", from=H3. Waiting to receive records ["[32768..32795]"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b001/bit0", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b001/bit1", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b001/bit2", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b002/bit0", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b002/bit1", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b002/bit2", from=H3. Waiting to receive records ["0"].
}

Helper 3 (GCloud)

2024-08-28T21:02:45.285997Z  INFO oprf_ipa_query{sz=1000000}: ipa_core::query::runner::oprf_ipa: new
2024-08-28T21:02:45.286015Z  INFO oprf_ipa_query{sz=1000000}: ipa_core::query::runner::oprf_ipa: New query: IpaQueryConfig { per_user_credit_cap: 16, max_breakdown_key: 32, attribution_window_seconds: None, num_multi_bits: 3, with_dp: 1, epsilon: 5.0, plaintext_match_keys: true }
2024-08-28T21:02:52.220978Z  INFO oprf_ipa_query{sz=1000000}:shuffle_inputs: ipa_core::protocol::ipa_prf::shuffle: new
2024-08-28T21:03:12.746668Z  INFO oprf_ipa_query{sz=1000000}:shuffle_inputs: ipa_core::protocol::ipa_prf::shuffle: close time.busy=8.69s time.idle=11.8s
2024-08-28T21:03:12.746705Z  INFO oprf_ipa_query{sz=1000000}:compute_prf_for_inputs: ipa_core::protocol::ipa_prf: new
2024-08-28T21:04:19.697725Z  INFO oprf_ipa_query{sz=1000000}:compute_prf_for_inputs: ipa_core::protocol::ipa_prf: close time.busy=46.5s time.idle=20.4s
2024-08-28T21:04:19.762973Z  INFO oprf_ipa_query{sz=1000000}:histograms_ranges_sortkeys: ipa_core::protocol::ipa_prf::prf_sharding: new
2024-08-28T21:04:19.904475Z  INFO oprf_ipa_query{sz=1000000}:histograms_ranges_sortkeys: ipa_core::protocol::ipa_prf::prf_sharding: close time.busy=141ms time.idle=24.6µs
2024-08-28T21:04:30.875406Z  INFO oprf_ipa_query{sz=1000000}:attribute_cap_aggregate: ipa_core::protocol::ipa_prf::prf_sharding: new
2024-08-28T21:04:30.916862Z  INFO oprf_ipa_query{sz=1000000}:attribute_cap_aggregate:attribute_cap{unique_match_keys=163507}: ipa_core::protocol::ipa_prf::prf_sharding: new
2024-08-28T21:04:30.916876Z  INFO oprf_ipa_query{sz=1000000}:attribute_cap_aggregate:attribute_cap{unique_match_keys=163507}: ipa_core::protocol::ipa_prf::prf_sharding: close time.busy=983ns time.idle=13.9µs
2024-08-28T21:04:30.916888Z  INFO oprf_ipa_query{sz=1000000}:attribute_cap_aggregate:aggregate{streams=818287}: ipa_core::protocol::ipa_prf::aggregation: new
2024-08-28T21:05:15.189472Z  WARN stall_detector{role=H3}: ipa_core::helpers::gateway::stall_detection::gateway: Helper is stalled sn=16953590 state=
{
    "gate=/ipa_prf/attribution/binary_validator/row01/attributed_trigger_value/did_trigger_get_attributed\", from=H1. Waiting to receive records [\"[32768..32795]\"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b000/bit0\", from=H1. Waiting to receive records [\"0\"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b000/bit1\", from=H1. Waiting to receive records [\"0\"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b000/bit2\", from=H1. Waiting to receive records [\"0\"].
}
2024-08-28T21:05:45.238139Z  WARN stall_detector{role=H3}: ipa_core::helpers::gateway::stall_detection::gateway: Helper is stalled sn=16953590 state=
{
    "gate=/ipa_prf/attribution/binary_validator/row01/attributed_trigger_value/did_trigger_get_attributed\", from=H1. Waiting to receive records [\"[32768..32795]\"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b000/bit0\", from=H1. Waiting to receive records [\"0\"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b000/bit1\", from=H1. Waiting to receive records [\"0\"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b000/bit2\", from=H1. Waiting to receive records [\"0\"].
}
2024-08-28T21:06:15.288897Z  WARN stall_detector{role=H3}: ipa_core::helpers::gateway::stall_detection::gateway: Helper is stalled sn=16953590 state=
{
    "gate=/ipa_prf/attribution/binary_validator/row01/attributed_trigger_value/did_trigger_get_attributed\", from=H1. Waiting to receive records [\"[32768..32795]\"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b000/bit0\", from=H1. Waiting to receive records [\"0\"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b000/bit1\", from=H1. Waiting to receive records [\"0\"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b000/bit2\", from=H1. Waiting to receive records [\"0\"].
}

Test 2

Helper 1 (AWS)

2024-08-28T21:30:50.778798Z  INFO oprf_ipa_query{sz=1000000}: ipa_core::query::runner::oprf_ipa: new
2024-08-28T21:30:50.778818Z  INFO oprf_ipa_query{sz=1000000}: ipa_core::query::runner::oprf_ipa: New query: IpaQueryConfig { per_user_credit_cap: 16, max_breakdown_key: 32, attribution_window_seconds: None, num_multi_bits: 3, with_dp: 1, epsilon: 5.0, plaintext_match_keys: true }
2024-08-28T21:30:58.197009Z  INFO oprf_ipa_query{sz=1000000}:shuffle_inputs: ipa_core::protocol::ipa_prf::shuffle: new
2024-08-28T21:31:13.545415Z  INFO oprf_ipa_query{sz=1000000}:shuffle_inputs: ipa_core::protocol::ipa_prf::shuffle: close time.busy=12.2s time.idle=3.17s
2024-08-28T21:31:13.545443Z  INFO oprf_ipa_query{sz=1000000}:compute_prf_for_inputs: ipa_core::protocol::ipa_prf: new
2024-08-28T21:32:25.722318Z  INFO oprf_ipa_query{sz=1000000}:compute_prf_for_inputs: ipa_core::protocol::ipa_prf: close time.busy=48.8s time.idle=23.4s
2024-08-28T21:32:25.804774Z  INFO oprf_ipa_query{sz=1000000}:histograms_ranges_sortkeys: ipa_core::protocol::ipa_prf::prf_sharding: new
2024-08-28T21:32:25.968072Z  INFO oprf_ipa_query{sz=1000000}:histograms_ranges_sortkeys: ipa_core::protocol::ipa_prf::prf_sharding: close time.busy=163ms time.idle=20.5µs
2024-08-28T21:32:27.321374Z  INFO oprf_ipa_query{sz=1000000}:attribute_cap_aggregate: ipa_core::protocol::ipa_prf::prf_sharding: new
2024-08-28T21:32:27.363165Z  INFO oprf_ipa_query{sz=1000000}:attribute_cap_aggregate:attribute_cap{unique_match_keys=163507}: ipa_core::protocol::ipa_prf::prf_sharding: new
2024-08-28T21:33:20.804202Z  WARN stall_detector{role=H1}: ipa_core::helpers::gateway::stall_detection::gateway: Helper is stalled sn=16085142 state=
{
    \"gate=/ipa_prf/attribution/binary_validator/row01/ever_encountered_source_event\", from=H2. Waiting to receive records [\"[32768..65535]\"].
    \"gate=/ipa_prf/attribution/binary_validator/row01/attributed_breakdown_key\", from=H2. Waiting to receive records [\"[32768..65535]\"].
}
2024-08-28T21:33:50.864772Z  WARN stall_detector{role=H1}: ipa_core::helpers::gateway::stall_detection::gateway: Helper is stalled sn=16085142 state=
{
    \"gate=/ipa_prf/attribution/binary_validator/row01/ever_encountered_source_event\", from=H2. Waiting to receive records [\"[32768..65535]\"].
    \"gate=/ipa_prf/attribution/binary_validator/row01/attributed_breakdown_key\", from=H2. Waiting to receive records [\"[32768..65535]\"].
}
2024-08-28T21:34:20.924839Z  WARN stall_detector{role=H1}: ipa_core::helpers::gateway::stall_detection::gateway: Helper is stalled sn=16085142 state=
{
    \"gate=/ipa_prf/attribution/binary_validator/row01/ever_encountered_source_event\", from=H2. Waiting to receive records [\"[32768..65535]\"].
    \"gate=/ipa_prf/attribution/binary_validator/row01/attributed_breakdown_key\", from=H2. Waiting to receive records [\"[32768..65535]\"].
}
2024-08-28T21:34:50.987202Z  WARN stall_detector{role=H1}: ipa_core::helpers::gateway::stall_detection::gateway: Helper is stalled sn=16085142 state=
{
    \"gate=/ipa_prf/attribution/binary_validator/row01/ever_encountered_source_event\", from=H2. Waiting to receive records [\"[32768..65535]\"].
    \"gate=/ipa_prf/attribution/binary_validator/row01/attributed_breakdown_key\", from=H2. Waiting to receive records [\"[32768..65535]\"].
}
2024-08-28T21:35:21.047042Z  WARN stall_detector{role=H1}: ipa_core::helpers::gateway::stall_detection::gateway: Helper is stalled sn=16085142 state=
{
    \"gate=/ipa_prf/attribution/binary_validator/row01/ever_encountered_source_event\", from=H2. Waiting to receive records [\"[32768..65535]\"].
    \"gate=/ipa_prf/attribution/binary_validator/row01/attributed_breakdown_key\", from=H2. Waiting to receive records [\"[32768..65535]\"].
}

Helper 2 (Azure)

2024-08-28T21:30:50.764504Z  INFO oprf_ipa_query{sz=1000000}: ipa_core::query::runner::oprf_ipa: new
2024-08-28T21:30:50.764565Z  INFO oprf_ipa_query{sz=1000000}: ipa_core::query::runner::oprf_ipa: New query: IpaQueryConfig { per_user_credit_cap: 16, max_breakdown_key: 32, attribution_window_seconds: None, num_multi_bits: 3, with_dp: 1, epsilon: 5.0, plaintext_match_keys: true }
2024-08-28T21:30:57.809037Z  INFO oprf_ipa_query{sz=1000000}:shuffle_inputs: ipa_core::protocol::ipa_prf::shuffle: new
2024-08-28T21:31:18.928310Z  INFO oprf_ipa_query{sz=1000000}:shuffle_inputs: ipa_core::protocol::ipa_prf::shuffle: close time.busy=19.7s time.idle=1.41s
2024-08-28T21:31:18.928366Z  INFO oprf_ipa_query{sz=1000000}:compute_prf_for_inputs: ipa_core::protocol::ipa_prf: new
2024-08-28T21:32:25.724688Z  INFO oprf_ipa_query{sz=1000000}:compute_prf_for_inputs: ipa_core::protocol::ipa_prf: close time.busy=55.3s time.idle=11.5s
2024-08-28T21:32:25.851633Z  INFO oprf_ipa_query{sz=1000000}:histograms_ranges_sortkeys: ipa_core::protocol::ipa_prf::prf_sharding: new
2024-08-28T21:32:26.111279Z  INFO oprf_ipa_query{sz=1000000}:histograms_ranges_sortkeys: ipa_core::protocol::ipa_prf::prf_sharding: close time.busy=260ms time.idle=46.1µs
2024-08-28T21:32:27.319176Z  INFO oprf_ipa_query{sz=1000000}:attribute_cap_aggregate: ipa_core::protocol::ipa_prf::prf_sharding: new
2024-08-28T21:32:27.391233Z  INFO oprf_ipa_query{sz=1000000}:attribute_cap_aggregate:attribute_cap{unique_match_keys=163507}: ipa_core::protocol::ipa_prf::prf_sharding: new
2024-08-28T21:32:27.391280Z  INFO oprf_ipa_query{sz=1000000}:attribute_cap_aggregate:attribute_cap{unique_match_keys=163507}: ipa_core::protocol::ipa_prf::prf_sharding: close time.busy=2.70µs time.idle=45.7µs
2024-08-28T21:32:27.391298Z  INFO oprf_ipa_query{sz=1000000}:attribute_cap_aggregate:aggregate{streams=818287}: ipa_core::protocol::ipa_prf::aggregation: new
2024-08-28T21:33:20.850641Z  WARN stall_detector{role=H2}: ipa_core::helpers::gateway::stall_detection::gateway: Helper is stalled sn=17954008 state=
{
    "gate=/ipa_prf/attribution/binary_validator/row01/ever_encountered_source_event", from=H3. Waiting to receive records ["[32768..32795]"].
    "gate=/ipa_prf/attribution/binary_validator/row01/attributed_breakdown_key", from=H3. Waiting to receive records ["[32768..32795]"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b001/bit0", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b001/bit1", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b001/bit2", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b002/bit0", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b002/bit1", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b002/bit2", from=H3. Waiting to receive records ["0"].
}
2024-08-28T21:33:50.940075Z  WARN stall_detector{role=H2}: ipa_core::helpers::gateway::stall_detection::gateway: Helper is stalled sn=17954008 state=
{
    "gate=/ipa_prf/attribution/binary_validator/row01/ever_encountered_source_event", from=H3. Waiting to receive records ["[32768..32795]"].
    "gate=/ipa_prf/attribution/binary_validator/row01/attributed_breakdown_key", from=H3. Waiting to receive records ["[32768..32795]"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b001/bit0", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b001/bit1", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b001/bit2", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b002/bit0", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b002/bit1", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b002/bit2", from=H3. Waiting to receive records ["0"].
}
2024-08-28T21:34:21.027969Z  WARN stall_detector{role=H2}: ipa_core::helpers::gateway::stall_detection::gateway: Helper is stalled sn=17954008 state=
{
    "gate=/ipa_prf/attribution/binary_validator/row01/ever_encountered_source_event", from=H3. Waiting to receive records ["[32768..32795]"].
    "gate=/ipa_prf/attribution/binary_validator/row01/attributed_breakdown_key", from=H3. Waiting to receive records ["[32768..32795]"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b001/bit0", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b001/bit1", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b001/bit2", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b002/bit0", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b002/bit1", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b002/bit2", from=H3. Waiting to receive records ["0"].
}
2024-08-28T21:34:51.118030Z  WARN stall_detector{role=H2}: ipa_core::helpers::gateway::stall_detection::gateway: Helper is stalled sn=17954008 state=
{
    "gate=/ipa_prf/attribution/binary_validator/row01/ever_encountered_source_event", from=H3. Waiting to receive records ["[32768..32795]"].
    "gate=/ipa_prf/attribution/binary_validator/row01/attributed_breakdown_key", from=H3. Waiting to receive records ["[32768..32795]"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b001/bit0", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b001/bit1", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b001/bit2", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b002/bit0", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b002/bit1", from=H3. Waiting to receive records ["0"].
    "gate=/ipa_prf/attribution/aggregate/move_to_bucket/b002/bit2", from=H3. Waiting to receive records ["0"].
}

(This last message repeats a until we stop it. this is truncated to avoid redundancy.)

Helper 3 (GCloud)

2024-08-28T21:28:33.690840Z  INFO ipa_core::cli::verbosity: Logging setup at level info
2024-08-28T21:28:33.692540Z  INFO ipa_core::net::server: server listening on https://0.0.0.0:443
2024-08-28T21:30:50.731376Z  INFO stall_detector{role=H3}: ipa_core::helpers::gateway::stall_detection::gateway: new
2024-08-28T21:30:50.885449Z  INFO oprf_ipa_query{sz=1000000}: ipa_core::query::runner::oprf_ipa: new
2024-08-28T21:30:50.885471Z  INFO oprf_ipa_query{sz=1000000}: ipa_core::query::runner::oprf_ipa: New query: IpaQueryConfig { per_user_credit_cap: 16, max_breakdown_key: 32, attribution_window_seconds: None, num_multi_bits: 3, with_dp: 1, epsilon: 5.0, plaintext_match_keys: true }
2024-08-28T21:30:57.808309Z  INFO oprf_ipa_query{sz=1000000}:shuffle_inputs: ipa_core::protocol::ipa_prf::shuffle: new
2024-08-28T21:31:18.152382Z  INFO oprf_ipa_query{sz=1000000}:shuffle_inputs: ipa_core::protocol::ipa_prf::shuffle: close time.busy=8.74s time.idle=11.6s
2024-08-28T21:31:18.152418Z  INFO oprf_ipa_query{sz=1000000}:compute_prf_for_inputs: ipa_core::protocol::ipa_prf: new
2024-08-28T21:31:20.900692Z ERROR ipa_core::net::server: connection error: Connection reset by peer (os error 104)
2024-08-28T21:31:37.626172Z ERROR ipa_core::net::server: connection error: peer is incompatible: SignatureAlgorithmsExtensionRequired
2024-08-28T21:31:39.299970Z ERROR ipa_core::net::server: connection error: peer is incompatible: SignatureAlgorithmsExtensionRequired
2024-08-28T21:31:40.809422Z ERROR ipa_core::net::server: connection error: peer is incompatible: SignatureAlgorithmsExtensionRequired
2024-08-28T21:32:25.709410Z  INFO oprf_ipa_query{sz=1000000}:compute_prf_for_inputs: ipa_core::protocol::ipa_prf: close time.busy=50.7s time.idle=16.9s
2024-08-28T21:32:25.774039Z  INFO oprf_ipa_query{sz=1000000}:histograms_ranges_sortkeys: ipa_core::protocol::ipa_prf::prf_sharding: new
2024-08-28T21:32:25.916829Z  INFO oprf_ipa_query{sz=1000000}:histograms_ranges_sortkeys: ipa_core::protocol::ipa_prf::prf_sharding: close time.busy=143ms time.idle=12.5µs
2024-08-28T21:32:27.322502Z  INFO oprf_ipa_query{sz=1000000}:attribute_cap_aggregate: ipa_core::protocol::ipa_prf::prf_sharding: new
2024-08-28T21:32:27.359139Z  INFO oprf_ipa_query{sz=1000000}:attribute_cap_aggregate:attribute_cap{unique_match_keys=163507}: ipa_core::protocol::ipa_prf::prf_sharding: new
2024-08-28T21:32:27.359164Z  INFO oprf_ipa_query{sz=1000000}:attribute_cap_aggregate:attribute_cap{unique_match_keys=163507}: ipa_core::protocol::ipa_prf::prf_sharding: close time.busy=980ns time.
2024-08-28T21:32:27.359172Z  INFO oprf_ipa_query{sz=1000000}:attribute_cap_aggregate:aggregate{streams=818287}: ipa_core::protocol::ipa_prf::aggregation: new
2024-08-28T21:33:20.787787Z  WARN stall_detector{role=H3}: ipa_core::helpers::gateway::stall_detection::gateway: Helper is stalled sn=16954052 state=
{
    \"gate=/ipa_prf/attribution/binary_validator/row01/attributed_trigger_value/did_trigger_get_attributed\", from=H1. Waiting to receive records [\"[32768..32795]\"].
    \"gate=/ipa_prf/attribution/aggregate/move_to_bucket/b000/bit0\", from=H1. Waiting to receive records [\"0\"].
    \"gate=/ipa_prf/attribution/aggregate/move_to_bucket/b000/bit1\", from=H1. Waiting to receive records [\"0\"].
    \"gate=/ipa_prf/attribution/aggregate/move_to_bucket/b000/bit2\", from=H1. Waiting to receive records [\"0\"].
}
2024-08-28T21:33:50.840407Z  WARN stall_detector{role=H3}: ipa_core::helpers::gateway::stall_detection::gateway: Helper is stalled sn=16954052 state=
{
    \"gate=/ipa_prf/attribution/binary_validator/row01/attributed_trigger_value/did_trigger_get_attributed\", from=H1. Waiting to receive records [\"[32768..32795]\"].
    \"gate=/ipa_prf/attribution/aggregate/move_to_bucket/b000/bit0\", from=H1. Waiting to receive records [\"0\"].
    \"gate=/ipa_prf/attribution/aggregate/move_to_bucket/b000/bit1\", from=H1. Waiting to receive records [\"0\"].
    \"gate=/ipa_prf/attribution/aggregate/move_to_bucket/b000/bit2\", from=H1. Waiting to receive records [\"0\"].
}
2024-08-28T21:34:20.888997Z  WARN stall_detector{role=H3}: ipa_core::helpers::gateway::stall_detection::gateway: Helper is stalled sn=16954052 state=
{
    \"gate=/ipa_prf/attribution/binary_validator/row01/attributed_trigger_value/did_trigger_get_attributed\", from=H1. Waiting to receive records [\"[32768..32795]\"].
    \"gate=/ipa_prf/attribution/aggregate/move_to_bucket/b000/bit0\", from=H1. Waiting to receive records [\"0\"].
    \"gate=/ipa_prf/attribution/aggregate/move_to_bucket/b000/bit1\", from=H1. Waiting to receive records [\"0\"].
    \"gate=/ipa_prf/attribution/aggregate/move_to_bucket/b000/bit2\", from=H1. Waiting to receive records [\"0\"].
}
2024-08-28T21:34:50.938871Z  WARN stall_detector{role=H3}: ipa_core::helpers::gateway::stall_detection::gateway: Helper is stalled sn=16954052 state=
{
    \"gate=/ipa_prf/attribution/binary_validator/row01/attributed_trigger_value/did_trigger_get_attributed\", from=H1. Waiting to receive records [\"[32768..32795]\"].
    \"gate=/ipa_prf/attribution/aggregate/move_to_bucket/b000/bit0\", from=H1. Waiting to receive records [\"0\"].
    \"gate=/ipa_prf/attribution/aggregate/move_to_bucket/b000/bit1\", from=H1. Waiting to receive records [\"0\"].
    \"gate=/ipa_prf/attribution/aggregate/move_to_bucket/b000/bit2\", from=H1. Waiting to receive records [\"0\"].
}
2024-08-28T21:35:20.987820Z  WARN stall_detector{role=H3}: ipa_core::helpers::gateway::stall_detection::gateway: Helper is stalled sn=16954052 state=
{
    \"gate=/ipa_prf/attribution/binary_validator/row01/attributed_trigger_value/did_trigger_get_attributed\", from=H1. Waiting to receive records [\"[32768..32795]\"].
    \"gate=/ipa_prf/attribution/aggregate/move_to_bucket/b000/bit0\", from=H1. Waiting to receive records [\"0\"].
    \"gate=/ipa_prf/attribution/aggregate/move_to_bucket/b000/bit1\", from=H1. Waiting to receive records [\"0\"].
    \"gate=/ipa_prf/attribution/aggregate/move_to_bucket/b000/bit2\", from=H1. Waiting to receive records [\"0\"].
}
andyleiserson commented 2 weeks ago

I suspect this may be the same "data in flight exceeds available buffering, leading to a deadlock" behavior as we saw in the past (#1073, #1085, #1104).

In #1245 I add a helper command line option to override the active work. Either using that branch, or using a helper built with a manually adjusted value, and investigating whether reducing active_work resolves the problem would be an interesting data point.

The default active_work is currently 32,768. I would try reducing it all the way to something like 32 or 128. If that does resolve the issue, then we may want to investigate exactly where it becomes problematic.