private-attribution / ipa

A raw implementation of Interoperable Private Attribution
MIT License

Report collector times out talking to MPC ring when it handles large inputs #1120

Closed akoshelev closed 2 months ago

akoshelev commented 2 months ago

When running IPA on inputs with 20M rows or more, the RC (report collector) binary times out executing the query_status command:

 "Failed to get query status from helper 1: Failed to connect to ec2-44-234-150-14.us-west-2.compute.amazonaws.com:443: http2 error: keep-alive timed out"
akoshelev commented 2 months ago

Here is the client's view of the world:

2024-06-07T02:03:39.519115Z  INFO ipa_core::cli::playbook::ipa: submitting inputs to MPC helpers
...
2024-06-07T02:03:41.412501Z  INFO ipa_core::cli::playbook::ipa: Poll helpers for MPC status
2024-06-07T02:03:41.634659Z  INFO ipa_core::cli::playbook::ipa: Poll helpers for MPC status complete
2024-06-07T02:03:43.635482Z  INFO ipa_core::cli::playbook::ipa: Poll helpers for MPC status
2024-06-07T02:05:31.630683Z ERROR ipa_core::error: ThreadId(1) "main" panicked at /home/ec2-user/workspace/alex-ipa/ipa-core/src/cli/playbook/ipa.rs:138:14:
called `Result::unwrap()` on an `Err` value: "Failed to get query status from helper 0: Failed to connect to ec2-44-234-39-30.us-west-2.compute.amazonaws.com:443: http2 error: keep-alive timed out"

This is what server 1 is seeing:

2024-06-07T02:03:39.526867Z  INFO oprf_ipa_query{sz=20000000}: ipa_core::query::runner::oprf_ipa: new
2024-06-07T02:03:39.526893Z  INFO oprf_ipa_query{sz=20000000}: ipa_core::query::runner::oprf_ipa: New query: IpaQueryConfig { per_user_credit_cap: 8, max_breakdown_key: 20, attribution_window_seconds: None, num_multi_bits: 3, plaintext_match_keys: true }
...
2024-06-07T02:03:41.627086Z  INFO ipa_core::query::processor: query_status: waiting for lock
2024-06-07T02:03:41.627104Z  INFO ipa_core::query::processor: query_status: got the lock
2024-06-07T02:03:42.021131Z  INFO oprf_ipa_query{sz=20000000}:shuffle_inputs: ipa_core::protocol::ipa_prf::shuffle: new
2024-06-07T02:08:03.650256Z  INFO oprf_ipa_query{sz=20000000}:shuffle_inputs: ipa_core::protocol::ipa_prf::shuffle: close time.busy=257s time.idle=5.12s

It is pretty evident that H1 receives the status request sent at 02:03:41.412 and responds to it by 02:03:41.63, but the next request, sent at 02:03:43.63, never registers in the application layer. Note that H1 has started the shuffle, and its first step is to perform a large allocation and transform the input.

This would explain the timeout if there were only one thread running the app, but tokio::main uses the multi-threaded runtime by default, so something else is going on here.

akoshelev commented 2 months ago

I think we are starving the Tokio runtime, and that is not a good sign. I've added a simple tokio task that logs every two seconds, like this:

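        // Heartbeat: log every two seconds so that gaps in these messages reveal a blocked runtime.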
        tokio::spawn(async move {
            loop {
                ::tokio::time::sleep(std::time::Duration::from_secs(2)).await;
                tracing::info!("observer works")
            }
        });

Here is what I am seeing on one of the helpers:

2024-06-07T03:46:13.999634Z  INFO ThreadId(02) ipa_core::net::server: observer works
... (every two seconds as expected)
2024-06-07T03:46:50.019413Z  INFO ThreadId(06) ipa_core::net::server: observer works
2024-06-07T03:46:51.069178Z  INFO ThreadId(06) ipa_core::query::processor: receive inputs: starting new query
2024-06-07T03:46:51.069206Z  INFO ThreadId(06) ipa_core::query::processor: receive inputs: query started
2024-06-07T03:46:51.073790Z  INFO ThreadId(30) ipa_core::query::processor: Id(1), query_status: waiting for lock
2024-06-07T03:46:51.073803Z  INFO ThreadId(30) ipa_core::query::processor: query_status: got the lock
2024-06-07T03:46:51.076610Z  INFO ThreadId(31) oprf_ipa_query{sz=10000000}: ipa_core::query::runner::oprf_ipa: new
2024-06-07T03:46:51.076634Z  INFO ThreadId(31) oprf_ipa_query{sz=10000000}: ipa_core::query::runner::oprf_ipa: New query: IpaQueryConfig { per_user_credit_cap: 8, max_breakdown_key: 20, attribution_window_seconds: None, num_multi_bits: 3, plaintext_match_keys: true }
2024-06-07T03:46:51.203231Z  INFO ThreadId(07) ipa_core::query::processor: Id(1), query_status: waiting for lock
2024-06-07T03:46:51.203250Z  INFO ThreadId(07) ipa_core::query::processor: query_status: got the lock
2024-06-07T03:46:51.458375Z  INFO ThreadId(31) ipa_core::query::processor: Id(1), query_status: waiting for lock
2024-06-07T03:46:51.458393Z  INFO ThreadId(31) ipa_core::query::processor: query_status: got the lock
2024-06-07T03:46:51.961352Z  INFO ThreadId(31) ipa_core::query::processor: Id(1), query_status: waiting for lock
2024-06-07T03:46:51.961371Z  INFO ThreadId(31) ipa_core::query::processor: query_status: got the lock
2024-06-07T03:46:52.020290Z  INFO ThreadId(07) ipa_core::net::server: observer works
2024-06-07T03:46:52.422040Z  INFO ThreadId(07) oprf_ipa_query{sz=10000000}:shuffle_inputs: ipa_core::protocol::ipa_prf::shuffle: new
2024-06-07T03:48:08.027648Z  INFO ThreadId(30) ipa_core::net::server: observer works
2024-06-07T03:48:08.027735Z  INFO ThreadId(31) ipa_core::query::processor: Id(1), query_status: waiting for lock
2024-06-07T03:48:08.027744Z  INFO ThreadId(31) ipa_core::query::processor: query_status: got the lock

Between 03:46:52 and 03:48:08, no other tokio tasks ran.

akoshelev commented 2 months ago

https://github.com/tokio-rs/tokio/issues/4730 :(

andyleiserson commented 2 months ago

In this particular case, there are no lifetimes in the long-running task, so it should be pretty easy to use spawn_blocking. And maybe more generally, doing a long-running task implies not communicating, which implies not using the context (which is the lifetime that is hardest to avoid).
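
For reference, a minimal sketch of what that could look like, assuming the heavy step can be factored into a plain function (the names below are hypothetical, not the actual IPA code):

    use tokio::task;

    // Hypothetical stand-in for the allocation/transform step at the start of shuffle
    // that currently blocks a worker thread.
    fn prepare_shuffle_input(rows: Vec<u64>) -> Vec<u64> {
        rows.into_iter().map(|r| r.wrapping_mul(31)).collect()
    }

    // Sketch: run the heavy step on tokio's blocking pool so the async workers stay free
    // to serve hyper requests. Nothing borrowed crosses into the closure, so it can be
    // 'static as spawn_blocking requires.
    async fn shuffle_inputs(rows: Vec<u64>) -> Vec<u64> {
        task::spawn_blocking(move || prepare_shuffle_input(rows))
            .await
            .expect("blocking task panicked")
    }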

akoshelev commented 2 months ago

I considered it, and honestly I am not too happy about it because of all the places where we would need to add something like it. An IO-bound pool executor may be fine here, but in other spots where we could be doing a lot of CPU work (PRSS), it won't be appropriate. There is also block_in_place, which avoids this issue, but I am concerned about it for the same reasons: blocks of infra code tied to tokio will appear in our code for no apparent reason, especially with the multi-threaded runtime.
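
For comparison, a hedged sketch of the block_in_place variant mentioned above; the transform is again a hypothetical stand-in:

    use tokio::task;

    fn transform_rows(rows: &mut [u64]) {
        for r in rows.iter_mut() {
            *r = r.wrapping_mul(31);
        }
    }

    // Sketch: block_in_place keeps the work on the current worker thread but tells the
    // scheduler to hand that worker's queued tasks to other workers first. It only works
    // on the multi-threaded runtime, and it still scatters tokio-specific calls through
    // our code, which is the concern raised above.
    async fn shuffle_inputs_in_place(rows: &mut [u64]) {
        task::block_in_place(|| transform_rows(rows));
    }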

I want to explore the option of having a monitor thread. It may be nice to detect a blocked runtime, and the performance tradeoff may be acceptable: it will cause more wakeups, but in one centralized place where we can control their frequency.

andyleiserson commented 2 months ago

Is there a way for the monitor thread to reclaim the poller if it detects a long-running task?

If we do come up with a solution involving spawn_blocking, block_in_place, or yield_now, we can hide it behind something like ctx.for_each(iterator, closure). Maybe not ideal, but it seems like a simpler coding guideline than deciding between seq_join and parallel_join.
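
A rough sketch of what such a helper could look like if it were built on yield_now; the name, chunk size, and free-function form (rather than a method on the context) are all assumptions:

    use tokio::task;

    // Process an iterator in fixed-size chunks, yielding back to the scheduler between
    // chunks so a long CPU-bound loop cannot monopolize a worker thread.
    async fn for_each_yielding<I, T, F>(items: I, mut f: F)
    where
        I: IntoIterator<Item = T>,
        F: FnMut(T),
    {
        const CHUNK: usize = 4096;
        for (i, item) in items.into_iter().enumerate() {
            f(item);
            if (i + 1) % CHUNK == 0 {
                task::yield_now().await;
            }
        }
    }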

akoshelev commented 2 months ago

Yea, if we go down that road, I'd be in favor of providing an API like that. I see a lot of disadvantages here, though, so I want to try hard to avoid it.

> Is there a way for the monitor thread to reclaim the poller if it detects a long-running task?

I think the way it works is that the monitor thread schedules a cheap future on the runtime it is given, which forces the runtime to wake up at least one more worker thread to handle it (all workers except the busy one are parked). That thread, after completing the future, checks for IO events and polls the TCP buffer to get the request.
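
A rough sketch of that idea, with the interval and naming as assumptions:

    use std::{thread, time::Duration};
    use tokio::runtime::Handle;

    // Sketch: a plain OS thread holds a handle to the runtime and periodically schedules
    // a trivial future onto it. Scheduling work forces the runtime to unpark at least one
    // worker, and that worker also checks for IO events once the task completes.
    fn spawn_runtime_monitor(handle: Handle) -> thread::JoinHandle<()> {
        thread::spawn(move || loop {
            thread::sleep(Duration::from_millis(500));
            handle.spawn(async {}); // no-op task; its only job is to wake a parked worker
        })
    }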

Having said that, while it does unblock the runtime, we are still not getting any hyper requests while shuffle_inputs is running. I am not sure whether there is another issue with the IO driver not being properly migrated to the new worker thread.

2024-06-08T00:33:18.520812Z  INFO ThreadId(16) ipa_core::query::processor: query_status: got the lock
2024-06-08T00:33:18.880170Z  INFO ThreadId(16) oprf_ipa_query{sz=20000000}:shuffle_inputs: ipa_core::protocol::ipa_prf::shuffle: new
2024-06-08T00:33:20.123843Z  INFO ThreadId(19) ipa_core::net::server: observer works
2024-06-08T00:33:22.125028Z  INFO ThreadId(19) ipa_core::net::server: observer works
2024-06-08T00:33:24.126171Z  INFO ThreadId(19) ipa_core::net::server: observer works
...
2024-06-08T00:36:47.091983Z  INFO ThreadId(30) ipa_core::net::server: observer works
2024-06-08T00:36:47.153744Z  INFO ThreadId(30) ipa_core::query::processor: Id(1), query_status: waiting for lock
2024-06-08T00:36:47.153752Z  INFO ThreadId(30) ipa_core::query::processor: query_status: got the lock
akoshelev commented 2 months ago

I know the root cause now: it is the LIFO slot heuristic inside the scheduler that does not let hyper requests be processed even when the runtime is unblocked.

Turning this on resolved the issue, but it requires unstable tokio.
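
For reference, a sketch of the mitigation, assuming the option in question is tokio's unstable Builder::disable_lifo_slot (the crate must be compiled with RUSTFLAGS="--cfg tokio_unstable" for it to be available):

    use tokio::runtime::{Builder, Runtime};

    // Sketch: build the multi-threaded runtime with the LIFO slot optimization disabled.
    // A task sitting in a worker's LIFO slot cannot be stolen by other workers, so if
    // that worker is stuck in a long poll, the task (e.g. a hyper connection) waits with it.
    fn build_runtime() -> std::io::Result<Runtime> {
        Builder::new_multi_thread()
            .enable_all()
            .disable_lifo_slot()
            .build()
    }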

akoshelev commented 2 months ago

mitigated in #1123