private-attribution / ipa

A raw implementation of Interoperable Private Attribution
MIT License
42 stars 25 forks source link

Introduce IpaRuntime and plumb it all the way down to executor #1336

Closed akoshelev closed 1 month ago

akoshelev commented 1 month ago

The time has come to have a need to have a precise control over which runtime is used to run queries. The reason for that is that we had another occurrence of HTTP keep-alive timeout that aligns with OPRF computation.

https://draft-mpc.vercel.app/query/view/nervy-fret2024-10-05T0245

2024-10-05T03:57:35.203964 - 2024-10-05T03:57:35.203922Z  INFO oprf_ipa_query{sz=50000000}:compute_prf_for_inputs: ipa_core::protocol::context::batcher: batch 30 is ready for validation
2024-10-05T03:57:35.206943 - 2024-10-05T03:57:35.206901Z  INFO oprf_ipa_query{sz=50000000}:compute_prf_for_inputs: ipa_core::protocol::context::batcher: batch 31 is ready for validation
2024-10-05T03:57:35.209813 - 2024-10-05T03:57:35.209777Z  INFO oprf_ipa_query{sz=50000000}:compute_prf_for_inputs: ipa_core::protocol::context::batcher: batch 32 is ready for validation
2024-10-05T03:58:28.720077 - 2024-10-05T03:58:28.715520Z ERROR ipa_core::error: ThreadId(9) "tokio-runtime-worker" panicked at ipa-core/src/helpers/gateway/send.rs:222:30:
2024-10-05T03:58:28.720653 - {channel_id:?} receiving end should be accepted by transport: SendToRoleError(H1, ConnectError { dest: "helper1.ipa-helper.dev", inner: hyper_util::client::legacy::Error(SendRequest,>
2024-10-05T03:58:28.720957 - stack trace:
2024-10-05T03:58:28.721250 -    0: ipa_core::error::set_global_panic_hook::{{closure}}
2024-10-05T03:58:28.721524 -    1: std::panicking::rust_panic_with_hook
2024-10-05T03:58:28.721776 -    2: std::panicking::begin_panic_handler::{{closure}}
2024-10-05T03:58:28.722034 -    3: std::sys_common::backtrace::__rust_end_short_backtrace
2024-10-05T03:58:28.722281 -    4: rust_begin_unwind
2024-10-05T03:58:28.722522 -    5: core::panicking::panic_fmt
2024-10-05T03:58:28.722764 -    6: core::result::unwrap_failed
2024-10-05T03:58:28.723006 -    7: ipa_core::helpers::gateway::send::GatewaySenders<I>::get::{{closure}}
2024-10-05T03:58:28.723246 -    8: tokio::runtime::task::core::Core<T,S>::poll
2024-10-05T03:58:28.723490 -    9: tokio::runtime::task::harness::Harness<T,S>::poll
2024-10-05T03:58:28.723749 -   10: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
2024-10-05T03:58:28.723986 -   11: tokio::runtime::scheduler::multi_thread::worker::Context::run
2024-10-05T03:58:28.724235 -   12: tokio::runtime::context::runtime::enter_runtime
2024-10-05T03:58:28.724484 -   13: tokio::runtime::scheduler::multi_thread::worker::run
2024-10-05T03:58:28.724732 -   14: tokio::runtime::task::core::Core<T,S>::poll
2024-10-05T03:58:28.724974 -   15: tokio::runtime::task::harness::Harness<T,S>::poll
2024-10-05T03:58:28.725221 -   16: tokio::runtime::blocking::pool::Inner::run
2024-10-05T03:58:28.725462 -   17: std::sys_common::backtrace::__rust_begin_short_backtrace
2024-10-05T03:58:28.725705 -   18: core::ops::function::FnOnce::call_once{{vtable.shim}}
2024-10-05T03:58:28.725946 -   19: std::sys::pal::unix::thread::Thread::new::thread_start
2024-10-05T03:58:28.726193 -   20: start_thread
2024-10-05T03:58:28.726422 -   21: __clone3

The root cause for this is PRF computation blocking scheduler for too long, so it does not schedule Hyper task to respond to status requests from RC, or to accept data from another peer. While it deserves to be fixed (I believe @danielmasny was looking into why we trash CPU so badly in PRF), it is not OK to crash if that happens.

This change just does the plumbing to allow dedicated runtime to be provided for query executors.

codecov[bot] commented 1 month ago

Codecov Report

Attention: Patch coverage is 56.33803% with 31 lines in your changes missing coverage. Please review.

Project coverage is 93.54%. Comparing base (58e3a25) to head (67e3828). Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
ipa-core/src/query/executor.rs 32.50% 27 Missing :warning:
ipa-core/src/app.rs 20.00% 4 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #1336 +/- ## ========================================== - Coverage 93.60% 93.54% -0.06% ========================================== Files 209 209 Lines 34274 34331 +57 ========================================== + Hits 32082 32116 +34 - Misses 2192 2215 +23 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

akoshelev commented 1 month ago

Can you check out why the tests are failing?

yea it does not seem relevant to this change. Nightly Rust seems to be broken again.

     Running unittests src/lib.rs (target/debug/deps/ipa_step_derive-f6e14ae7c576baea)
/home/runner/work/ipa/ipa/target/debug/deps/ipa_step_derive-f6e14ae7c576baea: error while loading shared libraries: libstd-f157c25fb2dbfbe0.so: cannot open shared object file: No such file or directory
error: test failed, to rerun pass `-p ipa-step-derive --lib`

Caused by:
  process didn't exit successfully: `/home/runner/work/ipa/ipa/target/debug/deps/ipa_step_derive-f6e14ae7c576baea` (exit status: 127)

It is failing on other PRs as well - https://github.com/private-attribution/ipa/actions/runs/11221190944/job/31190994974. I can't find the issue in Rust repo that matches this problem, probably it is a fairly fresh breakage

akoshelev commented 1 month ago

I want to take another look at the shape of this API to make sure it is ergonomic enough to use, so this should be a draft for now

akoshelev commented 1 month ago

I am confident now that this API works, will be submitting a follow up PR soon