private-attribution / ipa

A raw implementation of Interoperable Private Attribution

Reduce thread contention on metrics #798

Open akoshelev opened 1 year ago

akoshelev commented 1 year ago

OPRF-friendly attribution, when run on a multi-core machine and leveraging more than one CPU core, spends a lot of time inside the metrics crate. I looked inside the code a bit, and it currently costs us dearly.

(flamegraph attachment: flamegraph-linux-par)

I ran IPA for 100k rows with parallel execution enabled through SequentialFutures on three actual machines. The performance impact of metrics is horrendous

with metrics enabled:

Running IPA for 100000 records took 661.851748826s

without metrics:

Running IPA for 100000 records took 345.480439625s

Root cause

The DebuggingRecorder used in IPA keeps a globally synchronized map of the counters and keys it has seen (code). Since we emit metrics for every send/receive operation, this costs us a whopping ~50% of execution time because threads contend on the same lock.

It wasn't such a big deal before, when we had only one thread running the application; with thread-per-CPU-core execution it is quite evident that this lock is a problem.
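Roughly, the pattern we end up with is equivalent to this (a simplified model for illustration, not the actual metrics-util code):

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Simplified model of the bottleneck: every metric emission locks the same
// global map, so with thread-per-core workers emitting counters on every
// send/receive, all cores serialize on this one lock.
static COUNTERS: Mutex<Option<HashMap<&'static str, u64>>> = Mutex::new(None);

fn increment_counter(name: &'static str) {
    let mut guard = COUNTERS.lock().unwrap();
    *guard
        .get_or_insert_with(HashMap::new)
        .entry(name)
        .or_insert(0) += 1;
}
```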

Solutions

There are a couple of avenues to explore:

Ideally, the metrics engine should allow collecting metrics at any time while the program is running, but for now we can live with one-time collection at the end, if that makes our life easier.

andyleiserson commented 1 year ago

I started investigating this issue. It appears the metrics crate does provide some ability to separate the registration of a metric (expensive, because it requires a global lock) from value updates for the metric (ought to be cheap). Specifically, instead of using the convenience macros like increment_counter!, code can instead call register_counter() and counter.increment() separately.
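As a rough sketch of that split, assuming the handle-based register_counter!/Counter API of the metrics crate (the struct and metric name below are made up for illustration, not the actual IPA code):

```rust
use metrics::Counter;

// Illustrative wrapper: register once, update cheaply many times.
struct SendMetrics {
    // Registration resolves the metric through the recorder's registry
    // (the expensive, globally locked part) and returns a handle.
    records_sent: Counter,
}

impl SendMetrics {
    fn new() -> Self {
        Self {
            records_sent: metrics::register_counter!("records.sent"),
        }
    }

    fn on_send(&self) {
        // Per-event update on the pre-registered handle; no registry lookup,
        // unlike increment_counter!("records.sent"), which resolves the metric
        // every time it is called.
        self.records_sent.increment(1);
    }
}
```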

Unfortunately, just doing that change in the straightforward way (move metric registration from InstrumentedIndexedSharedRandomness::generate_values to InstrumentedIndexedSharedRandomness::new) doesn't help us very much, because we also call new() repeatedly in the inner loop.

I also noticed that, while not as prominent in the profile as the metrics are right now, the DashMap-based get_sender and get_receiver functions are taking a non-negligible amount of time.

akoshelev commented 1 year ago

> Unfortunately, just doing that change in the straightforward way (move metric registration from InstrumentedIndexedSharedRandomness::generate_values to InstrumentedIndexedSharedRandomness::new) doesn't help us very much, because we also call new() repeatedly in the inner loop.

There is a centralized place where all metrics are required to be pre-registered: https://github.com/private-attribution/ipa/blob/0003ddaa09d4f9497175424456b756a0ddaba784/src/telemetry/mod.rs#L56. We could probably collect these describe calls into a table and stick it inside the (inner) context, along the lines of the sketch below.
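A rough sketch of that idea, again assuming the handle-based register_counter!/Counter API; the metric names and types here are placeholders, not the real telemetry list:

```rust
use std::collections::HashMap;
use std::sync::Arc;

use metrics::Counter;

// Illustrative table of pre-registered counter handles, built once and shared
// through the context.
#[derive(Clone)]
struct MetricsTable {
    counters: Arc<HashMap<&'static str, Counter>>,
}

impl MetricsTable {
    fn new() -> Self {
        let mut counters = HashMap::new();
        // Mirror the describe_* list from src/telemetry/mod.rs and register each
        // metric exactly once, paying the registry lock only here.
        counters.insert("records.sent", metrics::register_counter!("records.sent"));
        counters.insert("records.received", metrics::register_counter!("records.received"));
        Self {
            counters: Arc::new(counters),
        }
    }

    fn increment(&self, name: &'static str) {
        // Hot path: no recorder lookup, no global lock, just the cached handle.
        if let Some(counter) = self.counters.get(name) {
            counter.increment(1);
        }
    }
}
```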

> I also noticed that, while not as prominent in the profile as the metrics are right now, the DashMap-based get_sender and get_receiver functions are taking a non-negligible amount of time.

Yep, each send/receive operation checks whether there is an entry in this table. DashMap sharding helps, but it is likely not sufficient in this case. A thread-local cache could be a mitigation here - something like the sketch below. I am also thinking about pinning steps to threads for data locality purposes - that would eliminate the need for locking. I don't know yet how to implement that in our code base, though (SPSC and a task per step?).
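A rough sketch of the thread-local cache idea; the ChannelId and Sender types are stand-ins for the real ones, and it assumes an entry, once created, is never replaced:

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::sync::Arc;

use dashmap::DashMap;

// Placeholders for the real channel id and sender types.
type ChannelId = String;
struct Sender;

struct Senders {
    shared: DashMap<ChannelId, Arc<Sender>>,
}

thread_local! {
    // Per-thread cache of handles this thread has already resolved.
    static LOCAL: RefCell<HashMap<ChannelId, Arc<Sender>>> = RefCell::new(HashMap::new());
}

impl Senders {
    fn get_sender(&self, channel: &ChannelId) -> Arc<Sender> {
        LOCAL.with(|cache| {
            // Fast path: no shared-state access at all.
            if let Some(sender) = cache.borrow().get(channel) {
                return Arc::clone(sender);
            }
            // Miss: fall back to the sharded map (contended), then remember the
            // result locally so this thread never asks again for this channel.
            let sender = Arc::clone(
                &self
                    .shared
                    .entry(channel.clone())
                    .or_insert_with(|| Arc::new(Sender)),
            );
            cache.borrow_mut().insert(channel.clone(), Arc::clone(&sender));
            sender
        })
    }
}
```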