akoshelev opened this issue 1 year ago
I started investigating this issue. It appears the metrics crate does provide some ability to separate the registration of a metric (expensive, because it requires a global lock) from value updates for the metric (which ought to be cheap). Specifically, instead of using convenience macros like increment_counter!, code can call register_counter() once and then counter.increment() separately.
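A minimal sketch of the difference, assuming the metrics 0.21-style API that was current at the time (the macro names changed in later releases, and the metric name records.sent is made up for illustration):

```rust
use metrics::{increment_counter, register_counter, Counter};

fn hot_loop_with_macro(n: u64) {
    for _ in 0..n {
        // the convenience macro resolves the key and goes through the
        // recorder's registry on every single iteration
        increment_counter!("records.sent");
    }
}

fn hot_loop_with_handle(n: u64) {
    // pay the registration cost (global lock) once up front...
    let counter: Counter = register_counter!("records.sent");
    for _ in 0..n {
        // ...and only bump the pre-registered handle afterwards
        counter.increment(1);
    }
}
```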
Unfortunately, just doing that change in the straightforward way (moving metric registration from InstrumentedIndexedSharedRandomness::generate_values to InstrumentedIndexedSharedRandomness::new) doesn't help us very much, because we also call new() repeatedly in the inner loop.
I also noticed that, while not as prominent in the profile as the metrics are right now, the DashMap-based get_sender and get_receiver functions are taking a non-negligible amount of time.
> Unfortunately, just doing that change in the straightforward way (moving metric registration from InstrumentedIndexedSharedRandomness::generate_values to InstrumentedIndexedSharedRandomness::new) doesn't help us very much, because we also call new() repeatedly in the inner loop.
There is a centralized place where all metrics are required to be pre-registered: https://github.com/private-attribution/ipa/blob/0003ddaa09d4f9497175424456b756a0ddaba784/src/telemetry/mod.rs#L56. We could probably collect these describe calls into a table and stick it inside the (inner) context.
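A hypothetical sketch of that idea: register every known counter once, keep the handles in a small table that lives next to the (inner) context, and let hot paths bump the handles without touching the global registry. Struct, field, and metric names below are made up; the real list would mirror the describe calls in src/telemetry/mod.rs:

```rust
use metrics::{register_counter, Counter};

/// Pre-registered metric handles, created once per context.
#[derive(Clone)]
pub struct MetricHandles {
    pub records_sent: Counter,
    pub indexed_prss_generated: Counter,
}

impl MetricHandles {
    pub fn new() -> Self {
        // one-time cost: these calls take the recorder's global lock
        Self {
            records_sent: register_counter!("records.sent"),
            indexed_prss_generated: register_counter!("indexed.prss.generated"),
        }
    }
}

// later, on the hot path (hypothetical accessor):
// ctx.metrics().indexed_prss_generated.increment(1); // cheap update, no registry lock
```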
> I also noticed that, while not as prominent in the profile as the metrics are right now, the DashMap-based get_sender and get_receiver functions are taking a non-negligible amount of time.
Yep, each send/receive operation checks whether there is an entry in this table. DashMap sharding helps, but it is likely not sufficient in this case. A thread-local cache could be a mitigation here. I am also thinking about pinning steps to threads for data-locality purposes - that eliminates the need for locking. I don't know yet how to implement that in our code base, though (SPSC and a task per step?).
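A rough sketch of the thread-local cache mitigation (types and names are hypothetical stand-ins, not the actual IPA gateway code): consult a per-thread HashMap first and only fall back to the shared DashMap on a miss, so repeated lookups of the same step stop contending on the shared shards:

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::sync::Arc;

use dashmap::DashMap;

type Step = String;
type Sender = Arc<()>; // stand-in for the real sender type

thread_local! {
    // per-thread view of the steps this thread has already looked up
    static LOCAL_SENDERS: RefCell<HashMap<Step, Sender>> = RefCell::new(HashMap::new());
}

fn get_sender(shared: &DashMap<Step, Sender>, step: &Step) -> Sender {
    LOCAL_SENDERS.with(|cache| {
        // fast path: no shared state touched at all
        if let Some(sender) = cache.borrow().get(step) {
            return sender.clone();
        }
        // slow path: hit the shared (sharded) map once, then remember it locally
        let sender = shared
            .entry(step.clone())
            .or_insert_with(|| Arc::new(()))
            .clone();
        cache.borrow_mut().insert(step.clone(), sender.clone());
        sender
    })
}
```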
OPRF-friendly attribution, when run on a multi-core machine and leveraging more than one CPU core, spends a lot of time inside the metrics crate. I looked inside the code a bit, and it currently costs us dearly.
I ran IPA for 100k rows with parallel execution enabled through SequentialFutures on three actual machines. The performance impact of metrics is horrendous.

with metrics enabled:

without metrics:
Root cause
DebuggingRecorder used in IPA uses a globally synchronized map to store the counters and keys seen by it - code. As we emit metrics for every send/receive operation, it costs us a whopping 50% of execution time because threads contend on the same lock. It wasn't such a big deal before, when we had only one thread running the application; with thread-per-CPU-core execution it is quite evident that this lock is a problem.
Solutions
There are a couple of avenues to explore:
One avenue is to switch to per-thread metric collection and take a local snapshot (snapshot). Using it as is will break our metrics. Maybe there is a way to collect metrics from all running threads? Ideally, the metrics engine should allow collecting metrics at any time while the program is running, but for now we can live with one-time collection at the end, if that makes our life easier.
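For the one-time collection at the end, a minimal illustration of the shape this could take (plain std types, not the metrics crate API; the counter name is made up): each thread accumulates into its own map, and the shared lock is only taken once per thread when the work is done:

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::thread;

fn main() {
    // merged view; its lock is taken once per thread, at the very end
    let merged: Mutex<HashMap<&'static str, u64>> = Mutex::new(HashMap::new());

    thread::scope(|s| {
        for _ in 0..4 {
            s.spawn(|| {
                // per-thread, uncontended accumulation in the hot loop
                let mut local: HashMap<&'static str, u64> = HashMap::new();
                for _ in 0..100_000 {
                    *local.entry("records.sent").or_insert(0) += 1;
                }
                // one-time merge when this thread finishes its work
                let mut m = merged.lock().unwrap();
                for (name, value) in local {
                    *m.entry(name).or_insert(0) += value;
                }
            });
        }
    });

    println!("{:?}", merged.lock().unwrap());
}
```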