akoshelev opened this issue 1 year ago
I started investigating this issue. It appears the metrics crate does provide some ability to separate the registration of a metric (expensive, because it requires a global lock) from value updates for the metric (which ought to be cheap). Specifically, instead of using convenience macros like increment_counter!, code can call register_counter() once and then counter.increment() separately.
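A minimal sketch of the difference, assuming the metrics 0.21-style API that was current at the time (the macro names changed in later releases, and the metric name records.sent is made up for illustration):

```rust
use metrics::{increment_counter, register_counter, Counter};

fn hot_loop_with_macro(n: u64) {
    for _ in 0..n {
        // the convenience macro resolves the key and goes through the
        // recorder's registry on every single iteration
        increment_counter!("records.sent");
    }
}

fn hot_loop_with_handle(n: u64) {
    // pay the registration cost (global lock) once up front...
    let counter: Counter = register_counter!("records.sent");
    for _ in 0..n {
        // ...and only bump the pre-registered handle afterwards
        counter.increment(1);
    }
}
```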
Unfortunately, just doing that change in the straightforward way (moving metric registration from InstrumentedIndexedSharedRandomness::generate_values to InstrumentedIndexedSharedRandomness::new) doesn't help us very much, because we also call new() repeatedly in the inner loop.
I also noticed that, while not as prominent in the profile as the metrics are right now, the DashMap-based get_sender and get_receiver functions are taking a non-negligible amount of time.
> Unfortunately, just doing that change in the straightforward way (moving metric registration from InstrumentedIndexedSharedRandomness::generate_values to InstrumentedIndexedSharedRandomness::new) doesn't help us very much, because we also call new() repeatedly in the inner loop.
There is a centralized place where all metrics are required to be pre-registered: https://github.com/private-attribution/ipa/blob/0003ddaa09d4f9497175424456b756a0ddaba784/src/telemetry/mod.rs#L56. We could probably collect these describe calls into a table and stick it inside the (inner) context.
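A hypothetical sketch of that idea: register every known counter once, keep the handles in a small table that lives next to the (inner) context, and let hot paths bump the handles without touching the global registry. Struct, field, and metric names below are made up; the real list would mirror the describe calls in src/telemetry/mod.rs:

```rust
use metrics::{register_counter, Counter};

/// Pre-registered metric handles, created once per context.
#[derive(Clone)]
pub struct MetricHandles {
    pub records_sent: Counter,
    pub indexed_prss_generated: Counter,
}

impl MetricHandles {
    pub fn new() -> Self {
        // one-time cost: these calls take the recorder's global lock
        Self {
            records_sent: register_counter!("records.sent"),
            indexed_prss_generated: register_counter!("indexed.prss.generated"),
        }
    }
}

// later, on the hot path (hypothetical accessor):
// ctx.metrics().indexed_prss_generated.increment(1); // cheap update, no registry lock
```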
> I also noticed that, while not as prominent in the profile as the metrics are right now, the DashMap-based get_sender and get_receiver functions are taking a non-negligible amount of time.
Yep, each send/receive operation checks whether there is an entry in this table. DashMap sharding helps, but it is likely not sufficient in this case. A thread-local cache could be a mitigation here. I am also thinking about pinning steps to threads for data-locality purposes - that eliminates the need for locking. I don't know yet how to implement that in our code base, though (SPSC and a task per step?).
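A rough sketch of the thread-local cache mitigation (types and names are hypothetical stand-ins, not the actual IPA gateway code): consult a per-thread HashMap first and only fall back to the shared DashMap on a miss, so repeated lookups of the same step stop contending on the shared shards:

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::sync::Arc;

use dashmap::DashMap;

type Step = String;
type Sender = Arc<()>; // stand-in for the real sender type

thread_local! {
    // per-thread view of the steps this thread has already looked up
    static LOCAL_SENDERS: RefCell<HashMap<Step, Sender>> = RefCell::new(HashMap::new());
}

fn get_sender(shared: &DashMap<Step, Sender>, step: &Step) -> Sender {
    LOCAL_SENDERS.with(|cache| {
        // fast path: no shared state touched at all
        if let Some(sender) = cache.borrow().get(step) {
            return sender.clone();
        }
        // slow path: hit the shared (sharded) map once, then remember it locally
        let sender = shared
            .entry(step.clone())
            .or_insert_with(|| Arc::new(()))
            .clone();
        cache.borrow_mut().insert(step.clone(), sender.clone());
        sender
    })
}
```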
OPRF-friendly attribution, when run on a multi-core machine and leveraging more than one CPU core, spends a lot of time inside the metrics crate. I looked inside the code a bit, and it currently costs us dearly.
I ran IPA for 100k rows with parallel execution enabled through SequentialFutures on three actual machines. The performance impact of metrics is horrendous.

with metrics enabled:

without metrics:
Root cause
DebuggingRecorder used in IPA uses a globally synchronized map to store the counters and keys seen by it - code. As we emit metrics for every send/receive operation, it costs us a whopping 50% of execution time because threads contend on the same lock. It wasn't such a big deal before, when we had only one thread running the application; with thread-per-CPU-core execution it is quite evident that this lock is a problem.
Solutions
There are a couple of avenues to explore:
One avenue is to switch to per-thread metric collection and take a local snapshot (snapshot). Using it as is will break our metrics. Maybe there is a way to collect metrics from all running threads? Ideally, the metrics engine should allow collecting metrics at any time while the program is running, but for now we can live with one-time collection at the end, if that makes our life easier.
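For the one-time collection at the end, a minimal illustration of the shape this could take (plain std types, not the metrics crate API; the counter name is made up): each thread accumulates into its own map, and the shared lock is only taken once per thread when the work is done:

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::thread;

fn main() {
    // merged view; its lock is taken once per thread, at the very end
    let merged: Mutex<HashMap<&'static str, u64>> = Mutex::new(HashMap::new());

    thread::scope(|s| {
        for _ in 0..4 {
            s.spawn(|| {
                // per-thread, uncontended accumulation in the hot loop
                let mut local: HashMap<&'static str, u64> = HashMap::new();
                for _ in 0..100_000 {
                    *local.entry("records.sent").or_insert(0) += 1;
                }
                // one-time merge when this thread finishes its work
                let mut m = merged.lock().unwrap();
                for (name, value) in local {
                    *m.entry(name).or_insert(0) += value;
                }
            });
        }
    });

    println!("{:?}", merged.lock().unwrap());
}
```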