open-telemetry / opentelemetry-rust

The Rust OpenTelemetry implementation
https://opentelemetry.io
Apache License 2.0

[Feature]: Metrics - Bound instrument to achieve extremely high performance #1374

Open cijothomas opened 10 months ago

cijothomas commented 10 months ago

Related Problems?

The performance of the Metrics SDK is not good enough for use in performance-critical services.

Describe the solution you'd like:

The majority of the time is spent locating the time series to aggregate into. If the attribute keys and values are known ahead of time, we can avoid this lookup. This used to exist in the spec as a "bound instrument", but it was removed from the initial stable spec due to lack of demand for the feature.

https://github.com/open-telemetry/opentelemetry-specification/blob/12b5b27d4b6a4c0197560f508f0be3c38d190dc4/specification/metrics/old_api.md#bound-instrument-calling-convention
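To make the idea concrete, here is a minimal std-only sketch of the bound-instrument calling convention. All names are hypothetical (this is not the opentelemetry-rust API): binding resolves the time series once, so each subsequent add() is a single atomic operation with no attribute lookup.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// Hypothetical bound instrument: it holds a direct reference to the
// aggregation slot for one fixed attribute set, resolved at bind time.
struct BoundCounter {
    slot: Arc<AtomicU64>,
}

impl BoundCounter {
    // No attribute canonicalization, no map lookup: one atomic add.
    fn add(&self, value: u64) {
        self.slot.fetch_add(value, Ordering::Relaxed);
    }
}
```

The per-measurement cost is then independent of how many attributes the bound set contains, which is exactly the property the benchmarks below show the unbound path lacks.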

Considered Alternatives

I am not aware of any other solution. Based on SDK benchmarks in OTel Rust and other languages (.NET in particular was used for comparison), the only way to get truly fast performance is to avoid the manipulation/massaging of attributes and the time-series lookup.

https://github.com/open-telemetry/opentelemetry-rust/blob/main/opentelemetry-sdk/benches/metric.rs#L82-L87 shows that the highest performance is achieved with no attributes, and performance decreases significantly with each added attribute.

Additional Context

No response

cijothomas commented 10 months ago

Similar benchmark from OTel .NET metrics, showing the same conclusion: https://github.com/open-telemetry/opentelemetry-dotnet/blob/main/test/Benchmarks/Metrics/MetricsBenchmarks.cs#L34-L36

jtescher commented 10 months ago

Worth re-opening a spec issue? Seems like all language implementations will run into this same problem with the current APIs

cijothomas commented 10 months ago

Will open a spec issue to gauge interest from others. I am not sure every other language has such rigorous perf needs, so a likely outcome is OTel Rust offering this under an opt-in feature flag to begin with, and then trying to push it through the spec.

lalitb commented 10 months ago

Agree. I feel there is value in exploring and implementing this feature, at least for languages used in high-throughput, low-latency environments. I will open a similar support request for otel-cpp. Meanwhile, it would be good to open a spec issue to initiate the discussion.

KallDrexx commented 10 months ago

I'm going to start exploring adding bound instrument support to the Rust library, at least to give us a fast path for high-throughput metrics with relatively stable attribute sets.

For our application, when we create a counter, that counter is always incremented with a stable set of dynamic attributes. Those attributes can change for a particular counter, but only infrequently over the life of that counter instance (and once they change, they won't change again for a while).

As a data point, under a load test of our application:

I did a quick experiment replacing the BTreeMap inside AttributeSet with an unsorted Vec, and that brought UpDownCounter::add() down to 4.7%. I'm not sure how much the sorting logic will add back (I plan to experiment with that).
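The experiment described above can be sketched roughly like this (a hypothetical stand-in for the SDK's AttributeSet, not the actual implementation): storing the attributes in an unsorted Vec trades the BTreeMap's ordering cost for the assumption that callers pass attributes in a consistent order.

```rust
// Hypothetical sketch: attributes kept in caller order, no sorting.
// Construction is an O(n) copy instead of O(n log n) ordered insertion.
#[derive(Clone, PartialEq, Eq, Hash, Debug)]
struct AttributeSet(Vec<(String, String)>);

impl AttributeSet {
    fn from(attrs: &[(&str, &str)]) -> Self {
        AttributeSet(
            attrs
                .iter()
                .map(|(k, v)| (k.to_string(), v.to_string()))
                .collect(),
        )
    }
}
```

The tradeoff: two sets containing the same pairs in a different order no longer compare equal, so they would land in different time series. That is why sorting logic may need to be added back.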

I also experimented with a wrapper around Counter<T> and UpDownCounter<T> that aggregates values and only flushes to OpenTelemetry every 10 seconds, or when the attributes for that counter instance change. That brought add() times down to 2.3%, and Measure::call() times down to 1.01%.
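A rough sketch of such a batching wrapper, under stated assumptions (all names hypothetical; the flush closure stands in for the real Counter::add call):

```rust
use std::time::{Duration, Instant};

// Hypothetical batching wrapper: accumulate adds locally and hand the
// running total to the underlying instrument only when the interval
// elapses (or when explicitly flushed).
struct BatchingCounter<F: FnMut(u64)> {
    pending: u64,
    last_flush: Instant,
    interval: Duration,
    flush: F, // stand-in for the real instrument's add() call
}

impl<F: FnMut(u64)> BatchingCounter<F> {
    fn new(interval: Duration, flush: F) -> Self {
        Self { pending: 0, last_flush: Instant::now(), interval, flush }
    }

    fn add(&mut self, value: u64) {
        self.pending += value;
        if self.last_flush.elapsed() >= self.interval {
            self.force_flush();
        }
    }

    fn force_flush(&mut self) {
        if self.pending > 0 {
            (self.flush)(self.pending);
            self.pending = 0;
        }
        self.last_flush = Instant::now();
    }
}
```

In a real wrapper the flush would also be triggered when the attribute set changes and on shutdown, as described above; otherwise measurements recorded near the end of the interval could be lost.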

Both of these lead me to conclude that implementing a bound instrument API can yield performance gains. By pre-registering the attribute sets, we only incur the AttributeSet::from() cost once. Likewise, if ValueMap<T> stores the time series it tracks in a Vec<T> instead of a hash map, each AttributeSet can be given an Arc<> that internally points to an index in that vector. Each add() call for a bound instrument then has no hashing cost, and incrementing the value is a simple vector index lookup plus a sum (we don't need cardinality checks either, since we know the series exists in this case).

This may cause a slight performance impact on ad-hoc attribute sets, so I'll track that to make sure their timings stay acceptable.
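The scheme described above might look something like this minimal sketch (hypothetical types; the real ValueMap<T> differs): binding hashes the attribute set once and returns an index, and every subsequent add() is a plain vector lookup plus an atomic add.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical sketch of a Vec-backed value map with bind-once handles.
struct ValueMap {
    index: HashMap<Vec<(String, String)>, usize>, // hashed only on bind
    series: Vec<AtomicU64>,                       // one slot per time series
}

impl ValueMap {
    fn new() -> Self {
        Self { index: HashMap::new(), series: Vec::new() }
    }

    // Pay the attribute-canonicalization and hashing cost once, up front.
    fn bind(&mut self, attrs: &[(&str, &str)]) -> usize {
        let key: Vec<(String, String)> = attrs
            .iter()
            .map(|(k, v)| (k.to_string(), v.to_string()))
            .collect();
        if let Some(&i) = self.index.get(&key) {
            return i;
        }
        self.series.push(AtomicU64::new(0));
        let i = self.series.len() - 1;
        self.index.insert(key, i);
        i
    }

    // No hashing, no cardinality check: the series is known to exist.
    fn add(&self, handle: usize, value: u64) {
        self.series[handle].fetch_add(value, Ordering::Relaxed);
    }
}
```

A real implementation would return a shared handle (the Arc<> mentioned above) rather than a bare index, so the SDK can keep the slot alive and coordinate with collection.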

cijothomas commented 10 months ago

Independent of the bound-instrument exploration, we have other optimizations that can significantly improve speed: the AttributeSet de-duplication/sorting is likely a significant cost (based on my measurements in other languages, and likely the same in Rust)!