Metric AggregatorStore improvements

cijothomas commented 3 years ago

Opening an issue to track future improvements.

The current AggStore uses 2 data structures to find the MetricPoint. One is an pre-allocated array of MetricPoint, where the actual points are stored, and then a concurrentdictionary (2 level), to find the index within the array for a given keys/values combinations.

The current design requires Sorting based on keys. Based on benchmarking, this is the biggest contributor to the hot path. One potential way to improve this is to add 2 entries to the dictionary - one with the 1st seen order of keys, and another with sorted order. If subsequent recordings are with same order as initial, it can be looked up without sort. If lookup fails, then need to sort and do another lookup. This might be good improvement, if user keeps providing the keys in same order.
Replace the inner concurrent dictionary with hashtable to avoid dual lock on writes - https://github.com/open-telemetry/opentelemetry-dotnet/pull/2339#discussion_r706997727
Design a totally new data structure specialized for the purpose.
Special case 0 tags - see discussion : https://github.com/open-telemetry/opentelemetry-dotnet/pull/2339#discussion_r711217242

Given this is purely internal implementation without any public API, this can be addressed after other high priority asks.

findmyway commented 3 years ago

The current design requires Sorting based on keys. Based on benchmarking, this is the biggest contributor to the hot path. One potential way to improve this is to add 2 entries to the dictionary - one with the 1st seen order of keys, and another with sorted order. If subsequent recordings are with same order as initial, it can be looked up without sort. If lookup fails, then need to sort and do another lookup. This might be good improvement, if user keeps providing the keys in same order.

I like this design. However, there might be a bug here. Let's say we add the first measurement of A with attributes of (c = 3, b=2, a=1), then another one B with attributes of (c=3, a=1, b=2). Based on the description above, the second lookup will fail. And then we sort the attributes of B and get (a=1,b=2,c=3). Note that it's not the same as the attributes of A, and we have to create a new metric here, which is obviously not what we want.

cijothomas commented 3 years ago

The current design requires Sorting based on keys. Based on benchmarking, this is the biggest contributor to the hot path. One potential way to improve this is to add 2 entries to the dictionary - one with the 1st seen order of keys, and another with sorted order. If subsequent recordings are with same order as initial, it can be looked up without sort. If lookup fails, then need to sort and do another lookup. This might be good improvement, if user keeps providing the keys in same order.

I like this design. However, there might be a bug here. Let's say we add the first measurement of A with attributes of (c = 3, b=2, a=1), then another one B with attributes of (c=3, a=1, b=2). Based on the description above, the second lookup will fail. And then we sort the attributes of B and get (a=1,b=2,c=3). Note that it's not the same as the attributes of A, and we have to create a new metric here, which is obviously not what we want.

Not sure I fully understood the bug. When (c = 3, b=2, a=1) is encountered, its lookup will fail. Then its sorted an another lookup is done, which also fails. Then both original order and sorted order is inserted to dictionary. Next, (c=3, a=1, b=2) comes. Its initial lookup fails. Then sorted lookup occurs, which succeeds.

(btw, this is not implemented yet)

findmyway commented 3 years ago

... Then both original order and sorted order is inserted to dictionary...

Now I see. Thanks for your explanation. I thought only the original order is inserted before.

cijothomas commented 2 years ago

@utpilla is looking at implementing the item 1 to avoid sorting costs.

cijothomas commented 2 years ago

The most critical improvements are already shipped as part of 1.2 itself. Removing this from 1.3 milestone, as it can occur anytime. (1.3 is releasing in May 2022 with other improvements)

stevejgordon commented 4 months ago

@cijothomas Do you think there's anything left to do here, or could this be closed now?

cijothomas commented 4 months ago

Nothing very specific is remaining so okay to close. (There is still some improvements for the aggregation logic for better throughput, but I'll open separate issue with specifics.)

open-telemetry / opentelemetry-dotnet

Metric AggregatorStore improvements #2374