open-telemetry / opentelemetry-java

OpenTelemetry Java SDK
https://opentelemetry.io
Apache License 2.0
1.96k stars 806 forks source link

Any way to reduce performance overhead in DefaultSynchronousMetricStorage? #6601

Open wgy035 opened 1 month ago

wgy035 commented 1 month ago

Is your feature request related to a problem? Please describe. While conducting a performance test, I discovered through flame graph analysis that DefaultSynchronousMetricStorage.record() causes extra performance overhead. image

Is there any way to reduce performance overhead in DefaultSynchronousMetricStorage? Describe the solution you'd like 1.The process of AdviceAttributesProcessor seems only remove some attributes from input and it cost a lot, could we skip the remove process? 2.CollectionAggregatorHandles used ConcurrentHashMap, the performance is suboptimal in high concurrency scenarios, consider using a Lock-Free Queue instead, such as Disruptor or RingBuffer.

Describe alternatives you've considered A clear and concise description of any alternative solutions or features you've considered.

Additional context Add any other context or screenshots about the feature request here.

jack-berg commented 1 month ago

The AdviceAttributesProcessor is involved when a measurement is recorded to an instrument which is using the (experimental) setAttributesAdvice(List<AttributeKey<?>> attributes) (e.g. ExtendedDoubleHistogramBuilder#setAttributesAdvice.

If setAttributesAdvice is not set, there is no AdviceAttributesProcessor and the overhead is reduced. I wrote a detailed blog post about the otel java metrics implementation, and suffice it to say a lot of performance engineering has been done and there isn't much room for improvement.

So if setAttributesAdvice is causing overhead, why is it being called. It looks like you're using the otel java agent. The setAttributesAdvice API serves a key use case for the agent in that it allows users to opt-in to additional attributes besides the default set which is recorded. For example, the http.server.request.duration metric defines a bunch of safe, low cardinality attributes to include on every measurement. But some users want to opt-in to additional higher cardinality attributes which are normally reserved for http spans. The otel java agent instrumentation records measurements with all the attributes (both high and low cardinality), and uses the setAttributesAdvice API to instruct the otel metrics SDK that only a subset of them should be retained by default. Users can override this by configuring views to select additional attributes.

But the consequence of this is that the otel metrics SDK has to do additional work for every measurement. Measurements come in with all attributes, and the SDK needs to trim that attributes set down to the subset retained for aggregation. This takes CPU and memory. There may be ways to optimize AdviceAttributesProcessor, but I suspect they'll be minor improvements.

trask commented 1 month ago

@wgy035 how does this compare to the overhead of everything under io/opentelemetry/javaagent/?

github-actions[bot] commented 1 month ago

This has been automatically marked as stale because it has been marked as needing author feedback and has not had any activity for 7 days. It will be closed if no further activity occurs within 7 days of this comment.

wgy035 commented 1 month ago

@jack-berg Thanks for your detailed response! I currently understand the purpose of AdviceAttributesProcessor, but in the case of java agent usage, I think that the choice of additional attributes should be defined in Instrumenter.

I've observed that otel java agent has already traversed all attributes in the process of OperationListener.onEnd before calling the SDK(e.g. HttpServerMetrics#onEnd), so I think introducing AdviceAttributesProcessor may be an extra operation and we can remove attributes that users don't need in OperationListener.onEnd, just as what is done in AdviceAttributesProcessor.

@trask In my performance test, it accounts for 9% of the total usage under io/opentelemetry/javaagent/

jack-berg commented 1 month ago

Opened #6629 as a minor optimization to exit early when advice doesn't end up removing any attributes. This still doesn't help the common case of when advice does result in attributes being removed.