opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[RFC] Add metrics and tracing framework in Opensearch #1061

Open itiyamas opened 3 years ago

itiyamas commented 3 years ago

We have been looking into instrumenting the OpenSearch code recently. Even though stats provide a good mechanism, they lose a lot of detail, such as percentiles, which makes it much harder to debug issues in production. Wouldn't it be great to add a metrics framework to OpenSearch that allows a developer to add metrics easily at any point in the code, without having to know the exact stats object where the metric belongs?

The framework could, for example, be integrated with RestActions and emit timing and error metrics per operation by default. Similarly, we could pass this metrics object around via ThreadContext down the executor chain and correlate the timing metrics in a single block per request. The metrics could have different levels, allowing us to skip or add metric calculation on the fly, similar to what we have in the logger.
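A minimal sketch of what such a level-aware, per-request metrics object could look like (all names here, e.g. `RequestMetrics` and `MetricLevel`, are hypothetical and not existing OpenSearch classes):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch only: a per-request metrics block with logger-style levels. */
public final class RequestMetrics {

    public enum MetricLevel { OFF, COARSE, FINE, TRACE }

    private final MetricLevel level;
    private final Map<String, Long> timingsNanos = new ConcurrentHashMap<>();

    public RequestMetrics(MetricLevel level) {
        this.level = level;
    }

    /** Record the elapsed time of a named step if the configured level allows it. */
    public void recordTiming(String step, long elapsedNanos, MetricLevel required) {
        if (required.ordinal() <= level.ordinal()) {
            timingsNanos.merge(step, elapsedNanos, Long::sum);
        }
    }

    /** Snapshot emitted as a single correlated block once the request completes. */
    public Map<String, Long> snapshot() {
        return Map.copyOf(timingsNanos);
    }
}
```

The object would travel with the request via ThreadContext, so every stage down the executor chain writes into the same block and the timings can be emitted together when the request finishes.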

Imagine that you add a new action within the bulk API chain and suddenly some requests start taking more time. One way to investigate is to add a stat for the new operation within bulk stats. But because stats are always averaged or use precomputed percentiles, it is really tricky to confirm whether the new operation is the culprit. If there were a single metrics block that let us correlate these metrics, it would be really simple to determine causation.

That covers metric generation; publishing can be implemented in a pluggable fashion to different sinks. We can provide a default implementation that writes to a metrics log file, with alternative sinks plugged in via metrics plugins.
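One possible shape for the pluggable publishing side (purely illustrative; `MetricsSink` and `LogFileMetricsSink` are made-up names, not existing extension points):

```java
import java.util.Map;

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

/** Hypothetical extension point that a metrics plugin could implement. */
public interface MetricsSink {
    /** Publish one completed per-request metrics block. */
    void publish(String requestId, Map<String, Long> timingsNanos);
}

/** Hypothetical default sink: append each block to a dedicated metrics log file. */
final class LogFileMetricsSink implements MetricsSink {
    private static final Logger LOGGER = LogManager.getLogger("metrics");

    @Override
    public void publish(String requestId, Map<String, Long> timingsNanos) {
        LOGGER.info("requestId={} timings={}", requestId, timingsNanos);
    }
}
```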

jkowall commented 3 years ago

Sounds a bit overengineered to me. I would implement OpenTelemetry, which can handle the metrics but also help with tracing (logging is coming soon). The problem with "metrics" in OpenSearch today is that they are exposed via JMX, which is highly inefficient compared with modern approaches of exposing data at an endpoint (as Prometheus does) so that metrics can be scraped and stored easily. Similarly, I am sure there are good reasons why you might want to do some tracing in OpenSearch too.
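For illustration, a rough sketch of recording a per-action latency histogram with the OpenTelemetry Java API (the meter and metric names here are made up, and an SDK with an exporter would still need to be configured separately):

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.DoubleHistogram;
import io.opentelemetry.api.metrics.Meter;

public class RestActionMetrics {
    // Meter name is illustrative only.
    private static final Meter METER = GlobalOpenTelemetry.getMeter("org.opensearch.rest");

    private static final DoubleHistogram LATENCY = METER
        .histogramBuilder("rest.action.latency")   // made-up metric name
        .setUnit("ms")
        .setDescription("Per-action REST latency")
        .build();

    /** Record one request's latency, tagged with the action name. */
    public static void record(String action, double latencyMillis) {
        LATENCY.record(latencyMillis, Attributes.of(AttributeKey.stringKey("action"), action));
    }
}
```

The histogram type is what gives percentiles and outlier visibility that plain averaged stats cannot, and the configured exporter (for example a Prometheus endpoint) decides how the data is scraped or pushed.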

itiyamas commented 3 years ago

+1 on OpenTelemetry and tracing!

OpenSearch does not provide hooks to collect detailed request-level metrics/tracing easily. Once we have those hooks, we can integrate OpenTelemetry for metric collection, tracing, etc. I don't think it is possible to add fine-grained metrics to the existing code with JMX, but I will check it out.

Bukhtawar commented 3 years ago

Breaking down the problem

  1. Per-action slow execution can also be traced via #1021.
  2. We need more stats to supplement the existing metrics across the various interaction layers, e.g. time spent in the queue, network round trips, IO time, blocked time, etc. These would be aggregated over requests spanning a time frame and easily queryable over an endpoint.
  3. Per-request-id latency breakdown and distributed tracing (one possible shape for such a breakdown is sketched below).
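To make points 2 and 3 concrete, one purely hypothetical shape for such a per-request breakdown (Java record syntax used for brevity):

```java
import java.util.Map;

/**
 * Hypothetical per-request latency breakdown (illustrative only).
 * For point 2 the phase durations would be aggregated into histograms over a
 * time window and exposed via an endpoint; for point 3 they would be kept
 * per request id so an individual request can be traced end to end.
 */
public record RequestLatencyBreakdown(
    String requestId,
    long queueNanos,             // time spent waiting in the thread pool queue
    long networkNanos,           // network round trips between nodes
    long ioNanos,                // disk / translog IO time
    long blockedNanos,           // time blocked on locks or throttling
    Map<String, Long> phaseNanos // any additional named phases
) {}
```
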
jkowall commented 3 years ago
  3. Per-request-id latency breakdown and distributed tracing

If you have ever tried to instrument Elasticsearch, you will have learned that this is a really bad idea, especially per request. It might be useful for debugging, but generally the data will make no sense. I have done this in the past.

itiyamas commented 3 years ago

@jkowall Why do you think request-based latencies are bad? Is it bad in general, just for OpenSearch, or just for cases where throughput is really high? I can think of cases where long-running requests would hold the metrics object in memory, leading to increased JVM memory usage, but that data is really small compared to the request data / partial response data itself. Memory circuit breaker calculations could take the metrics data into account and stop collecting metrics based on a user setting. Next comes performance degradation due to the extra work needed to collect and publish metrics. That is a trade-off any user can make by changing the instrumentation log level per logger instance. Additionally, if the throughput is really high, the system can provide an option to switch to in-memory aggregates or request sampling, but I won't prematurely optimize this, as the cost of skipping observability is pretty high too.
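A small sketch of the kind of gate described above, combining a per-logger instrumentation level with request sampling (the names are hypothetical):

```java
import java.util.concurrent.ThreadLocalRandom;

/**
 * Hypothetical guard deciding whether a given request collects detailed metrics:
 * gated by an instrumentation level and, for high-throughput clusters, by a
 * sampling ratio so only a fraction of requests pays the collection cost.
 */
final class MetricsSampler {
    private final boolean enabled;    // driven by the per-logger instrumentation level
    private final double sampleRatio; // e.g. 0.01 to collect for 1% of requests

    MetricsSampler(boolean enabled, double sampleRatio) {
        this.enabled = enabled;
        this.sampleRatio = sampleRatio;
    }

    boolean shouldCollect() {
        return enabled && ThreadLocalRandom.current().nextDouble() < sampleRatio;
    }
}
```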

The way we collect stats in OpenSearch gives us averages, which ends up hiding a lot of issues. Request-based latencies help us track outliers easily. Aggregating metrics early on leaves very little information to troubleshoot issues after they have already occurred. For the kind of clusters I deal with on a daily basis, this information is really important; often I end up enabling trace logging after the fact and then waiting for the issue to recur, which sometimes never happens, or the trace logs don't have enough information. Reading the trace logs is pretty tedious at that point, and doing it across requests is a nightmare.

CEHENKLE commented 2 years ago

Heya @itiyamas, what are you thinking the next steps will be for this?

jkowall commented 2 years ago

My challenge to your suggestion is that tracing Elasticsearch is very difficult. When you install auto-instrumentation and collect traces, they will make no sense at all. I have done this with several tools and the data was useless. Additionally, the overhead of the instrumentation hurt performance.

If you want to collect metrics or response data, that would be more reasonable. We actually already have something similar that @AmiStrn worked on, building on the profiler that already exists.

reta commented 2 years ago

@jkowall @itiyamas @Bukhtawar trying to summarize the discussion points and possible future developments on the subject of metrics / tracing / stats:

  1. Metrics. The existing OpenSearch stats APIs provide a lot of insight into the internals of each node and of the cluster in general, but the stats lack the semantics of rich metric types (e.g. histograms), so it is difficult to catch outliers. Integration with OpenTelemetry is an option (alternatively, Micrometer / Dropwizard Metrics / ...).
  2. Distributed Tracing. OpenSearch has limited support for correlation ID propagation (X-Opaque-Id), which works only with some clients. The ID is propagated up to the task level, but the breakdown is not captured or persisted anywhere for analysis or visualization. Integration with OpenTracing is an option (auto-instrumentation is out of the picture; manual instrumentation / context propagation is required).
  3. JFR. The OpenSearch server could be instrumented to emit context-specific JFR events (a minimal event sketch follows below). The overhead of JFR instrumentation is very low (<3%), but the kinds and types of events are subject to discussion. JFR events are local to the node but could be streamed to a centralized location as well.
  4. Stats improvements (a large chunk of improvements is included in https://github.com/opensearch-project/OpenSearch/issues/1042).
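As an illustration of point 3, a minimal custom JFR event (the event name and fields are made up; `begin()`, `end()`, `shouldCommit()`, and `commit()` are the standard `jdk.jfr.Event` lifecycle methods):

```java
import jdk.jfr.Event;
import jdk.jfr.Label;
import jdk.jfr.Name;

/** Illustrative context-specific JFR event; name and fields are hypothetical. */
@Name("org.opensearch.BulkShardOperation")
@Label("Bulk Shard Operation")
public class BulkShardOperationEvent extends Event {

    @Label("Index")
    String index;

    @Label("Document Count")
    int docCount;
}

// Sketch of how an instrumented code path would use it:
//
//   BulkShardOperationEvent event = new BulkShardOperationEvent();
//   event.begin();
//   // ... perform the bulk shard operation ...
//   event.end();
//   if (event.shouldCommit()) {          // respects the JFR recording settings
//       event.index = indexName;         // hypothetical local variables
//       event.docCount = documentCount;
//       event.commit();
//   }
```

Events recorded this way land in the node-local JFR recording and can be inspected with JDK Mission Control or streamed off the node.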

Does it make sense to create RFCs for Metrics / Tracing / JFR and at least run some experiments to understand a) how useful that would be and b) how difficult that would be?

Thoughts?

jkowall commented 2 years ago

@reta OpenTracing is deprecated; it should be OpenTelemetry if anything. But yes, I agree that auto-instrumentation is not a good idea, and manually adding the code could add overhead depending on where in the code you instrument.

I agree that focusing on Metrics and Stats is a better approach. @AmiStrn was working on this earlier in the project, but we switched to other work when we realized that the governance for OpenSearch was not going to include companies outside of AWS. When this changes, we might contribute core features to make the project better in general.