[RFC] Distributed Tracing

Gaganjuneja commented 1 year ago

Is your feature request related to a problem? Please describe. Feature Request #1061

Describe the solution you'd like

Problem Statement

OpenSearch doesn’t have the capabilities to trace the request end to end with tasks level breakdown. It provides some limited support for X-Opaque-Id where clients can pass this id and the same will be returned in the response, it is also being used in the deprecation logger, etc. For deeper analysis and debugging, OpenSearch requires an extensive tracing to locate the resource utilization and the code paths that are hot or consuming more resources.

Tenets

Minimal Overhead – Tracing shouldn’t add more than 1% of overhead on system resources like CPU and memory. We should prefer minimal tracing to reduce the overhead.
No performance impact – Tracing shouldn’t add any performance impact to cluster operations.
Extensible – Tracing framework should be easily extendable for background tasks, plugins and extensions.
Safety - Framework should ensure that there is no memory leak in the instrumentation code.
Well defined Abstractions – Tracing frameworks should have well defined abstractions so that in case of implementation framework upgrades, API contract changes, implementing classes shouldn’t be impacted.
Flexible – It should have the mechanism to enable/disable tracings at different granularity such as levels, components (search, indexing, etc.), etc through dynamic settings.
Meaningful/Purposeful – Framework should have the mechanism to avoid exploitation of tracing.
Clear boundaries – Framework should define the clear boundaries of instrumentation generation and sinks.

Tracing Framework

It’s a well-known fact that tracing comes with a cost and high throughput systems like OpenSearch this cost may be huge. So, we need to be very cautious while designing the tracing framework for OpenSearch. Tracing framework will provide the abstractions, governance, utilities, context propagation techniques/guidance, sampling options. So that developers/users can simply use the tracing framework and emit the traces. Let’s discuss these points in detail –

Abstractions – We can use tracing frameworks such as Open Telemetry. But we will abstract out the tracing solution behind the OpenSearch APIs so that in the future we need not change the core OpenSearch code where the actual instrumentation is being done. We would be very cautious defining the abstractions so that we shouldn’t be doing the over engineering as well.
Sampling – We will provide the support for both head and tail based samplings, it will be configurable or decided based on the need. Framework would provide the sampling implementations, which can be configured at a component level. Like sampling requirements for search requests could be different from the indexing requests and background jobs.
1. Head Based - Head-based sampling collects and stores trace data randomly. Typically, head-based sampling happens within the agent responsible for collecting trace telemetry by randomly selecting which traces to sample for analysis. We can customize the selection logic to sample based on the request type/query type etc. Downside of this sampling technique is that we can miss out on the slow requests.
2. Tail Based - Tail-based sampling collects all information about that trace when it’s complete but stores the samples only. This is helpful in reducing the load to data stores and even pressure on in-memory data sources. This gives us the opportunity to evaluate the trace before dropping and can be a great use for our use cases. Let’s say if we can configure the minimum threshold and traces taking time lower than that can be filtered out.
Governance
1. Framework will provide the dynamic tracing enable/disable options at different levels like components, levels, entire framework, etc.
2. It will also define the ways instrumentation can be done such as task level, threadpool level, Listeners Support, etc.
3. Management of tracing objects.
4. Tracing Levels – Like logging levels, we will introduce the tracing levels so that in case of issues we can disable/enable the tracing based on levels. i.e. Calculating CPU/Memory consumption by a method is a heavy operation so we should be able to enable this as and when needed. It will also help developers/users to think about the need and exact level of their tracing so we shouldn’t be bloating the system with unnecessary traces.
Code pollution – With all this ease of instrumentation, it's highly likely that it will get exploited for unnecessary tracing with lots of boilerplate code.
1. Avoid extensive tracing
  1. Component Level Management – At each component level we need to define the tracing hooks so that we avoid polluting the actual code where core logic is written. For example, instrumenting search requests we can start tracing from the parent task and pass the context till SearchOperationListener to instrument the major search phase entry/exit points. There will also be a provision to get the open tracing spans by some deterministic id and then do the more granule tracing where ever needed.
  2. To highlight the explicit tracing we can come up with some annotation like @EnableTracing so that code reviewers can make sure that it’s absolutely needed. We can add a plugin in the code build that any tracing without this annotation will lead to the build failure.
2. Boilerplate code - Framework will make sure that users only need to write just 1 or 2 lines of recording code like logger to reduce the boilerplate code. We can come up with easy to use annotations but runtime annotations might be a little costly.

High Level Design

There are 3 major components of tracing framework -

Tracing APIs – Tracing APIs will provide the well-defined tracing interfaces/APIs. It will implement most of the heavy lifting or boilerplate code and users will just be able to make a simple call to record the trace. Tracing APIs will clearly abstract out the implementation and users need not to be worried about the underlying framework (maybe OpenTelemetry) or api version changes. APIs should be able to take care of the most of attributes that need to be captured and level them according to their resource footprint.
Bounded In-memory Buffer - All the traces will be recorded at a parent span level in memory and will be flushed out periodically to the sink. (OPEN – how can we limit the memory usage for the tracing solution)
Sink - All the flushed traces will be written to any configurable telemetry data store of the user's choice. Sink should also be able to sample the traces before writing to the data store. Sink will lose the traces in case there is some backlog getting built up to reduce the pressure on in-memory buffers. Users can select between the following 2 types of Sinks.
1. Co-located Sink – This type of sink will run in the same JVM (core OS) and keep on writing to the telemetry data store. Users will not have to maintain any additional components but may need to provide some additional resource for co-located sinks to run.
2. Sidecar sink – This type of sink will run as a separate process on the same node. Tracing framework will push the traces to the sidecar sink through gRPC calls and the sidecar sink will write the data to the telemetry data store.

We will provide the support for both and use these in combinations. Like may be for search queries head based make more sense where we can filter out the traces for same query and tail based make sense for background tasks etc.

Next Steps

Decide on the telemetry framework, OpenTelemetery looks the natural choice but need to deep dive and see if make sense for system like OpenSearch.
Prototyping the trace creation and context propagation along the request.
Sampling approaches.
Sink/Exporter approach.
Levels of tracing

Describe alternatives you've considered A clear and concise description of any alternative solutions or features you've considered.

Additional context Add any other context or screenshots about the feature request here.

Gaganjuneja commented 1 year ago

@elfisher @muralikpbhat @reta @nknize @dblock @Bukhtawar @shwetathareja @backslasht

reta commented 1 year ago

Thanks @Gaganjuneja , certainly +1 to this initiative:

Decide on the telemetry framework, OpenTelemetery looks the natural choice but need to deep dive and see if make sense for system like OpenSearch.

OpenTelemetery makes a lot of sense (from my perspective).

Sink/Exporter approach.

AFAIK the OpenTelemetery has a selection of reporters (at least for Java) to decide how the traces should be reported (dumped into logs, sent over network, ...). I think we should provide the configuration options here but not trying come up with own implementation (at least, initially).

Sampling approaches.

One of the options I would suggest is to have an explicit request level setting (trace: true | false), it could be a query string / request payload / header. That would help to trace on demand (fe using curl or alike) easily

The few areas we probably need to cover are:

testing: provide the necessary test scaffolding to test tracing is working as expected
clients instrumentation (that would need to include the whole fleet of clients, not only Java)
plugins awareness: the plugins should be aware of the tracing context (for a number of reasons)
extensions awareness: the extensions should be aware of the tracing context (for a number of reasons)

Two of the most important design questions (for me) are to think about are:

the tracing context propagation across thread boundaries taking into account the future where the Project Loom is not in preview anymore
filling the gaps in Apache Lucene, there are places where threads / pools are used internally, without easy ways to hook in

Bukhtawar commented 1 year ago

+1 On plugin and extensions integration support. I feel security is an important aspect when we are dealing with thread context. Let's take a tenet on making the framework more secure. The RFC is still too high level to comment anything on how the actual interfaces or component level interactions might look like. I would atleast prefer to see those high level component interactions, what parts are pluggable, where could the instrumentation hooks be plumbed. Also that there are plans to move indexing/search/metadata pieces to extensions how would the interactions looks like.

Co-located Sink – This type of sink will run in the same JVM (core OS) and keep on writing to the telemetry data store. Users will not have to maintain any additional components but may need to provide some additional resource for co-located sinks to run.

There might be isolation and security concerns around using the same JVMMP. But then that makes me think if this could be an extension in itself as extensions are agnostic to runtimes.

Sidecar sink – This type of sink will run as a separate process on the same node. Tracing framework will push the traces to the sidecar sink through gRPC calls and the sidecar sink will write the data to the telemetry data store.

Let's not couple this with gRPC or any other framework yet. I would prefer support for mechanisms as simple as a disk of tmpfs

Gaganjuneja commented 1 year ago

Thanks @reta and @Bukhtawar for your comments. Overall idea to put high level details was surfacing up the discussion. I am working on prototyping the end to end solution mainly around the context propagation across threads and across nodes. Also, looking at it from Plugins and extensions standpoint. I will share the details soon.

@Bukhtawar, For collector specifically will also publish the granular details, even while doing the deep dive realised that better to use tmpfs and then the collector to take over if needed.

dblock commented 1 year ago

This is a great proposal. I would love more information on use-cases that tracing will cover. Search is obvious. But I'd also like to be able to trace through changes such as number of shards.

Gaganjuneja commented 1 year ago

This is a great proposal. I would love more information on use-cases that tracing will cover. Search is obvious. But I'd also like to be able to trace through changes such as number of shards.

Yes @dblock, We should be able to trace through all the operations.

Gaganjuneja commented 1 year ago

@elfisher @muralikpbhat @reta @nknize @dblock @Bukhtawar @shwetathareja @backslasht I have updated the low level details in the here #7026. Please provide your inputs.

anirudha commented 1 year ago

OpenSearch doesn’t have the capabilities to trace the request end to end with tasks level breakdown.

How do you define a request? What is it in your definition?
What is a task , used in task level break down?
How does this relate to tracing feature available in opensearch observability?

@Gaganjuneja

Gaganjuneja commented 1 year ago

@anirudha, Thanks for your queries, I somehow missed replying. Please find below the answers, let me know if it doesn't make sense to you.

How do you define a request? What is it in your definition?

OpenSearch supports mainly two customer-facing requests: Search and Indexing. There are multiple other requests as well such as cluster operation, settings, internal operations etc.

What is a task , used in task level break down?

Task is a unit of work which can be executed independently. One request can be broken down to multiple tasks and finally the result will be collated from all these tasks. For example, One search request for a particular Index which contains 4 shards, will first execute the query phase for all the 4 shards with independent tasks. And once it's completed it will execute the fetch phase in multiple fetch tasks based on the query phase results. Finally the results from these fetch phase tasks will be collated and returned to the client as part of response. With tracing, we want to collect traces/spans from all the tasks and build the request level view. Please refer to this section for more details on request and tasks.

How does this relate to tracing feature available in opensearch observability?

Are you talking about Trace Analytics? Here we will be generating the traces from OpenSearch core which can be ingested to Trace Analytics for further analytics.

Gaganjuneja commented 1 year ago

@khushbr, Thanks for putting this up. #7352 describes the Sink component in detail. @Bukhtawar

YANG-DB commented 1 year ago

hi It looks similar to this Prometheus OpenSearch plugin for collecting performance info: https://github.com/aiven/prometheus-exporter-plugin-for-opensearch

YANG-DB commented 1 year ago

Hi @Gaganjuneja

Are you talking about Trace Analytics? Here we will be generating the traces from OpenSearch core which can be ingested to Trace Analytics for further analytics.

Are you referring to the OTEL base pipeline that has an OpenTelemetry receiver to collect Observability signals and ship them back to the OTEL pipeline ?

This is expressed and shown in our OpenSearch OTEL demo application

YANG-DB commented 1 year ago

Actually why should we not just contribute an OpenSearch receiver under the OpenTelemetry contrib repository - receiver folder ?

In addition we already are already working on a contribution to the OTEL contrib repository for the OpenSearch exporter element

opensearch-project / OpenSearch