Open Gaganjuneja opened 1 year ago
@elfisher @muralikpbhat @reta @nknize @dblock @Bukhtawar @shwetathareja @backslasht
Thanks @Gaganjuneja, certainly +1 to this initiative:
Decide on the telemetry framework. OpenTelemetry looks like the natural choice, but we need to deep dive and see if it makes sense for a system like OpenSearch.
OpenTelemetry makes a lot of sense (from my perspective).
Sink/Exporter approach.
AFAIK OpenTelemetry has a selection of reporters (at least for Java) to decide how the traces should be reported (dumped into logs, sent over the network, ...). I think we should provide the configuration options here but not try to come up with our own implementation (at least, initially).
Sampling approaches.
One of the options I would suggest is to have an explicit request-level setting (trace: true | false); it could be a query string / request payload / header. That would make it easy to trace on demand (e.g. using curl or similar).
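A minimal sketch of how such an on-demand flag could be resolved. The `trace` query parameter and the `x-trace` header are hypothetical names chosen for illustration, not existing OpenSearch options, and the precedence (query string over header) is an assumption.

```python
from typing import Mapping, Optional

# Hypothetical names for illustration only; not existing OpenSearch settings.
TRACE_PARAM = "trace"
TRACE_HEADER = "x-trace"


def _parse_flag(value: Optional[str]) -> Optional[bool]:
    """Return True/False for 'true'/'false' (case-insensitive), else None."""
    if value is None:
        return None
    lowered = value.strip().lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    return None


def should_trace(
    query_params: Mapping[str, str],
    headers: Mapping[str, str],
    default: bool = False,
) -> bool:
    """Decide whether to trace a request; the query string wins over the header."""
    # Header names are case-insensitive, so normalize them before lookup.
    normalized_headers = {key.lower(): value for key, value in headers.items()}
    candidates = (
        query_params.get(TRACE_PARAM),
        normalized_headers.get(TRACE_HEADER),
    )
    for candidate in candidates:
        flag = _parse_flag(candidate)
        if flag is not None:
            return flag
    # No explicit, valid flag was given: fall back to the configured default.
    return default
```

With this shape, an explicit `trace=false` in the query string would suppress tracing even if a proxy had added the header, and unrecognized values fall through to the default.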
The few areas we probably need to cover are:
Two of the most important design questions (for me) to think about are:
+1 on plugin and extension integration support. I feel security is an important aspect when we are dealing with thread context. Let's add a tenet on making the framework more secure. The RFC is still too high level to comment on how the actual interfaces or component-level interactions might look. I would at least prefer to see those high-level component interactions, which parts are pluggable, and where the instrumentation hooks could be plumbed. Also, given that there are plans to move indexing/search/metadata pieces to extensions, what would the interactions look like?
- Co-located Sink – This type of sink will run in the same JVM (core OS) and keep on writing to the telemetry data store. Users will not have to maintain any additional components but may need to provide some additional resource for co-located sinks to run.
There might be isolation and security concerns around using the same JVM. But then that makes me think this could be an extension in itself, as extensions are agnostic to runtimes.
- Sidecar sink – This type of sink will run as a separate process on the same node. Tracing framework will push the traces to the sidecar sink through gRPC calls and the sidecar sink will write the data to the telemetry data store.
Let's not couple this with gRPC or any other framework yet. I would prefer support for mechanisms as simple as a disk or tmpfs.
Thanks @reta and @Bukhtawar for your comments. The idea behind sharing the high-level details was to start the discussion. I am working on prototyping the end-to-end solution, mainly around context propagation across threads and across nodes. I am also looking at it from the plugins and extensions standpoint. I will share the details soon.
@Bukhtawar, I will also publish the granular details for the collector specifically. While doing the deep dive, I realised that it is better to use tmpfs and let the collector take over if needed.
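As a rough illustration of the file-based hand-off discussed here, the sketch below appends finished spans as JSON lines to a file in a directory that could be a tmpfs mount, and shows how a separate collector might read them back. The class name, file name, and span fields are all hypothetical; nothing here reflects the actual design in the follow-up issues.

```python
import json
import os
from typing import Dict, List


class FileSpanSink:
    """Appends finished spans as JSON lines to a file in `directory`.

    `directory` could be a tmpfs mount (for example /dev/shm on Linux);
    that choice is an assumption of this sketch.
    """

    def __init__(self, directory: str, filename: str = "spans.jsonl") -> None:
        os.makedirs(directory, exist_ok=True)
        self.path = os.path.join(directory, filename)

    def write(self, span: Dict[str, object]) -> None:
        # One JSON document per line keeps the reader simple.
        with open(self.path, "a", encoding="utf-8") as handle:
            handle.write(json.dumps(span) + "\n")


def read_spans(path: str) -> List[Dict[str, object]]:
    """What a separate collector process might do: read the spans back."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

The point of the design is that the writer has no dependency on gRPC or on the collector being up; the file is the only contract between the two.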
This is a great proposal. I would love more information on use-cases that tracing will cover. Search is obvious. But I'd also like to be able to trace through changes such as number of shards.
Yes @dblock, We should be able to trace through all the operations.
@elfisher @muralikpbhat @reta @nknize @dblock @Bukhtawar @shwetathareja @backslasht I have added the low-level details in #7026. Please provide your inputs.
OpenSearch doesn’t have the capabilities to trace the request end to end with tasks level breakdown.
How do you define a request? What is it in your definition?
What is a task, as used in task-level breakdown?
How does this relate to tracing feature available in opensearch observability?
@Gaganjuneja
@anirudha, Thanks for your queries, I somehow missed replying. Please find below the answers, let me know if it doesn't make sense to you.
- How do you define a request? What is it in your definition?
OpenSearch supports mainly two customer-facing requests: Search and Indexing. There are multiple other requests as well such as cluster operation, settings, internal operations etc.
- What is a task, as used in task-level breakdown?
A task is a unit of work that can be executed independently. One request can be broken down into multiple tasks, and the final result is collated from all of these tasks. For example, a search request for an index with 4 shards will first execute the query phase on all 4 shards as independent tasks. Once that completes, it will execute the fetch phase in multiple fetch tasks based on the query phase results. Finally, the results from these fetch tasks will be collated and returned to the client as part of the response. With tracing, we want to collect traces/spans from all the tasks and build the request-level view. Please refer to this section for more details on request and tasks.
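To make the request/task relationship concrete, here is a small sketch that models the search example as spans: one root span for the request, one child span per query-phase task, and one child span per fetch-phase task. The `Span` fields and function names are hypothetical and only illustrate how a request-level view could be assembled from task-level spans.

```python
import itertools
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    trace_id: str
    span_id: int
    parent_id: Optional[int]  # None for the root (request-level) span
    name: str


def trace_search(trace_id: str, shard_count: int, shards_with_hits: List[int]) -> List[Span]:
    """Model one search request as a root span plus one span per task."""
    ids = itertools.count(1)
    root = Span(trace_id, next(ids), None, "search")
    spans = [root]
    # Query phase: one independent task per shard.
    for shard in range(shard_count):
        spans.append(Span(trace_id, next(ids), root.span_id, f"query[shard={shard}]"))
    # Fetch phase: only shards that produced hits in the query phase.
    for shard in shards_with_hits:
        spans.append(Span(trace_id, next(ids), root.span_id, f"fetch[shard={shard}]"))
    return spans


def children_by_parent(spans: List[Span]) -> Dict[int, List[str]]:
    """Group span names by parent id to rebuild the request-level view."""
    tree: Dict[int, List[str]] = {}
    for span in spans:
        if span.parent_id is not None:
            tree.setdefault(span.parent_id, []).append(span.name)
    return tree
```

For an index with 4 shards where two shards contribute hits, `trace_search("t1", 4, [0, 2])` yields 7 spans: the root, 4 query tasks, and 2 fetch tasks, all sharing the same trace id.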
- How does this relate to tracing feature available in opensearch observability?
Are you talking about Trace Analytics? Here we will be generating the traces from OpenSearch core which can be ingested to Trace Analytics for further analytics.
@khushbr, Thanks for putting this up. #7352 describes the Sink component in detail. @Bukhtawar
Hi, this looks similar to the Prometheus exporter plugin for OpenSearch, which collects performance info: https://github.com/aiven/prometheus-exporter-plugin-for-opensearch
Hi @Gaganjuneja
Are you talking about Trace Analytics? Here we will be generating the traces from OpenSearch core which can be ingested to Trace Analytics for further analytics.
Are you referring to the OTEL-based pipeline that has an OpenTelemetry receiver to collect Observability signals and ship them back to the OTEL pipeline?
This is expressed and shown in our OpenSearch OTEL demo application.
Actually, why should we not just contribute an OpenSearch receiver under the receiver folder of the OpenTelemetry contrib repository?
In addition, we are already working on a contribution to the OTEL contrib repository for the OpenSearch exporter element.
Is your feature request related to a problem? Please describe. Feature Request #1061
Describe the solution you'd like
Problem Statement
OpenSearch doesn’t have the capabilities to trace the request end to end with tasks level breakdown. It provides some limited support through X-Opaque-Id: clients can pass this id and it will be returned in the response; it is also used in the deprecation logger, etc. For deeper analysis and debugging, OpenSearch requires extensive tracing to locate resource utilization and the code paths that are hot or consuming more resources.
Tenets
Tracing Framework
It’s a well-known fact that tracing comes with a cost, and for high-throughput systems like OpenSearch this cost may be huge. So we need to be very cautious while designing the tracing framework for OpenSearch. The tracing framework will provide the abstractions, governance, utilities, context propagation techniques/guidance, and sampling options, so that developers/users can simply use the tracing framework and emit traces. Let’s discuss these points in detail:
High Level Design
There are 3 major components of tracing framework -
We will provide support for both and use them in combination. For example, head-based sampling may make more sense for search queries, where we can filter out traces for the same query, while tail-based sampling may make more sense for background tasks.
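The difference between the two sampling styles can be sketched as follows. Head-based sampling decides when the request starts, here from a hash of the trace id so every node reaches the same decision; tail-based sampling decides after the work has finished, here by keeping slow or failed traces. The function names, the `FinishedTrace` fields, and the keep criteria are assumptions for illustration, not part of the proposal.

```python
import hashlib
from dataclasses import dataclass


def head_sample(trace_id: str, ratio: float) -> bool:
    """Head-based: decide at request start, using only the trace id."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform over [0, 2**64)
    # Integer comparison avoids float rounding at the edges of the range.
    return bucket < int(ratio * 2**64)


@dataclass
class FinishedTrace:
    trace_id: str
    duration_ms: float
    had_error: bool


def tail_sample(trace: FinishedTrace, slow_threshold_ms: float) -> bool:
    """Tail-based: decide after completion, keeping slow or failed traces."""
    return trace.had_error or trace.duration_ms >= slow_threshold_ms
```

Head-based sampling is cheap because unsampled requests never record spans, while tail-based sampling has to buffer spans until the outcome is known, which is one reason the two may suit different workloads.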
Next Steps