[Feature Request] Ability to (filter) trace the bulk/search requests with high latency or resource consumption

backslasht commented 8 months ago

Is your feature request related to a problem? Please describe

With the introduction of Request Tracing Framework (RTF) using OpenTelemetry (OTel), requests can be traced to identify various code paths/modules which take more time to execute. While this solves the intended problem, enabling tracing for all requests does have an overhead with respect to additional computation/CPU and memory. The recommended solution to reduce the overhead is to sample the requests, OTel supports two types of sampling techniques, i) head based probabilistic sampling and ii) tail sampling.

While tailing sampling is good, it still requires the additional computation and memory as the optimization is only applicable for the network data transfer. Head based probabilistic sampling may not capture the requests with high latency if their occurrence is small (1 out of 10K requests).

This leaves us with either sample everything with additional cost or not sample at all.

Describe the solution you'd like

Threshold based trace capture: Given that both the head and tail based sampling doesn't solve the problem in an efficient way, we need a new way to identify if a particular request is important during the request processing and then capture all the related span/traces. This can be achieved by configuring thresholds for different spans and a request is considered important only when one/more spans have breached the threshold configured for those span(s). For example, say if we have a span on particular aggregation code, we can setup a threshold search.aggregation.date_histogram.latency > 300ms, then if any request which takes more than 300ms can then be captured. The threshold can be configured as a dynamic setting similar to the way log levels are configured today. This is one thought process, would like to get community's inputs on this.

Related component

Other

Describe alternatives you've considered

i) Head based probabilistic sampling. ii) Tail sampling.

Additional context

No response

peternied commented 8 months ago

[Triage - attendees 1 2 3 4 5 6] @backslasht Thanks for filing this issue, this seems like it could be related to other efforts such as sandboxing requests #11061 as well as other overlap with audit logging systems. Look forward to seeing more details come out of this discussion.

Adding @reta for more thoughts

reta commented 8 months ago

@backslasht I believe tail based sampling is the best option in this case since we cannot make the decision upfront regarding latency

While tailing sampling is good, it still requires the additional computation and memory as the optimization is only applicable for the network data transf

The tail base sampling would be done outside of OpenSearch (on Otel exporter side), so we should be not concerned too much by additional computation and memory I believe

andrross commented 8 months ago

@backslasht I don't fully understand how the threshold-based solution would work. Don't you have to capture all spans since you don't know at the start that they will breach the configured threshold? It's only when they complete (or even all spans within a trace complete) that you know the threshold wasn't breached and they are safe to discard. I'm confused about how this is different from tail-based sampling in practice.

Gaganjuneja commented 8 months ago

I think the ask here is to have delayed sampling decision (with in the context of a live request) because

Head sampling happens at very early stage.
Tail sampling is difficult in our case because 1)it requires all the spans related to a single request to be collected on single node and 2) in order to take a tail sampling decision in such scenarios we will have to sample 100% requests from the Head Sampling.

We need to see how it can be accommodated in the otel framework.

reta commented 8 months ago

2. 1)it requires all the spans related to a single request to be collected on single node and

Single span collector node, just to clarify here

2) in order to take a tail sampling decision in such scenarios we will have to sample 100% requests from the Head Sampling.

No all but the requests of specific types (actions), just to clarify on that

backslasht commented 8 months ago

The tail base sampling would be done outside of OpenSearch (on Otel exporter side), so we should be not concerned too much by additional computation and memory I believe

@reta - Tail based sampling works if the number of requests are in hundreds per second, but it becomes a bottleneck when the count grows to tens of thousands of requests per second. It puts additional pressure on OpenSearch to collect, process and export. Also, if the sampling is done outside the OpenSearch cluster, then it brings in additional resource consumption w.r.t network I/O and compression.

I'm confused about how this is different from tail-based sampling in practice.

@andrross - As @reta pointed out, in tail based sampling, the decision is taken much later, mostly in external systems as they get to see the full view of the request (coordinator nodes + N data nodes). What I suggest is a variation of that, where the decision to capture is taken by one or more of the data nodes and communicated back to the coordinator so that the corresponding spans (coordinator spans and the spans of the relevant data nodes) are captured. Though this doesn't provide the full view of the request, it does provide view of the problematic parts (spans) for the request which will help in debugging the issue.

andrross commented 8 months ago

@backslasht Got it, this does seem like a variation of tail-based sampling where essentially each component can make an independent tail decision about whether its span should be captured. It also seems like any communication/coordination between the different components to ensure a coherent trace gets captured when one of the spans makes the decision to capture will be a challenge here (though to be honest I don't know OpenTelemetry well).

reta commented 8 months ago

What I suggest is a variation of that, where the decision to capture is taken by one or more of the data nodes and communicated back to the coordinator so that the corresponding spans (coordinator spans and the spans of the relevant data nodes) are captured.

@backslasht I think it is clear that in order to support feature like that, the complete state of the trace (spread across the cluster nodes) has to be kept somewhere for a duration of the request. In the view of your next comment ...

@reta - Tail based sampling works if the number of requests are in hundreds per second, but it becomes a bottleneck when the count grows to tens of thousands of requests per second.

... keeping the traces state of tens of thousands of the requests on OpenSearch side in case any of them may backfire is unsound design decision from the start (at least, to me).

OpenSearch does not do any processing over spans - it merely collects them and sends over the wire (in batches). The overhead of that could be measured and accounted for, but it is very lightweight. There are many large high volume systems out there that do use tracing at scale, it needs infrastructure obviously but that is different problem. At the end, the hit will be taken by the collector that has to accommodate tail sampling requirement - it make the system more reliable (if collector dies or needs scaling, no visible impact for users).

Tail/Adaptive sampling is difficult problem to solve, I think we have to stay realistic and explore the limits of the existing systems before making any statements regarding how they behave or may behave. To my knowledge, we have not done any of that yet.

backslasht commented 8 months ago

... keeping the traces state of tens of thousands of the requests on OpenSearch side in case any of them may backfire is unsound design decision from the start (at least, to me).

I guess I was not very clear in my previous comment. The suggestion is not to keep the traces of completed requests. As per the design today, the traces are kept in memory (in coordinator nodes) when the requests are further getting processed in data nodes and will be written to the wire when the response is sent back. The proposal is an optimization where in OpenSearch can decide if the trace needs to be sent to the wire or not based on its significance and the significance is determined by a certain rule.

OpenSearch does not do any processing over spans - it merely collects them and sends over the wire (in batches). The overhead of that could be measured and accounted for, but it is very lightweight. There are many large high volume systems out there that do use tracing at scale, it needs infrastructure obviously but that is different problem. At the end, the hit will be taken by the collector that has to accommodate tail sampling requirement - it make the system more reliable (if collector dies or needs scaling, no visible impact for users).

I agree, we don't have benchmarks on the impact of collecting traces at large scale. I can go create large volumes of requests and measure the impact, but open for suggestions if you any thoughts on large scale workloads?

reta commented 8 months ago

As per the design today, the traces are kept in memory (in coordinator nodes) when the requests are further getting processed in data nodes and will be written to the wire when the response is sent back.

This is not how it works today (if by traces we mean the tracing instrumentation), the trace spans are flushed as they are ended, nothing is kept in memory (besides the contextual details which is basically trace / span ids)

but open for suggestions if you any thoughts on large scale workloads?

We have benchmarking framework, we could simulate workloads.

Gaganjuneja commented 8 months ago

This is not how it works today (if by traces we mean the tracing instrumentation), the trace spans are flushed as they are ended, nothing is kept in memory (besides the contextual details which is basically trace / span ids)

Yes, but in most of the cases parent span ends post child is ended (except some async scenarios).

backslasht commented 8 months ago

Yes, but in most of the cases parent span ends post child is ended (except some async scenarios).

Thanks! exactly the point.

nishchay21 commented 7 months ago

@backslasht Thank you for filing the issue. I do see the point for this feature and did a quick load test to see how just creating lot of spans can impact the overall cluster performance. With the load test we could see there was a significant increase of the node network bandwidth where the number of spans generated are high. This does make a significant difference and can impact the workload and operations.

Following are the tests which I conducted:

Test Environment Setup:

3 data nodes behind a load balancer, each with 32 GB ram [16Gb JVM] and 8 cores
Load - Http_logs
Bulk size 5 documents [Kept less to simulate high load] 
100% sampling done on requests

When traces are disabled : Tested the cluster with spans disabled and with approximately 200K requests mostly comprising of bulk and few searches, could see that the network bandwidth utilization with the same was around 190mb/s which was significantly low when compared to load when spans were enabled and no issues were seen on the network as we see in the image below

When the traces were enabled: With spans enabled tested the cluster with similar load of 200K requests mostly comprising of bulk and could see that the network bandwidth utilization with the same was around 330mb/s which is approximately 1.7 time more than the optimal network load on the cluster.

To further check if the bandwidth utilization is getting affected/worse with the count of spans I could see with 328K requests it further increased and even at a point it was utilizing the full network bandwidth of the node and a drop in spans were observed during the time frame. Not just spans, could also see overall node health check getting stalled and causing the nodes to get dropped and join the cluster back again.

So to overcome such behavior and to be fault tolerant I do agree there is a merit of exporting limited spans. We can actually capture the child span if that is above a threshold and from there we can go ahead and capture the parent i.e a branch till the root span where it would have the details for all spans which have still not ended and are waiting for the request to be processed . I will soon come up with the overall proposal for the same.

reta commented 7 months ago

@nishchay21 there are tradeoff for every approach:

100% sampling rate would certainly impact high load clusters (AFAIK, we barely instrumented just a few flows in OpenSearch, I suspect with more instrumentation in place the impact would be higher), the price here is mostly network bandwidth
keeping track of the traces / spans and postponing the sampling decisions (till some tail condition) would not eliminate the need to keep the traces / spans but force us to store them in memory (very likely) which to me is much more impactful for normal cluster operations (not even talking about backpropagating the states of the traces / spans across nodes in the cluster)

To have an understanding what exactly you experimented which, could you please share:

which OTEL exporter was configured
which OTEL collector (or collectors) was deployed
where OTEL collector (or collectors) was deployed

Thank you.

nishchay21 commented 7 months ago

@reta I do agree that we need to keep 100% sampling rate for above solution as well which might consume some extra resources on the cluster. To put it out this would be more helpful where someone wants to enable the tail sampling on the cluster and they get the ability to trace only the spans which are anomalous. As tail sampling has it's own caveats of using high storage cost, extra processing cost and extra network bandwidth this solution will help to reduce the same and still get to those anomalous spans.

To Answer the questions:

I used the GRPC span exporter.
The Otel collector was deployed on the node itself and was pushing to an external location[S3 in this case]

reta commented 7 months ago

As tail sampling has it's own caveats of using high storage cost, extra processing cost and extra network bandwidth this solution will help to reduce the same and still get to those anomalous spans.

I think, judging from your reply, the reasoning about this problem is not correct: we don't need anonymous spans, we need traces that are outliers. Basically, we need the whole system view for such outliers, not just "out of the context" spans (this is why it is a hard problem).

PS: If we think a bit about the trade offs, I would like to refer to you the recently added SEGMENT based replication, where we traded computed to network bandwidth.

The Otel collector was deployed on the node itself and was pushing to an external location[S3 in this case]

So this basically is not a "valid" test - the collector contributes to the network consumption but it has to be excluded. Please test with collector deployed separately so we would only count the cost of exporting spans.

nishchay21 commented 7 months ago

@reta by anomalous spans I basically meant spans which are outliers itself. So just to explain further the proposal is to not just capture one single span which is outlier but also capture the spans from there to the above chain of parent [Only if parent is still open and not ended]. So to explain on this:

As seen in the picture above we have a hierarchy in generating the spans where a parent span can generate multiple child spans or might not generate a span if the request has ended. The approach we suggest is to check a span if that is above threshold or not. If the span is above a certain threshold i.e high resource or latency is seen on the span we will mark the span to be sampled and send the information to its parent span.
The thing to note here is that the parent span can be part of same or another node and we will send the sampling information with the response to the parent span.

Once the parent span receives the information about the child being sampled there are two possibilities as seen above:

1. Parent span is still recording - If the parent span is still recording we will have the span sampled as well and store the information for that span. Once the span is marked to be sampled we will send the same sampling information in the above hierarchy as seen below until we reach the root node. This way we will be able to sample the request to its root.

2. Parent Span has stopped recording - Another possibility is that the parent span has ended and not in recording phase as seen in below picture. In this case we will not have the parent wait to be sampled and just drop the parent information. As the parent is not waiting for the child span to complete we know that the overall request is not very much dependent on this span and there is no use to capture the same. So in this case we will just ignore the span and send the information to its parent span until the same hierarchy continues.

As we have already discussed above that keeping all the traces or spans in memory for long can cause lot of memory usage so we will not do the same in this approach and just let the parent span get dropped if that is not marked to be sampled and ended before the child span.

The main advantages of this approach:

We will be able to sample all the requests having discrepancy[hot].
Only limited samples will be generated.
Storage utilization where the samples are stored will be less.

Adding further on the test involved so I basically ran a custom collector on the node which is just reading the traces and pushing out. So it is just simple traces exported out this will not involve any other data from the collector itself. Ideally I believe this will mimic or emulate the same behavior. If required, we can test on having the collector outside the node as well.

reta commented 7 months ago

@nishchay21 this sounds very complicated to me and I honestly don't understand why we need to build that:

The approach we suggest is to check a span if that is above threshold or not. If the span is above a certain threshold i.e high resource or latency is seen on the span we will mark the span to be sampled and send the information to its parent span.

This decision could be only made upon completion and will not work with async scenarios when parent span may be ended well before the child (the timing will be reconciled at the query time). The OpenSearch is using async heavily everywhere.

The thing to note here is that the parent span can be part of same or another node and we will send the sampling information with the response to the parent span.

Tracing supposed to be optional instrumentation, it will now leak everywhere: responses, requests (we need to understand which node the span comes from, the node may die meanwhile loosing the most important part of the trace completely, ...).

We are struggling with instrumentation at the moment (it is very difficult to do right) yet we are looking for complex solution for the problems we don't have (to me), building tracing framework on top of another tracing framework.

reta commented 7 months ago

Have had an offline discussion with @Gaganjuneja abd @nishchay21 , here are the conclusions we ended up with:

Conduct the tests with standalone OTEL collector to understand its limits as well as network overhead caused by sending traces from OS nodes to collector
Since OTEL makes sampling decisions upfront, the suggested feature could not be implemented in existing telemetry-otel plugin, we need to rely on the tracing instrumentation which natively support this kind of sampling and, if such exists, another experimental telemetry plugin could be introduced
1. Most importantly, the OpenSearch core implementation should not be entailed with tracing related implementation beyond the API we currently provide.
2. We have to keep in mind how this feature will work with:
  - OpenSeatch clients (we will be providing tracing support at some point)
  - OpenSeatch extensions (we will be providing tracing support at some point)

nishchay21 commented 7 months ago

Hi @reta ,

Thank you for the offline discussion. For point 2 I have done some POC around getting the outlier spans within the Otel plugin itself and not polluting the Opensearch core. Will get back to you with the details on this soon.

nishchay21 commented 7 months ago

Hi @reta,

So here is what we plan to do for the detection of an outlier span :

We will have the framework decide if the span in an outlier or not. In the current case we will be taking the latency as one of the evaluation parameter. So if the latency i.e. difference between start-time and end-time of a span is above a certain threshold we will be marking the span as an outlier. Once the span is marked to be an outlier we will go ahead and add a attribute to the given span lets say “Sampled: true”
This decision of the outlier will then be propagated to the above hierarchy via this attribute and the parent will know about the sampled status of its child span.
Now if the parent is still in the recording phase we will go ahead and add the attribute to the parent and mark the parent to be sampled as well. If the parent is not recording then the information about this parent will be stored in event of its parent so that we don't loose on the info of not-recording spans
Once this is done we need to communicate the information back to another node as well. This communication between the nodes will be done by the header itself , which will know if the current request is sampled or not.
Once the header information is received we will follow the similar procedure as discussed above on the parent node to mark the current span hierarchy for sampling.

Note: This way we will not populate the OpenSearch core itself and have the minimal implementation within the plugin. Also just to add the memory consumption of this implementation will be equivalent to the memory consumption if we enable tail sampling.

reta commented 7 months ago

Note: This way we will not populate the OpenSearch core itself and have the minimal implementation within the plugin. Also just to add the memory consumption of this implementation will be equivalent to the memory consumption if we enable tail sampling.

I doubt that is going to work (since the initial trace could be initiated way outside of the context OpenSearch itself) but we have discussed that already (I hope am missing something). Looking forward to see implementation, thank you.

nishchay21 commented 7 months ago

@reta So, If the initial traces are generated from outside the context of Opensearch than we respect the client sampling decision itself and will not override with our decision [which holds true today as well]. This will act as a new sampler within the core itself and if the decision is taken by this sampler in core then only the feature would be applied.

opensearch-project / OpenSearch