[Tracing Framework] RFC - Sampling Strategy

Gaganjuneja commented 1 year ago

What is Sampling?

Sampling is a statistical method employed to choose a subgroup that can effectively represent the entire population and be readily extrapolated.

Why do we need Sampling?

As we consider the instrumentation of generic constructs such as rest controllers, transport actions, and task managers, a notable issue arises. Our system encompasses more than 200 transport actions and a variety of rest actions, leading to the potential generation of numerous spans. However, each span incurs a non-negligible cost. Hence, it is crucial to develop a Sampling strategy that enables us to select a subset of spans from the total pool that effectively represents the entire population or system.

Different Sampling Options

There are two distinct sampling options available in OpenTelemetry.

Head Sampling: Head sampling takes place just before the creation of a span. Since the decision needs to be made as early as possible, it relies on arbitrary factors like randomly selecting a percentage of spans. While this is a straightforward technique to implement, it lacks request/trace-specific data, which may limit its ability to make intelligent decisions.
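
To make the head-sampling idea concrete, here is a minimal sketch of a probabilistic decision derived from the trace ID alone. It is written in the spirit of trace-ID ratio sampling but is not the OpenTelemetry implementation; the function name `should_sample` and the 64-bit threshold scheme are assumptions made for illustration.

```python
import random

MAX_64 = 2 ** 64


def should_sample(trace_id: int, rate: float) -> bool:
    """Head-sampling decision made from the trace ID and a fixed rate only."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be within [0.0, 1.0]")
    # Map the rate to a threshold in the 64-bit space and compare the low
    # 64 bits of the trace ID against it. The same trace ID always yields
    # the same decision, so no request-specific data is needed.
    threshold = int(rate * MAX_64)
    return (trace_id & (MAX_64 - 1)) < threshold


if __name__ == "__main__":
    rng = random.Random(7)
    ids = [rng.getrandbits(128) for _ in range(10_000)]
    kept = sum(should_sample(t, 0.1) for t in ids)
    print(f"kept {kept} of {len(ids)} traces")
```

Because the decision depends only on the trace ID and the rate, it is cheap, but it cannot favour slow or failing requests, which is the limitation described above.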

Tail Sampling: In contrast to head sampling, tail sampling makes its sampling decision at the end of the entire trace, when data from all spans is available. This type of sampling occurs at the collector level. Here, sampling can be based on various criteria such as latency, attribute values, status (e.g., error, success), etc. One major advantage is static stability, especially when we are measuring resource consumption. One challenge with this strategy is that it requires all spans of a trace to be sent to a single collector, which could be problematic for distributed systems like OpenSearch, where sending all spans to a specific data node might be challenging. In the following sections, we will explore different strategies related to these sampling methods.
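
As a rough illustration of a tail-side decision, the sketch below keeps a trace if any span failed or if the slowest span exceeds a latency threshold. The `Span` shape, the `keep_trace` name, and the 500 ms default are assumptions for illustration only, not part of any collector API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    name: str
    duration_ms: float
    status: str  # "ok" or "error"


def keep_trace(spans: List[Span], latency_threshold_ms: float = 500.0) -> bool:
    """Decide on a whole trace once all of its spans are available."""
    # Always keep traces that contain at least one failed span.
    if any(span.status == "error" for span in spans):
        return True
    # Otherwise keep only traces whose slowest span crosses the threshold.
    slowest = max((span.duration_ms for span in spans), default=0.0)
    return slowest >= latency_threshold_ms
```

The function only works if it sees every span of the trace, which is exactly why all spans need to reach the same collector.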

OpenSearch Sampling Requirements

To determine the most suitable sampling strategy, it is essential to consider the specific requirements and goals for implementing tracing in the system. Here are some common scenarios:

Debugging a Specific Request: When tracing is primarily intended for debugging a particular request, it is advisable to capture 100% of the data related to that specific request. This level of detailed tracing allows for a comprehensive examination of the request's behavior.
Debugging Failures: For debugging failures (4xx, 5xx, etc.), we need to sample 100% of the traffic at the head and later select spans based on attributes such as status or resource consumption during tail sampling.
Debugging System-Wide Issues: For debugging system-wide issues, capturing 10-50% of the traces is typically sufficient. This level of sampling enables the identification and understanding of failures, errors, and latencies across the entire system without overwhelming the tracing infrastructure.
Application Baseline: If the goal is to observe the application's normal behavior and ensure it is functioning as expected, a minimal sampling rate of 5-10% of the traces should be enough. This provides a representative sample of the application's regular performance.
Tracing Code Paths and Architecture: When the focus is on tracing overall code paths and understanding the system's architecture, a very minimal tracing approach should be adequate. In this case, a low sampling rate can still provide valuable insights without excessive data collection.

By carefully considering the specific use cases and objectives of tracing, a suitable sampling strategy can be chosen to strike the right balance between capturing enough data for analysis while minimizing the impact on performance and storage resources.

OpenSearch sampling strategy

For a distributed system like OpenSearch, an effective sampling strategy often combines both head and tail sampling techniques to achieve the desired tracing goals. Let's delve into the details of the proposed sampling strategy for OpenSearch:

Head Based OpenSearch sampling

OpenSearch, being a distributed system, contains numerous Transport and Rest actions, all of which need to be instrumented by default with a minimal sampling rate. However, specific critical codepaths, like Search and Indexing, could be sampled at a higher rate for more in-depth analysis.
The sampling rates should be configurable on-the-fly through settings, allowing flexibility in adjusting the sampling percentages based on requirements.
The strategy should enable the ability to disable tracing for certain transport actions, such as health checks, to reduce unnecessary overhead.
Sampling rates could be configured using the following schema that accommodates various use cases and performance considerations.

{
  "action_strategies": [
    {
      "action": "search_action",
      "type": "probabilistic",
      "param": 1.0
    },
    {
      "action": "bulk_action",
      "type": "probabilistic",
      "param": 1.0
    },
    {
      "action": "internal:coordination/fault_detection/follower_check",
      "type": "probabilistic",
      "param": 0.0
    },
    {
      "action": "internal:coordination/fault_detection/leader_check",
      "type": "probabilistic",
      "param": 0.0
    }
    }
  ],
  "default_strategy": {
    "type": "probabilistic",
    "param": 0.001
  }
}
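
A minimal sketch of how such a schema could be resolved into a per-action sampling rate is shown below. The helper name `build_rate_lookup` is hypothetical, and the embedded configuration is a trimmed copy of the schema above; unknown strategy types are simply ignored in this sketch.

```python
import json
from typing import Callable

CONFIG = json.loads("""
{
  "action_strategies": [
    {"action": "search_action", "type": "probabilistic", "param": 1.0},
    {"action": "internal:coordination/fault_detection/follower_check",
     "type": "probabilistic", "param": 0.0}
  ],
  "default_strategy": {"type": "probabilistic", "param": 0.001}
}
""")


def build_rate_lookup(config: dict) -> Callable[[str], float]:
    """Return a function mapping an action name to its sampling rate."""
    overrides = {
        strategy["action"]: float(strategy["param"])
        for strategy in config.get("action_strategies", [])
        if strategy.get("type") == "probabilistic"
    }
    default_rate = float(config["default_strategy"]["param"])

    def rate_for(action: str) -> float:
        # Fall back to the default strategy for actions without an override.
        return overrides.get(action, default_rate)

    return rate_for


if __name__ == "__main__":
    rate_for = build_rate_lookup(CONFIG)
    for name in ("search_action", "some_other_action"):
        print(name, rate_for(name))
```

A rate of 0.0 effectively disables tracing for an action, which covers the health-check case mentioned above.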

Tail Sampling

As most requests in OpenSearch return 200 OK responses and stay within Service Level Agreements (SLAs), not all of these requests need to be traced. Tail sampling becomes valuable in this context.

Tail sampling would involve sampling traces based on specific parameters to filter out traces that are less relevant for analysis, thereby conserving resources.
Implementing tail sampling can be challenging, as it requires sending all spans belonging to a single trace to a particular collector.

Options for Tail Sampling in OpenSearch:

1. Send Spans as Part of Response: Spans are sent as part of the response back to the coordinator node, which can then export these spans to the local collector. The overall impact on performance should be assessed through performance runs.

Pros:
- Simple to achieve.
- Uniform distribution of requests should ensure a similar number of spans per collector/coordinator.
Cons:
- Increased response size due to added span data.
- Managing the lifecycle of a span may introduce some complexity.

2. Export Span to Coordinator Node Always: A custom span exporter can be created to directly export spans from the OpenSearch core to the collector running on the coordinator node. This might require opening the gRPC port for internode communication.

Pros:
- Spans are not sent as part of the response, reducing response size.
- Easy to identify and propagate the coordinator IP from the request to child spans.
Cons:
- Harder to debug and monitor.
- Increased data network utilization for internode communication.

3. Export Span to Local Collector: Spans are initially exported to a local collector before being distributed to a single collector that handles all spans belonging to a particular trace. This could involve multiple levels of collectors.

Pros:
- Utilizes out-of-the-box capabilities from OpenTelemetry (otel).
- Distribution of spans occurs outside the OpenSearch process.
Cons:
- Every collector needs to be aware of the other nodes, introducing some complexity.
- Handling corner cases, such as node failures, requires careful consideration.
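
Whichever option is chosen, all spans of a trace have to land on the same collector. A minimal sketch of one way to pick that collector deterministically from the trace ID follows; the `collector_for` name and the host names are hypothetical, and plain modulo hashing is an assumption chosen for brevity.

```python
import hashlib
from typing import List


def collector_for(trace_id: str, collectors: List[str]) -> str:
    """Pick the collector responsible for every span of the given trace."""
    if not collectors:
        raise ValueError("at least one collector is required")
    # Use a stable hash (not Python's randomized hash()) so that every
    # node computes the same owner for the same trace ID.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(collectors)
    return collectors[index]


if __name__ == "__main__":
    nodes = ["collector-a:4317", "collector-b:4317", "collector-c:4317"]
    print(collector_for("4bf92f3577b34da6a3ce929d0e0e4736", nodes))
```

Note that with modulo hashing a change in the collector list remaps most traces, which is one of the node-failure corner cases that would need care.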

In conclusion, tail sampling in OpenSearch requires careful consideration as there is no clear winner among the options discussed. Option 2 and Option 3 are already configurable in OTel. Option 1 may introduce resource overhead and require a feasibility test. Looking forward to feedback from the community on these tail sampling options.

Other limiting factors

There are several other limiting factors and considerations when implementing tracing in a distributed system like OpenSearch:

Overuse of Spans: Each span comes with a cost, both in terms of performance overhead and storage requirements. Therefore, it's crucial to exercise caution when adding spans. Overusing tracing can lead to a significant impact on overall system performance and resource utilization.
Limiting Horizontally: Sampling techniques play a vital role in limiting the number of traces that are captured and recorded. By intelligently sampling requests and spans, the system can avoid an overwhelming amount of tracing data while still obtaining valuable insights.
Limiting Vertically with Levels: Implementing span levels can be an effective way to limit the number of spans per trace. By defining levels of detail, the system can control the depth and granularity of the tracing information collected for each request, allowing more focused analysis when needed.
Max Spans: Enforcing limits on the number of spans per unit of time (e.g., per minute) is another useful approach to prevent excessive tracing. Setting a maximum number of spans ensures that the tracing infrastructure doesn't get overwhelmed during peak usage periods. This may result in partial traces.
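
The max-spans limit could be sketched as a fixed-window counter like the one below. The class name `SpanRateLimiter` and its parameters are hypothetical; the clock is injectable only to make the sketch easy to test.

```python
import time
from typing import Callable


class SpanRateLimiter:
    """Allow at most max_spans spans per fixed time window."""

    def __init__(self, max_spans: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_spans = max_spans
        self._window_seconds = window_seconds
        self._clock = clock
        self._window_start = clock()
        self._count = 0

    def try_acquire(self) -> bool:
        now = self._clock()
        # Start a fresh window once the current one has elapsed.
        if now - self._window_start >= self._window_seconds:
            self._window_start = now
            self._count = 0
        if self._count < self._max_spans:
            self._count += 1
            return True
        # Over the limit: the caller drops this span, which is how
        # partial traces can appear.
        return False
```

Applying the limit per span rather than per trace is what produces partial traces; applying it only when a root span is created would avoid that at the cost of a coarser limit.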

By thoughtfully considering these limiting factors and incorporating the appropriate sampling techniques, level definitions, and span limits, the tracing implementation in OpenSearch can strike the right balance between capturing sufficient data for analysis and maintaining a performant and efficient distributed system. I am doing a POC with a couple of approaches and will update the results here.

Gaganjuneja commented 1 year ago

@reta @shwetathareja @Bukhtawar @suranjay your thoughts on this?

Gaganjuneja commented 1 year ago

@reta @backslasht @shwetathareja @Bukhtawar @suranjay reminder for your thoughts on this.

reta commented 1 year ago

@Gaganjuneja thanks a lot for the proposal (and my apologies for the delay)

I think it makes sense to start with Head Based OpenSearch sampling first, since this is the most straightforward way to get traces out and OpenTelemetry has good support for probabilistic sampling.

Sampling rates could be configured using the following schema that accommodates various use cases and performance considerations.

I would strongly -1 the per-action configuration for the following reasons (at least at this stage):

the initial request comes to the coordinator node, which fans out the request to the relevant nodes (bulk is a great example of that); if each node uses probabilistic sampling, it is going to be a complete mess
the tracer is configured on each node (this is a plugin, by and large), and maintaining such a complex configuration across a fleet of nodes is problematic
the action is materialized way too late in the request processing pipeline (so we would need a dedicated processor that needs to collect the trace and then discard it at some point)

Options for Tail Sampling in OpenSearch:

OTel has a Tail Sampling Processor [1] that consolidates traces on the processor side and makes the decision based on policies, have you seen it? It also covers the trace depth mentioned in the Other limiting factors

[1] https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/tailsamplingprocessor

On a general note, I would strongly advocate to make a decision for sampling strategies to output either a consistent, complete trace or no trace at all (no matter if it is head or tail sampling). Introducing any randomness and/or inconsistencies in the middle of the trace (like probabilistic span dropping, or the levels that we discussed previously) would render the feature unusable and confusing, since each request in general would produce a unique trace hierarchy.

Gaganjuneja commented 1 year ago

@reta Thanks for your reply.

On a general note, I would strongly advocate to make a decision for sampling strategies to output either a consistent, complete trace or no trace at all (no matter if it is head or tail sampling).

+1 on this and totally aligned.

I would strongly -1 the per-action configuration for the following reasons (at least at this stage):

the initial request comes to the coordinator node, which fans out the request to the relevant nodes (bulk is a great example of that); if each node uses probabilistic sampling, it is going to be a complete mess

Yes, so we would apply this sampling strategy to the root span only; all later spans would respect the parent's decision using parent-based sampling. It works across nodes as well, since the sampling decision is propagated with the trace context (trace flags).
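
As a rough sketch of the parent-based behaviour described here, the root span makes the sampling decision and every child simply inherits it. The `SpanContext` shape and the `start_span` signature are assumptions for illustration, not the OpenTelemetry API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class SpanContext:
    trace_id: int
    sampled: bool


def start_span(parent: Optional[SpanContext],
               new_trace_id: int,
               root_sampler: Callable[[int], bool]) -> SpanContext:
    """Create a span context, deciding on sampling only at the root."""
    if parent is not None:
        # Child span (possibly on another node): respect the parent's
        # decision so the trace is either complete or absent.
        return SpanContext(parent.trace_id, parent.sampled)
    # Root span: this is the only place where the sampler is consulted.
    return SpanContext(new_trace_id, root_sampler(new_trace_id))
```

For example, a root created with a sampler that returns `False` yields children with `sampled=False` on every node, even if a node-local sampler would have said otherwise.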

the tracer is configured on each node (this is a plugin, by and large), and maintaining such a complex configuration across a fleet of nodes is problematic

I think most of the other actions should follow the default strategy, except client-facing actions like bulk and search. Only a minimal number of actions should require separate configuration.

the action is materialized way too late in the request processing pipeline (so we would need a dedicated processor that needs to collect the trace and then discard it at some point)

Action based sampling would be applicable only for the root span and action could be anything like TransportAction or RestAction. We can change the name to operation if this is misleading.

OTel has a Tail Sampling Processor [1] that consolidates traces on the processor side and makes the decision based on policies, have you seen it? It also covers the trace depth mentioned in the Other limiting factors

Yes, but as the requirement is to have all the spans related to a trace on the same collector for the effective sampling decisions, I have proposed the above approaches to consolidate the spans to a single collector. Thereafter Tail Sampling Processor will take over.

reta commented 1 year ago

Thanks @Gaganjuneja

Action based sampling would be applicable only for the root span and action could be anything like TransportAction or RestAction. We can change the name to operation if this is misleading.

This is impossible to decide on the root level; as I mentioned, the action name is materialized very late in the flow (we have a whole network layer before it that will have to be instrumented). Let us make it work first with the OTel sampling, then we could make any other decisions.

Yes, but as the requirement is to have all the spans related to a trace on the same collector for the effective sampling decisions, I have proposed the above approaches to consolidate the spans to a single collector. Thereafter Tail Sampling Processor will take over.

We are not building the distributed tracing infrastructure (I hope); our goal is instrumentation (and we are just at the beginning of it). If we have a solution (like https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/tailsamplingprocessor) that solves the problem (that we don't need to solve), why do we need to invent our own?

shwetathareja commented 1 year ago

Thanks @Gaganjuneja for the proposal on the Sampling. +1 on starting with head based Sampling and I tend to agree with @reta that managing per-action configuration would be a pain. But consider how the Request Tracing Framework would be used in production: most of the time the user would be debugging an issue with either the indexing or the search workload, not both. And in a large cluster, enabling it for all flows would generate too many spans due to high TPS and could cause performance overhead. Hence, basic filtering of index vs search APIs would be useful otherwise it would be hard to debug live issues in production with this framework.

reta commented 1 year ago

Hence, basic filtering of index vs search APIs would be useful otherwise it would be hard to debug live issues in production with this framework.

Thanks @shwetathareja , this is very easy to solve with tracing on demand, where for every search or indexing request in question the user could add a trace=true query / request parameter.

Gaganjuneja commented 1 year ago

Thanks @reta & @shwetathareja for providing your valuable comments.

I concur that on-demand sampling would be effective for debugging requests. This approach is reactive, but certain use cases necessitate proactive data collection. Examples include establishing monitoring through traces and metrics, identifying performance bottlenecks, root cause analysis, and generating a cluster insights dashboard that highlights top resource-consuming queries. For such scenarios, we must sample 100% of traces using head-based sampling and subsequently make intelligent decisions through tail-based sampling.

Regarding configuration maintenance, I agree that the proposed action-based schema is not sustainable. Instead, we should define appropriate default limits and allow for potential overrides using AffixSettings, similar to what we do for loggers.

After reviewing the code, I also realized that the action materializes quite late. Can we do sampling based on the URI once the request has landed in the RequestHandler (both HTTP and Netty4)?

In terms of a generic implementation, we can consider sampling based on the value of the "operation" or some attribute if the Span is a root span and that attribute is defined in the settings. Otherwise, the default limit should apply, while also taking into account the "trace=true" setting.

reta commented 1 year ago

I concur that on-demand sampling would be effective for debugging requests.

It is very effective when you have pinpointed the limited set of requests that are anomalies (usually coming out of adaptive/tail sampling based on criteria).

After reviewing the code, I also realized that the action materializes quite late. Can we do sampling based on the URI once the request has landed in the RequestHandler (both HTTP and Netty4)?

This is still too late, the trace would have been started already. But again, here adaptive sampling (tail based sampling) would help 100% - the action name / URI / etc could be used to make a decision to trace or not.

Instead, we should define appropriate default limits and allow for potential overrides using AffixSettings, similar to what we do for loggers.

What kind of limits are you referring to?

Gaganjuneja commented 1 year ago

@reta & @shwetathareja, I am thinking of covering the following scenarios in the head sampling.

1. Blanket Probabilistic Sampling rate - We can use the OTel probabilistic sampler. The rate should be configurable through telemetry settings.
2. Override for on demand - We will override the blanket probabilistic sampler for requests whose header has the "trace=true" attribute set.
3. Override for action/operation through settings - We can provide one more option for users to override the blanket probabilistic sampling for certain actions. They should be able to configure the action and sampling rate through a logger-like affix setting, e.g. "trace.action-name=0.2". This will be needed for some internal actions, like background jobs, health checks, etc.
4. Custom processor - We would also require a custom processor which should be able to discard the parent spans in case the current span is not sampled. This is a bit tricky and requires a lot of assumptions and scenarios to be considered. I will post more on this separately.

In all three scenarios above, these limits and overrides will be taken into account while creating the root span, and the child spans that follow will respect the parent's sampling decision. Your thoughts?
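
To illustrate how the first three scenarios could compose at the root span, here is a minimal sketch. The function names, the "trace." settings prefix handling, and the precedence order (on-demand first, then action override, then blanket rate) are assumptions for illustration; the action name used in the example is also only illustrative.

```python
from typing import Dict, Mapping, Optional


def parse_action_overrides(settings: Mapping[str, str],
                           prefix: str = "trace.") -> Dict[str, float]:
    """Collect per-action rates from affix-style settings such as
    'trace.<action-name>=0.2'. The prefix is a hypothetical choice."""
    overrides = {}
    for key, value in settings.items():
        if key.startswith(prefix):
            overrides[key[len(prefix):]] = float(value)
    return overrides


def root_sampling_rate(headers: Mapping[str, str],
                       action: Optional[str],
                       blanket_rate: float,
                       action_overrides: Mapping[str, float]) -> float:
    """Resolve the sampling rate to use when creating a root span."""
    # On-demand tracing wins over everything else.
    if headers.get("trace", "").lower() == "true":
        return 1.0
    # An explicit per-action override comes next, if the action is known.
    if action is not None and action in action_overrides:
        return action_overrides[action]
    # Otherwise fall back to the blanket probabilistic rate.
    return blanket_rate


if __name__ == "__main__":
    overrides = parse_action_overrides({"trace.cluster:monitor/health": "0.0"})
    print(root_sampling_rate({}, "cluster:monitor/health", 0.01, overrides))
    print(root_sampling_rate({"trace": "true"}, None, 0.01, overrides))
```

Child spans would not call this function at all; they would inherit the decision made for the root.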

reta commented 1 year ago

thanks @Gaganjuneja , I would suggest to focus on 1 & 2 now and keep others (3 & 4) as possible future improvements. The issue is that we have absolutely zero instrumentation now, once we have it, we could think about the gaps (in sampling, etc) to make it better.

Gaganjuneja commented 1 year ago

thanks @Gaganjuneja , I would suggest to focus on 1 & 2 now and keep others (3 & 4) as possible future improvements. The issue is that we have absolutely zero instrumentation now, once we have it, we could think about the gaps (in sampling, etc) to make it better.

Agreed. I had listed these items in priority order. Instrumentation PR coming your way next week 😊
