open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0
3.1k stars 2.39k forks source link

[processor/tailsampling] add low-frequency spans sampling policy #36487

Open larsn777 opened 18 hours ago

larsn777 commented 18 hours ago

Description

Hello, We occasionally encounter the need to sample low-frequency spans. However, as I understand, this type of sampling has not yet been implemented in the tail sampling module. This PR implements a new sampling policy that will allow us to sample low-frequency spans with a specified level of accuracy.

Main approaches and implementation details:

  1. Span Identifier. To identify the uniqueness or frequency of a specific span within a given set, we need to select a span identifier. I chose the pair of service name + span name as the identifier. In the future, the choice of fields for this identifier could be made configurable within the sampling policy.
  2. Count-Min Sketch (CMS). To count occurrences of unique spans in a data stream, we need to select an appropriate data structure. A naive solution might be a simple hash table, using the span identifier as the key and an occurrence counter as the value. However, with a high variability of span identifiers, such a table could become too large. As an alternative, I chose the Count-Min Sketch data structure, which generally reduces the memory required for counting span identifiers. Additionally, users can adjust the configuration parameters of this policy to find a trade-off between counting accuracy and resource usage.
  3. Sliding CMS Window. Counting low-frequency spans should happen within a limited time frame (e.g., an hour, half-hour, or minute), as otherwise, we would consume significant resources and likely lose accuracy with CMS over time. A simple approach would be to create a new CMS structure at the beginning of each interval, populate it with data, and then start over with a fresh CMS at the beginning of the next interval. However, this would reset counters at each interval start, leading to a surge of sampled spans at the beginning of each interval. To avoid this, we can use a sliding CMS window. The entire interval (window) is divided into N equal sub-intervals, each with its own CMS instance. As the window slides forward by one sub-interval, we delete the CMS instance for the earliest sub-interval and create a new one. To get the count of span occurrences, we sum occurrences across all sub-intervals and add the new occurrence to the CMS for the latest sub-interval. This approach significantly smooths out the number of spans sampled at interval boundaries.
  4. Limits on Processing and Sampling. As mentioned in the previous point, the sliding window smooths out spikes in sampled spans at interval boundaries, but it doesn’t completely eliminate them, as a large portion of unique spans will likely fall into the first sub-interval, causing minor spikes at sub-interval boundaries. To manage boundary values for sub-intervals, we can set a limit on the number of spans processed and sampled per second.

Testing

Almost all the new code is covered by unit tests.

Documentation

A brief description of the new policy has been added to the module's README. Additionally, an example configuration has been included in the README.