spandex-project / spandex

A platform agnostic tracing library
MIT License

Support sampling #150

Closed: kamilkowalski closed this 5 months ago

kamilkowalski commented 1 year ago

Scope

This PR builds on #142 by @reachfh to:

  1. Implement support for managing the priority attribute (reading, updating, propagating)
  2. Switch from CircleCI to GitHub Actions supporting mix format, Credo, Dialyzer and Coveralls, together with efficient caching
  3. Update development and testing dependencies
  4. Fix issues detected by improved CI tools

I know this PR is quite big in scope, but since the code was already submitted this way, I've kept it in one PR. Happy to open a couple of smaller PRs if that's easier to review.

Approach

While the original approach was to set the default priority to nil (a change from the previous default of 1), I don't think there's a single value that fits all purposes. Datadog uses values from -1 to 2, while OpenTracing uses values greater than or equal to 0, with nil allowing the agent to fall back to its defaults.

Since defaulting to a static value like 1 or nil wouldn't fit all backends, I've introduced a Spandex.Adapter.default_priority/0 callback in the Adapter behaviour. The callback is invoked only when starting a new trace, which lets the adapter define the default priority for new traces while the priority of the parent trace is preserved when propagating across service boundaries.

This approach is also mostly backwards-compatible. Application developers will need to update spandex AND spandex_datadog together because of the behaviour change, but as long as spandex_datadog always sets a sensible default priority, the priority will never be nil for that adapter, so any code that relied on it being an integer will keep working.
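To make the shape of the change concrete, here is a rough sketch in Python rather than Elixir. The `Adapter`/`DatadogAdapter` classes and the `start_trace` function are hypothetical analogues of the Spandex behaviour, not the actual API:

```python
# Hypothetical analogue of the new Spandex.Adapter.default_priority/0
# callback: adapters supply a default priority that is consulted only
# when a brand-new trace is started.

class Adapter:
    def default_priority(self):
        """Stand-in for the Spandex.Adapter.default_priority/0 callback."""
        raise NotImplementedError

class DatadogAdapter(Adapter):
    def default_priority(self):
        # A sensible default (AUTO_KEEP) means the priority is never None
        # for this adapter, preserving backwards compatibility.
        return 1

def start_trace(adapter, parent_priority=None):
    # New trace: ask the adapter. Continued trace: keep the parent's priority.
    if parent_priority is None:
        return adapter.default_priority()
    return parent_priority
```

A new trace started with `DatadogAdapter` gets priority 1, while a trace continued across a service boundary keeps whatever priority its parent carried.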

Datadog's implementation

I've been testing the behavior of dd-trace-rb, and it seems the _sampling_priority_v1 metric is always sent for top-level spans (entry spans for a given service). With head-based sampling configured at the Agent level, the parameter is set to 0 when the trace was rejected or 1 when it should be kept (AUTO_REJECT and AUTO_KEEP respectively).

Agent configuration can always be overridden by application developers, either by setting sampling rules at the library level or by forcing traces to be dropped/kept via methods the library exposes. Either way, this application-level configuration sets the _sampling_priority_v1 param to -1 or 2 (USER_REJECT and USER_KEEP respectively).

The Agent then uses that value to determine whether the trace should be kept (values greater than 0) or dropped (values less than or equal to 0).
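The priority semantics above fit in a few lines. A Python sketch; the constant values match Datadog's documented priorities, while `agent_keeps` is just an illustrative name:

```python
# Datadog's four sampling priorities, as observed from dd-trace-rb.
USER_REJECT = -1  # developer forced the trace to be dropped
AUTO_REJECT = 0   # head-based sampling rejected the trace
AUTO_KEEP = 1     # head-based sampling kept the trace
USER_KEEP = 2     # developer forced the trace to be kept

def agent_keeps(priority):
    # The Agent keeps traces with priority > 0 and drops the rest.
    return priority > 0
```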

Additionally, top-level spans contain either a _dd.agent_psr or a _dd.rule_psr parameter describing the sampling rate that was used. The former is present when the trace has a priority of AUTO_REJECT or AUTO_KEEP, meaning the sample rate was determined by Agent configuration; the latter is present when the priority was set to USER_REJECT or USER_KEEP, meaning a sampling rule was defined at the application level.
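In other words, the metric key records who made the sampling decision. A small sketch of that mapping (`psr_metric_key` is a hypothetical helper name, not part of any library):

```python
def psr_metric_key(priority):
    # Agent-driven priorities (AUTO_REJECT = 0, AUTO_KEEP = 1) report the
    # Agent's sample rate; user-driven priorities (USER_REJECT = -1,
    # USER_KEEP = 2) report the rate from the application-level rule.
    if priority in (0, 1):
        return "_dd.agent_psr"
    return "_dd.rule_psr"
```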

Finally, whenever the library sends traces to the Datadog Agent, the Agent replies with a JSON payload containing a rate_by_service attribute that describes the sampling rate to apply per service and environment. The Agent calculates these values dynamically based on the target number of requests per second it's allowed to ingest, rate-limiting "chatty" services while allowing "quiet" services to send all spans. In the absence of application-level rules, the tracing library uses those values to perform sampling.
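A sketch of how a client could consume that feedback. The key format (`"service:<name>,env:<env>"`, with `"service:,env:"` as the catch-all default) matches the Agent responses I've seen; the random draw stands in for the deterministic trace-ID hash real clients use:

```python
import random

def sample_rate_for(rate_by_service, service, env):
    # Look up the Agent-supplied rate for this service/env pair, falling
    # back to the Agent's catch-all default, then to 1.0 (keep everything).
    key = f"service:{service},env:{env}"
    return rate_by_service.get(key, rate_by_service.get("service:,env:", 1.0))

def keep_trace(rate_by_service, service, env):
    # Illustrative only: real clients hash the trace ID so that the
    # decision is deterministic for a given trace.
    return random.random() < sample_rate_for(rate_by_service, service, env)

# Example payload shape from the Agent's reply.
rates = {"service:checkout,env:prod": 0.25, "service:,env:": 1.0}
```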

Since it's the SpandexDatadog.Adapter that sets the default priority and communicates with the Agent, the approach presented in this PR allows us to implement things in exactly the same way as dd-trace-rb.

kamilkowalski commented 5 months ago

Closing as there are no updates and I won't have time to address this anytime soon.