tail sampling traces

rbtcollins commented 1 year ago

A note for the community

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Use Cases

We use OpenTelemetry to emit traces from our clusters. The volume of traces is quite high, but most traces don't offer value.

Some vendors like [Honeycomb](https://www.honeycomb.io/) have [tail samplers](https://docs.honeycomb.io/manage-data-volume/refinery/) that dramatically reduce the number of traces that need to be kept to provide a holistic view of the running service.

Being able to do this outside of proprietary vendor tooling would be great.

tl;dr: still see low-cardinality events, but reduce span egress and storage by 80% or more

Attempted Solutions

I looked but couldn't see anything in the docs about tail sampling.

However from an architecture perspective I'd expect something like:

source(otel) -> tailsampler w/ 5GB look-back -> sink(otel to vendor gRPC endpoint)

Proposal

No response

References

No response

Version

We haven't adopted Vector at this point.

zamazan4ik commented 1 year ago

Well, it could be implemented via a simple random function on a transformation step with VRL. Something like this (warning - pseudocode):

let value = random(100);
if (value < 10)
    pass_logs_to_sink();

But there is no such function in VRL yet; @jszwedko could probably help here.

rbtcollins commented 1 year ago

That's not what tail sampling means in the tracing domain. Head sampling can already do that kind of random sampling, keyed on the trace ID, and pass non-recording spans down the stack.
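For contrast, here is a minimal sketch of head sampling keyed on the trace ID. It is only an illustration: the function name `head_sample` and the use of SHA-256 are assumptions made for this example, not how any particular SDK does it. The point is that the decision depends only on the trace ID, so it can be made before any span has finished.

```python
import hashlib


def head_sample(trace_id: str, rate: float) -> bool:
    """Decide up front whether to record a trace, using only its ID.

    Hypothetical helper for illustration. `rate` is the fraction of
    traces to keep, between 0.0 and 1.0.
    """
    # Hash the trace ID so that every service seeing the same ID
    # reaches the same decision without coordinating.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    # Keep the trace if its bucket falls below the rate threshold.
    return bucket < int(rate * 2**64)
```

Because nothing about the trace's outcome (errors, latency) is known at this point, this cannot favour rare or failing traces, which is the gap tail sampling fills.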

Have a look at the Refinery docs for more details, but the core concept is signal boosting. For instance, you can build a tuple (operation, error-status) and then:

buffer all spans for a trace (maybe 5-10, or possibly tens of thousands of spans)
when some threshold is reached, or the root span is received (which should have errors rolled up to it), make a sampling decision
if the current sample target is higher than the number of traces sent, sample the trace
if the relative cardinality of any of the tuples in the trace is lower than the average sampled cardinality, sample the trace
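The steps above can be sketched roughly as follows. This is a single-process illustration under several assumptions, not Refinery's actual algorithm: `Span`, `TailSampler`, `target_traces`, and `max_spans_per_trace` are hypothetical names, the "sample target" is simplified to a fixed count with no time window, and "cardinality" is approximated by how often each (operation, error) tuple has appeared in traces kept so far.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    operation: str
    error: bool
    is_root: bool = False


class TailSampler:
    """Hypothetical in-memory tail sampler, for illustration only."""

    def __init__(self, target_traces: int, max_spans_per_trace: int = 10_000):
        self.target_traces = target_traces  # simplified fixed target
        self.max_spans = max_spans_per_trace
        self.buffers = defaultdict(list)    # trace_id -> buffered spans
        self.kept_keys = Counter()          # (operation, error) -> kept count
        self.kept_traces = 0

    def add(self, span: Span):
        """Buffer a span. Returns None while the trace is still buffering,
        the buffered spans if the trace is kept, or [] if it is dropped."""
        buf = self.buffers[span.trace_id]
        buf.append(span)
        # Decide only when the root span arrives or the buffer is full.
        if not (span.is_root or len(buf) >= self.max_spans):
            return None
        del self.buffers[span.trace_id]
        keys = {(s.operation, s.error) for s in buf}
        if self._should_keep(keys):
            self.kept_keys.update(keys)
            self.kept_traces += 1
            return buf
        return []

    def _should_keep(self, keys) -> bool:
        # Rule 1: still below the target, so keep the trace.
        if self.kept_traces < self.target_traces or not self.kept_keys:
            return True
        # Rule 2: keep if any tuple is rarer than the average kept tuple.
        average = sum(self.kept_keys.values()) / len(self.kept_keys)
        return any(self.kept_keys[k] < average for k in keys)
```

With `target_traces=1`, a first successful `GET /` trace is kept, a second identical one is dropped, and a later `GET /` trace that carries an error is kept because its tuple has not been seen among the kept traces.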

jszwedko commented 1 year ago

Thanks for opening this @rbtcollins !

We discussed this issue a bit today. It is something we think fits in the vision of Vector but is a heavy lift to add since currently Vector has no shared state between instances, which seems to be a requirement for this feature. For that reason, it's unlikely that we'll add this in the near future.

It would be easier to add a local-only tail sampling that only looks at traces received by a single Vector instance.

vectordotdev / vector