vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

tail sampling traces #16929

Open rbtcollins opened 1 year ago

rbtcollins commented 1 year ago


Use Cases

We use OpenTelemetry to emit traces from our clusters. The volume of traces is quite high. However, most traces don't offer value.

Some vendors like [Honeycomb](https://www.honeycomb.io/) have [tail samplers](https://docs.honeycomb.io/manage-data-volume/refinery/) that dramatically reduce the number of traces that need to be kept to provide a holistic view of the running service.

Being able to do this outside of proprietary vendor tooling would be great.

tl;dr: still see low-cardinality events, but reduce span egress and storage by 80% or more

Attempted Solutions

I looked but couldn't see anything in the docs about tail sampling.

However from an architecture perspective I'd expect something like:

source(otel) -> tailsampler w/ 5GB look-back -> sink(otel to vendor gRPC endpoint)
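
To make the look-back stage in that pipeline concrete, here is a minimal sketch, assuming spans are buffered per trace ID and the least recently written traces are evicted once a byte budget is exceeded. The `LookbackBuffer` class, its method names, and the span representation are hypothetical and are not Vector APIs.

```python
from collections import OrderedDict


class LookbackBuffer:
    """Hypothetical byte-bounded buffer of spans grouped by trace ID."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        # trace_id -> list of (span, size); order tracks the most recent write
        self.traces = OrderedDict()

    def add(self, trace_id, span, size):
        """Buffer a span and return the traces evicted to stay within budget."""
        self.traces.setdefault(trace_id, []).append((span, size))
        self.traces.move_to_end(trace_id)  # mark this trace as freshly written
        self.used_bytes += size
        evicted = []
        # Evict the least recently written traces, never the one just written.
        while self.used_bytes > self.max_bytes and len(self.traces) > 1:
            old_id, spans = self.traces.popitem(last=False)
            self.used_bytes -= sum(s for _, s in spans)
            evicted.append((old_id, [sp for sp, _ in spans]))
        return evicted
```

In this sketch an evicted trace is the point at which a keep-or-drop decision would have to be made, because no more of its spans can be held back.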

Proposal

No response

References

No response

Version

We haven't adopted Vector at this point.

zamazan4ik commented 1 year ago

Well, it could be implemented with a simple random function in a transform step with VRL. Something like this (warning: pseudocode):

let value = random(100);
if (value < 10)
    pass_logs_to_sink();

But there is no such function in VRL yet. @jszwedko could probably help here.

rbtcollins commented 1 year ago

That's not what is implied by tail sampling in the tracing domain. Head sampling can do that random-based sampling on the trace ID and pass non-recording spans down into the stack.
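
For contrast, head sampling on the trace ID can be sketched as follows. This is only an illustration: the `head_sample` function and the choice of SHA-256 as the hash are assumptions, not what any particular SDK does. The point is that the decision depends only on the trace ID, so every service reaches the same verdict without having seen the rest of the trace.

```python
import hashlib


def head_sample(trace_id: str, keep_percent: int) -> bool:
    """Return the same keep/drop decision for every span of a trace."""
    # Hash the trace ID so the decision is stable across services and restarts.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash onto a bucket in the range 0..99.
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < keep_percent
```

Because the decision is made before the trace has finished, it cannot take into account whether the trace later turned out to be slow or to contain an error, which is the gap tail sampling fills.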

Have a look at the Refinery docs for more details, but the core concept is to signal boost. For instance, you can build a tuple (operation, error-status) and then:

jszwedko commented 1 year ago

Thanks for opening this, @rbtcollins!

We discussed this issue a bit today. It is something we think fits in the vision of Vector, but it is a heavy lift to add because Vector currently has no shared state between instances, which seems to be a requirement for this feature. For that reason, it's unlikely that we'll add this in the near future.

It would be easier to add a local-only tail sampling that only looks at traces received by a single Vector instance.
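
As a rough illustration of the local-only variant, the sketch below buffers spans per trace ID inside one process and decides once the root span arrives. Everything in it is hypothetical: the `LocalTailSampler` class, the span dictionary fields (`trace_id`, `parent_id`, `error`), and the policy of keeping every trace that contains an error plus a random baseline of the rest. It also assumes the root span is the last span of a trace to arrive; a real implementation would need a timeout for traces whose root never shows up.

```python
import random


class LocalTailSampler:
    """Hypothetical single-instance tail sampler that decides per whole trace."""

    def __init__(self, baseline_rate, rng=None):
        self.baseline_rate = baseline_rate  # fraction of non-error traces to keep
        self.rng = rng or random.Random()
        self.pending = {}  # trace_id -> spans buffered so far

    def observe(self, span):
        """Buffer a span; on the root span, return the kept spans or []."""
        spans = self.pending.setdefault(span["trace_id"], [])
        spans.append(span)
        if span.get("parent_id") is not None:
            return None  # trace still in flight, no decision yet
        # Assumption: the root span (no parent) arrives last, so the trace is complete.
        del self.pending[span["trace_id"]]
        has_error = any(s.get("error") for s in spans)
        if has_error or self.rng.random() < self.baseline_rate:
            return spans
        return []
```

Spans of the same trace that land on a different instance would never be seen by this sampler, which is the limitation that shared state between instances would address.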