open-telemetry / oteps

OpenTelemetry Enhancement Proposals
https://opentelemetry.io
Apache License 2.0
326 stars 157 forks source link

Threshold propagation in trace state #235

Closed kentquirk closed 5 months ago

kentquirk commented 11 months ago

This is intended to capture the proposal developed by the Sampling SIG.

oertl commented 10 months ago

Currently, everything is very weakly worded, because the text contains many SHALLs and SHOULDs. Shouldn't we be stricter and require that any compliant consistent sampler implementation MUST follow all the rules defined here?

kentquirk commented 10 months ago

@oertl According to this RFChttps://www.[ietf.org/rfc/rfc2119.txt](https://www.ietf.org/rfc/rfc2119.txt), MUST and SHALL are equivalent. I thought I had read a document that said to prefer SHALL, but now I can't find it, and I'm finding lots of other ones that prefer MUST, so yeah, I've changed it.

I've tried to address most of the comments.

jmacd commented 9 months ago

An illustration may help, here is one I've been using. OTel tracestate sampling v2 (6)

kentquirk commented 9 months ago

I've added Josh's image to the document and fixed lint errors. Are there further changes needed?

jmacd commented 7 months ago

For the record, this is not a stale OTEP. The Sampling SIG has met and discussed the merits of several alternatives and we expect this process to continue. We have some amount of uncertainty about how to reflect unknown sampling probability, zero sampling probability, and 100% sampling probability.

jmacd commented 7 months ago

@kentquirk I'm ready to go with the rejection threshold approach.

Rejection threshold pros:

Acceptance threshold pros:

I still find myself confused how to reason about a span with no t-value. This can result from a legacy always-on sampler in the SDK. This can also result from a sampling.priority attribute causing sampling when the consistent sampler would not, in which case again no t-value. Is there some attribute we can / should use to indicate the difference between a span that was always-on from an old SDK and a span that was rejected by a consistent sampler and nevertheless exported. In last week's SIG, we discussed using attributes, not tracestate, to convey such information. For example, when a span is selected by a periodic, non-probabilistic process (e.g., one-per-minute) and rejected by a consistent sampler, how should we avoid counting it again as one span from a legacy SDK w/ no t-value, later in the pipeline? I'm willing to say this doesn't really matter. Each application can choose how it wants to treat spans with no t-value.

kentquirk commented 7 months ago

Result of 7Dec23 Sampling SIG meeting: we have decided to convert this to a rejection threshold and I will make those edits this week.

kentquirk commented 5 months ago

I have attempted to update the document based on a rejection threshold model. This is really breaking my brain. It solves one problem but I'm finding it extremely hard to think about.

I am not at all convinced that the example beginning at line 24 is correct (and maybe we should even delete it, as I don't find it helpful), and I would very much appreciate if people would carefully think about whether I correctly fixed all the places where we had <, >, increase, decrease, etc.

kentquirk commented 5 months ago

I believe that this is now (finally!) ready to merge.

jmacd commented 5 months ago

@yurishkuro will you please review and consider approving? We have reached agreement in the Sampling SIG and have begun to press for approvals. Thank you.

jmacd commented 5 months ago

@bogdandrutu will you please review and consider approving? We have reached agreement in the Sampling SIG and have begun to press for approvals. Thank you.

jmacd commented 5 months ago

Thank you @kentquirk!