open-telemetry / opentelemetry-cpp-contrib

https://opentelemetry.io/
Apache License 2.0
121 stars 130 forks source link

nginx: design question: Why does the sampler default to parent_based=false? #331

Open dgoffredo opened 9 months ago

dgoffredo commented 9 months ago

Hi,

I'm writing an OpenTelemetry TracerProvider implementation that is a wrapper around Datadog's tracing library. My intention is to eventually add support for that provider in this project's instrumentation/nginx module.

I notice that in the sampling configuration, sampler.parent_based defaults to false, which means that while nginx will extract trace context from an incoming request, it will by default not honor the sampling decision included in that context. Instead it will make its own sampling decision (which defaults to "always on").

Why was this chosen as the default? In the hypothetical Datadog case, I'd prefer parent_based to default to true.

@johanneswuerbach, the author of the configurable sampler.

trajano commented 7 months ago

In my case I found it the other way by default with Java instrumentation which is parent_based_always_on which I actually set to parent_based_always_off in my service.

The advantage of parent_based_always_off is it allows me to drop the whole span if the parent trace is filtered using a filter in OpenTelemetry Collector like this

  filter/grafana_noise:
    error_mode: ignore
    traces:
      span:
        - (resource.attributes["service.name"] == "grafana") and (attributes["http.url"] == "/grafana/metrics") and (attributes["http.method"] == "GET")
        - (resource.attributes["service.name"] == "grafana") and (name == "open session") and (attributes["transaction"] == false)

  filter/drop_actuator:
    traces:
      span:
        - (attributes["http.method"] != nil) and (attributes["http.method"] == "GET") and (attributes["http.route"] == "/actuator/prometheus")
        - (attributes["http.method"] != nil) and (attributes["http.method"] == "GET") and (attributes["http.route"] == "/actuator/health")
        - (attributes["http.url"] == "/grafana/metrics") and (attributes["http.method"] == "GET")
        - (attributes["http.url"] == "/metrics") and (attributes["http.method"] == "GET")

When the traefik span that starts the whole process gets dropped the whole trace itself gets dropped. Primarily I want to remove the /metrics polling and /health checks

So IMHO I think parent_based_always_off is a more reasonable default and hope that is the case in Java instrumentation.

However... I would rather the default be consistent across all tooling so what you've flagged here is an inconsistency between instrumentation libraries.

dgoffredo commented 7 months ago

I don't know what parent_based_always_on and parent_based_always_off mean, but I would assume they mean "use the sampling decision extracted from the parent, or otherwise default to (on|off)."

If my assumption is correct, then that means that the (Java, in your example) tracer would always honor a sampling decision extracted from the traceparent request header.

The behavior in the nginx module, though, is that it will not extract the sampling decision from traceparent by default. I was asking why that is the default.

trajano commented 7 months ago

Yeah I think they should make these consistent across instrumentation. So in my case I think both parent_based_always_on (which is my default) and parent_based_always_off is equivalent to sampler.parent_based=true and the line before that

OtelSamplerType type = OtelSamplerAlwaysOn;

will be equivalent to always_on So the equivalent default for your instrumentation with the Java instrumentation is always_on .
Like I said earlier hopefully they streamline the defaults so those switching from one language to the another won't get dinged by assuming the default.