open-telemetry / opentelemetry-java

OpenTelemetry Java SDK
https://opentelemetry.io
Apache License 2.0
1.96k stars 806 forks source link

JaegerRemoteSampler inconsistent Implementation #5504

Open rotscher opened 1 year ago

rotscher commented 1 year ago

The implementation of SDK Jaeger Remote Sampler Extension is not according to the description of the example of Jaeger Remote Sampler.

{
  "service_strategies": [
    {
      "service": "foobar",
      "type": "ratelimiting",
      "param": 2
    }
  ],
  "default_strategy": {
    "type": "probabilistic",
    "param": 0.2,
    "operation_strategies": [
      {
        "operation": "/metrics",
        "type": "probabilistic",
        "param": 0.0
      }
    ]
  }
}

Having the remote sampling configuration above I would expect service foobar to be rate limited to 2 and all other services to be sampled with probability 0.2. As an exception operation /metrics of all services (including foobar) never gets sampled.

With the current implementation only rate limiting is configured for service foobar which is also applied to /metrics operation.

Also when operation_strategies are configured for a service then the default type of the service is always "probabilistic" even when "ratelimiting" is configured.

I've used the following libraries and versions:

And activated the autoconfiguration with export OTEL_TRACES_SAMPLER=parentbased_jaeger_remote

jack-berg commented 1 year ago

@pavolloffay can you comment on this?

kuujis commented 6 months ago

@jack-berg @jkwatson - I would like to try solving this if thats ok :)

pavolloffay commented 6 months ago

correct it should be fixed in OTEL SDK

kuujis commented 5 months ago

correct it should be fixed in OTEL SDK

I was debugging this over the weekend and to me it seems the issue is with jaeger not returning probablistic operation sampling strategies if the service configuration is ratelimiting. In theory it would be possible to call Jaeger twice, get strategies for a given service, then get defaults and merge them before starting to use them, but to me it looks like a mistake in Jaeger backend and this "calling jaeger twice" does not feel right.

kuujis commented 5 months ago

Just a small update - after updating jaeger to return the operation level default probablistic sampling strategies for ratelimiting service level sampling strategy configurations - the issue can't be reproduced anymore. So I assume this could be closed once the PR to jaeger gets accepted OR if it does not - I'll make a PR to otel with double jaeger call and merge of the results (I still feel a bit dirty about this option though)

kuujis commented 4 months ago

This should work out of the box with next release of Jaeger, provided flag sampling.strategies.bugfix-5270 is used when launching it. Since the fix might result in unexpected impact to the Jaeger users - this will become default behavior only in 1.57 version. See https://github.com/jaegertracing/jaeger/issues/5270 for more details.