open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector

[loadbalancingexporter] Not properly batching service traces #13826

Open crobertson-conga opened 2 years ago

crobertson-conga commented 2 years ago

Describe the bug The new loadbalancingexporter option for grouping traces by service name sends the whole block of traces to every endpoint instead of splitting the set so that each endpoint only receives the traces routed to it.

Steps to reproduce Use the new routing_key: service option to start splitting traces by service. Have at least two receiving collectors. In the receiving collectors, use a resource detection processor to augment the trace payload so you can see which collector received each trace (a sketch of one receiving collector is shown below).
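
For reference, here is a minimal sketch of what one of the receiving collectors could look like. It uses the plain resource processor instead of resource detection to keep the example small; the collector.name attribute, its value, and the logging exporter settings are made up for illustration:

    # Hypothetical receiving collector "collector-a"; the second instance would set
    # collector.name: collector-b so you can tell the two apart in the output.
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317

    processors:
      resource:
        attributes:
          - key: collector.name
            value: collector-a
            action: insert

    exporters:
      logging:
        loglevel: debug   # prints resources/spans so you can see which collector got them

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [resource]
          exporters: [logging]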

What did you expect to see? All traces for a given service name should be delivered to the same receiving collector.

What did you see instead? Traces for a given service name went to both receiving collectors.

What version did you use? 0.59.0

What config did you use?

      loadbalancing/spanmetrics:
        routing_key: service
        protocol:
          otlp:
            tls:
              insecure: true
        resolver:
          dns:
            hostname: <some_k8s_service_to_target_collectors>
            port: 4317
            interval: 1m

Environment Doesn't matter


crobertson-conga commented 2 years ago

@aishyandapalli this is an FYI: I think your new feature from https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/12421 has a bug in it. It seems to stem from https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/b30f3e9e5242b8f60839a05057bea9317e1caf25/exporter/loadbalancingexporter/trace_exporter.go#L121 consuming all traces instead of just the ones associated with the routing key.

crobertson-conga commented 2 years ago

Actually, I'm not sure that's the problem. I set up batching with a max size of one, and all of my span metrics collectors are still getting signals across all services (screenshot: Screen Shot 2022-09-01 at 7 10 15 PM).

      batch/one: # super inefficient data-wise, but it looks like the loadbalancing exporter doesn't split properly
        send_batch_size: 1
        send_batch_max_size: 1

I have a resource processor on the span metrics collector that annotates the incoming traces, hence the aggregator dimension.

The span metrics collector receives data forwarded from the loadbalancing exporter.

[Edge collectors] -> [Main central collector] -> [Spanmetrics collector(s)]
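
To spell out the wiring on the central collector (the pipeline name is assumed; the component names match the snippets above):

    service:
      pipelines:
        traces/spanmetrics:
          receivers: [otlp]               # traces from the edge collectors
          processors: [batch/one]
          exporters: [loadbalancing/spanmetrics]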

crobertson-conga commented 2 years ago

Some more testing leads me to believe it may be due to forcibly closed gRPC connections making the load balancer move to the next available instance. I will close this if that turns out to be the case.
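
For reference, the kind of receiver-side setting I mean is something like this (values are made up): a max_connection_age forces clients to reconnect on that interval, which can push the exporter onto a different backend mid-stream.

    receivers:
      otlp:
        protocols:
          grpc:
            keepalive:
              server_parameters:
                max_connection_age: 1m         # hypothetical value
                max_connection_age_grace: 10s  # hypothetical value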

crobertson-conga commented 2 years ago

Okay, so this was due to my configuration, which was interrupting the gRPC connection regularly. Sorry.

crobertson-conga commented 2 years ago

Okay, after removing my batch size of one, the issue reappeared. I had two problems: one is resolved by not allowing connections to terminate artificially. The other is that when traces are in a batch with multiple service names, they get sent to all target collectors by the loadbalancing exporter.

This leads me to believe the original issue is correct: all the spans are being sent to every endpoint regardless of the actual service.
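
For anyone reproducing this: an ordinary upstream batch processor, e.g. with roughly the default settings shown below, is enough to produce batches that mix service names, which then get fanned out whole:

    processors:
      batch:
        timeout: 200ms          # default values, shown only for illustration
        send_batch_size: 8192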

github-actions[bot] commented 1 year ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.