open-telemetry / opentelemetry-collector

Don't retry on permanent errors #5260

Closed jvilhuber closed 2 years ago

jvilhuber commented 2 years ago

Describe the bug: Some errors from Tempo are considered 'final' and shouldn't be retried. For example, TRACE_TOO_LARGE will never succeed no matter how many times we retry.

Steps to reproduce: Send a trace that is too large.

What did you expect to see? I've seen other errors flagged as "not retryable" (I can't remember the exact message). I would expect at least TRACE_TOO_LARGE errors not to be retried. The difficult question is likely how to identify which errors are permanent.

What did you see instead? {"level":"error","ts":1650951011.5039303,"caller":"exporterhelper/queued_retry.go:149","msg":"Exporting failed. Try enabling retry_on_failure config option.","kind":"exporter","name":"otlp","error":"Permanent error: rpc error: code = FailedPrecondition desc = TRACE_TOO_LARGE: max size of trace (5000000) exceeded while adding 199127 bytes to trace c20ec8bb377c8bc6984a8088f3ddd87d","stacktrace":"go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send\n\tgo.opentelemetry.io/collector@v0.47.0/exporter/exporterhelper/queued_retry.go:149\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*tracesExporterWithObservability).send\n\tgo.opentelemetry.io/collector@v0.47.0/exporter/exporterhelper/traces.go:135\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1\n\tgo.opentelemetry.io/collector@v0.47.0/exporter/exporterhelper/queued_retry_inmemory.go:118\ngo.opentelemetry.io/collector/exporter/exporterhelper/internal.consumerFunc.consume\n\tgo.opentelemetry.io/collector@v0.47.0/exporter/exporterhelper/internal/bounded_memory_queue.go:99\ngo.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).StartConsumers.func2\n\tgo.opentelemetry.io/collector@v0.47.0/exporter/exporterhelper/internal/bounded_memory_queue.go:78"}

What version did you use? The otel-collector 0.47.0 Docker image.

What config did you use?

    "exporters":
      "otlp":
        "endpoint": "XXX:443"
        "retry_on_failure":
          "enabled": false
          "initial_interval": "5s"
          "max_elapsed_time": "150s"
          "max_interval": "15s"
        "sending_queue":
          "enabled": true
          "num_consumers": 10
          "queue_size": 3000
        "timeout": "5s"
    "extensions":
      "health_check": {}
      "memory_ballast":
        "size_in_percentage": 50
      "pprof":
        "endpoint": ":1777"
      "zpages": {}
    "processors":
      "attributes":
        "actions":
        - "action": "insert"
          "key": "datacenter"
          "value": "XXX"
      "batch":
        "send_batch_max_size": "0"
        "send_batch_size": "8192"
        "timeout": "200ms"
      "memory_limiter":
        "check_interval": "5s"
        "limit_percentage": 100
        "spike_limit_percentage": 10
      "probabilistic_sampler":
        "hash_seed": 8349990
        "sampling_percentage": 100
    "receivers":
      "jaeger":
        "protocols":
          "grpc": {}
          "thrift_binary": {}
          "thrift_compact": {}
          "thrift_http": {}
      "opencensus":
        "endpoint": "0.0.0.0:55678"
      "otlp":
        "protocols":
          "grpc": {}
          "http": {}
      "zipkin": {}
    "service":
      "extensions":
      - "health_check"
      - "pprof"
      - "zpages"
      "pipelines":
        "traces/1":
          "exporters":
          - "otlp"
          "processors":
          - "memory_limiter"
          - "probabilistic_sampler"
          - "batch"
          - "attributes"
          "receivers":
          - "otlp"
          - "jaeger"
          - "zipkin"
          - "opencensus"
      "telemetry":
        "logs":
          "encoding": "json"
          "level": "info"

Environment: Kubernetes, Docker image 0.47.0

Additional context

bogdandrutu commented 2 years ago

This is not a bug, at most a misleading error message.

You get the expected behavior: your exporter fails to export and the configuration disables retries (retry_on_failure::enabled: false), so you hit this code https://github.com/open-telemetry/opentelemetry-collector/blob/v0.47.0/exporter/exporterhelper/queued_retry.go#L149 where we print the expected message.

In this case we indeed do not check whether the error is retryable or not; we simply print this generic message.
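
For illustration, here is a minimal, standalone sketch of the "permanent error" idea being discussed: the exporter wraps failures that can never succeed on retry so later code can recognize them, while the disabled-retry path just logs the generic message. The names below (permanentError, markPermanent, isPermanent) are hypothetical and not the collector's actual API; the real helpers live in the collector's consumererror package.

    package main

    import (
        "errors"
        "fmt"
    )

    // permanentError marks an error that will never succeed on retry.
    // (Hypothetical type; the collector has its own wrapper in consumererror.)
    type permanentError struct{ err error }

    func (p permanentError) Error() string { return "Permanent error: " + p.err.Error() }
    func (p permanentError) Unwrap() error { return p.err }

    // markPermanent wraps err so isPermanent can recognize it later.
    func markPermanent(err error) error { return permanentError{err: err} }

    // isPermanent reports whether err (or anything it wraps) is permanent.
    func isPermanent(err error) bool {
        var p permanentError
        return errors.As(err, &p)
    }

    func main() {
        exportErr := markPermanent(errors.New("rpc error: code = FailedPrecondition desc = TRACE_TOO_LARGE"))

        retryEnabled := false
        if !retryEnabled {
            // With retry_on_failure disabled, the error is logged with the generic
            // "Try enabling retry_on_failure" hint whether or not it is permanent --
            // the misleading message described in this issue.
            fmt.Println("Exporting failed. Try enabling retry_on_failure config option:", exportErr)
            return
        }
        if isPermanent(exportErr) {
            fmt.Println("dropping data, not retrying:", exportErr)
        }
    }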

jvilhuber commented 2 years ago

But what happens when retry is enabled (I actually forgot I had turned it off in the config for various temporary reasons)? I assume it will try to retry, but again, in this case that would be pointless.

bogdandrutu commented 2 years ago

When retry is enabled, we check the error to see whether it is retryable or not.

bogdandrutu commented 2 years ago

See the logic https://github.com/open-telemetry/opentelemetry-collector/blob/150c1ede2f7fb5607035d3c7a10bfcab61c39afe/exporter/otlpexporter/otlp.go#L165
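
For anyone landing here later: the gist of that logic is to map the gRPC status code of the export error onto "retryable" vs. "permanent". The standalone sketch below is a rough approximation, not the collector's exact code; the set of retryable codes is an assumption based on common gRPC practice. The point for this issue is that FailedPrecondition (which Tempo uses for TRACE_TOO_LARGE) falls on the permanent side, so with retry enabled such an error is dropped immediately rather than retried until max_elapsed_time.

    package main

    import (
        "fmt"

        "google.golang.org/grpc/codes"
        "google.golang.org/grpc/status"
    )

    // retryableCodes lists gRPC status codes that are generally worth retrying.
    // Assumption: the exact set used by the collector may differ.
    var retryableCodes = map[codes.Code]bool{
        codes.Canceled:          true,
        codes.DeadlineExceeded:  true,
        codes.Aborted:           true,
        codes.OutOfRange:        true,
        codes.Unavailable:       true,
        codes.DataLoss:          true,
        codes.ResourceExhausted: true,
    }

    // shouldRetry reports whether an OTLP/gRPC export error is worth retrying.
    func shouldRetry(err error) bool {
        st, ok := status.FromError(err)
        if !ok {
            // Not a gRPC status error; treat it as permanent here.
            return false
        }
        return retryableCodes[st.Code()]
    }

    func main() {
        // FailedPrecondition is how Tempo reports TRACE_TOO_LARGE,
        // so it is classified as permanent and never retried.
        err := status.Error(codes.FailedPrecondition, "TRACE_TOO_LARGE: max size of trace exceeded")
        fmt.Println("retryable:", shouldRetry(err)) // retryable: false
    }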

jvilhuber commented 2 years ago

Thanks!