open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

googlecloud monitoring exporter drops data for transient failures: "Exporting failed. Dropping data" #31033

Closed: nielm closed this issue 3 months ago

nielm commented 7 months ago

Component(s)

exporter/googlecloud

What happened?

Description

When the Google Cloud Monitoring exporter fails to export metrics to Google Cloud Monitoring, it drops the data. This occurs even for transient errors where the attempt should be retried.

Steps to Reproduce

Configure collector, export demo metrics.

Expected Result

Metrics are reliably exported to Google Cloud Monitoring

Actual Result

Metrics are dropped for transient errors (such as "Authentication unavailable" -- when the auth cookie expires and needs to be refreshed).

Collector version

0.93.0

Environment information

Environment

GKE

OpenTelemetry Collector configuration

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  resourcedetection:
    detectors: [gcp]
    timeout: 10s
    override: false

  k8sattributes:
  k8sattributes/2:
      auth_type: "serviceAccount"
      passthrough: false
      extract:
        metadata:
          - k8s.pod.name
          - k8s.namespace.name
          - k8s.container.name
        labels:
          - tag_name: app.label.component
            key: app.kubernetes.io/component
            from: pod
      pod_association:
        - sources:
            - from: resource_attribute
              name: k8s.pod.ip
        - sources:
            - from: connection

  batch:
    # batch metrics before sending to reduce API usage
    send_batch_max_size: 200
    send_batch_size: 200
    timeout: 5s

  memory_limiter:
    # drop metrics if memory usage gets too high
    check_interval: 1s
    limit_percentage: 65
    spike_limit_percentage: 20

exporters:
  debug:
    verbosity: basic
  googlecloud:
    metric:
      instrumentation_library_labels: false
      service_resource_labels: false

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [k8sattributes, batch, memory_limiter, resourcedetection]
      exporters: [googlecloud]

Log output

2024-02-02T19:49:38.434Z    error   exporterhelper/common.go:95 Exporting failed. Dropping data.    {"kind": "exporter", "data_type": "metrics", "name": "googlecloud", "error": "rpc error: code = Aborted desc = Errors during metric descriptor creation: {(metric: workload.googleapis.com/cloudspannerecosystem/autoscaler/scaler/scaling-failed, error: Too many concurrent edits to the project configuration. Please try again.)}.", "dropped_items": 4}

2024-02-02T20:24:44.897Z    error   exporterhelper/common.go:95 Exporting failed. Dropping data.    {"kind": "exporter", "data_type": "metrics", "name": "googlecloud", "error": "rpc error: code = DeadlineExceeded desc = context deadline exceeded", "dropped_items": 12}

2024-02-05T07:43:53.416Z    error   exporterhelper/common.go:95 Exporting failed. Dropping data.    {"kind": "exporter", "data_type": "metrics", "name": "googlecloud", "error": "rpc error: code = Unavailable desc = Authentication backend unavailable.", "dropped_items": 17}

Additional context

No response

github-actions[bot] commented 7 months ago

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

dashpole commented 7 months ago

Unfortunately, it isn't safe to retry failed requests to CreateTimeSeries, as the API isn't idempotent. Retrying those requests will often result in additional errors because the timeseries already exists. The retry policy is determined by the client library here: https://github.com/googleapis/google-cloud-go/blob/5bfee69e5e6b46c99fb04df2c7f6de560abe0655/monitoring/apiv3/metric_client.go#L138.

If you are seeing context deadline exceeded errors in particular, I would recommend increasing the timeout to ~45s.

I am curious about the Authentication Backend Unavailable error. I haven't seen that one before. Is there anything unusual about your auth setup?

nielm commented 7 months ago

The retry policy is determined by the client library

Which shows that a CreateTimeSeries RPC is never retried for any condition.

I note that in #19203 and #25900, retry_on_failure was removed from the GMP and GCM exporters because, according to #208, "retry was handled by the client libraries" -- but this was only the case for traces, not metrics (see comment).
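For context, this is roughly what the standard exporterhelper retry_on_failure block looks like on exporters that still expose it (a sketch using the generic exporterhelper defaults; per those PRs the googlecloud exporter no longer accepts this block):

exporters:
  otlp:
    retry_on_failure:
      # generic exporterhelper defaults shown for illustration
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s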

Could this be an oversight, in that retries were not enabled in the metrics client libraries when they were enabled in Logging and Tracing?

While I understand that some failed requests should not be retried, there are some that should be: specifically ones that say "Please try again"!

For example, the error Too many concurrent edits to the project configuration. Please try again always happens when a counter is used for the first time in a project, or when a new attribute is added; it seems that GCM cannot cope with a CreateTimeSeries that updates a metric.

If you are seeing context deadline exceeded errors in particular, I would recommend increasing the timeout to ~45s.

This is not trivial, as there does not seem to be a config parameter for this, so it would involve editing the source code and compiling my own version... In any case, for a collector running in GCP, exporting to GCM, it

Authentication Backend Unavailable error: Is there anything unusual about your auth setup?

Not at all: running on GKE with workload identity, using a custom service account with appropriate permissions.

If there were retries on Unavailable or Deadline Exceeded, this would not be an issue of course.

dashpole commented 7 months ago

Could this be an oversight, in that retries were not enabled in the metrics client libraries when they were enabled in Logging and Tracing?

No. This was very intentional. It was always wrong to enable retry_on_failure for metrics when using the GCP exporter, and it resulted in many complaints about log spam, since a request that failed once nearly always fails again on retry.

For example, the error Too many concurrent edits to the project configuration. Please try again always happens when a counter is used for the first time in a project, or when a new attribute is added; it seems that GCM cannot cope with a CreateTimeSeries that updates a metric.

The Too many concurrent edits to the project configuration. error is actually an error from CreateMetricDescriptor, and that call will be retried the next time a metric with that name is exported. It does not affect the delivery of timeseries data; CreateMetricDescriptor is only needed to populate the metric's unit and description.

Use

exporters:
  googlecloud:
    timeout: 45s

Sorry, it looks like that option isn't documented. We use the standard TimeoutSettings: https://github.com/open-telemetry/opentelemetry-collector/blob/f5a7315cf88e10c0bce0166b35d9227727deaa61/exporter/exporterhelper/timeout_sender.go#L13 in the exporter.
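Applied to the configuration from the original report, that would look roughly like this (a sketch; only the timeout line changes relative to the config above):

exporters:
  googlecloud:
    # standard exporterhelper timeout, raised from the default
    timeout: 45s
    metric:
      instrumentation_library_labels: false
      service_resource_labels: false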

github-actions[bot] commented 5 months ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions[bot] commented 3 months ago

This issue has been closed as inactive because it has been stale for 120 days with no activity.

AkselAllas commented 3 weeks ago

Hi @dashpole (I created a separate issue as well)

I am experiencing transient otel-collector failures when exporting trace batches (screenshot of the errors omitted).

I have:

    traces/2:
      receivers: [ otlp ]
      processors: [ tail_sampling, batch ]
      exporters: [ googlecloud ]

I have tried increasing the timeout to 45s, as described here, and decreasing the batch size from 200 to 100 as suggested here. Neither change has produced any measurable improvement.
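For reference, the settings I changed look roughly like this (a sketch of my configuration; the rest matches the pipeline above):

exporters:
  googlecloud:
    # raised from the default per the suggestion above
    timeout: 45s
processors:
  batch:
    # reduced from 200
    send_batch_size: 100
    send_batch_max_size: 100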

Stacktrace:

"caller":"exporterhelper/queue_sender.go:101", "data_type":"traces", "dropped_items":200, "error":"context deadline exceeded", "kind":"exporter", "level":"error", "msg":"Exporting failed. Dropping data.", "name":"googlecloud", "stacktrace":"go.opentelemetry.io/collector/exporter/exporterhelper.newQueueSender.func1
    go.opentelemetry.io/collector/exporter@v0.102.0/exporterhelper/queue_sender.go:101
go.opentelemetry.io/collector/exporter/internal/queue.(*boundedMemoryQueue[...]).Consume
    go.opentelemetry.io/collector/exporter@v0.102.0/internal/queue/bounded_memory_queue.go:52
go.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start.func1

Any ideas on what to do?