Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
Unfortunately, it isn't safe to retry failed requests to CreateTimeSeries, as the API isn't idempotent. Retrying those requests will often result in additional errors because the timeseries already exist. The retry policy is determined by the client library here: https://github.com/googleapis/google-cloud-go/blob/5bfee69e5e6b46c99fb04df2c7f6de560abe0655/monitoring/apiv3/metric_client.go#L138.
If you are seeing context deadline exceeded errors in particular, I would recommend increasing the timeout to ~45s.
I am curious about the Authentication Backend Unavailable error. I haven't seen that one before. Is there anything unusual about your auth setup?
The retry policy is determined by the client library
Which shows that a CreateTimeSeries RPC is never retried for any condition.
I note that in #19203 and #25900 retry_on_failure was removed from GMP and GCM, because according to #208 "retry was handled by the client libraries", but this was only the case for traces, not metrics. (see comment)
Could this be an oversight that retries were not enabled in the metrics client libraries when they were in Logging and Tracing?
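For reference, the retry_on_failure block that was removed in those PRs is the standard exporterhelper setting; on exporters that still expose it, it looks roughly like this (field names are the generic exporterhelper ones, and the exporter name is just a placeholder):
exporters:
  someotherexporter:          # placeholder exporter that still supports retry_on_failure
    retry_on_failure:
      enabled: true
      initial_interval: 5s    # wait before the first retry
      max_interval: 30s       # cap on the exponential backoff interval
      max_elapsed_time: 300s  # stop retrying after this much total time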
While I understand that some failed requests should not be retried, there are some that should be: specifically ones that say "Please try again"!
For example, the error Too many concurrent edits to the project configuration. Please try again always happens when a counter is used for the first time in a project, or when a new attribute is added - it seems that GCM cannot cope with a CreateTimeSeries that updates a metric.
If you are seeing context deadline exceeded errors in particular, I would recommend increasing the timeout to ~45s.
This is not trivial, as there does not seem to be a config parameter for it, so it would involve editing the source code and compiling my own version... In any case, for a collector running in GCP, exporting to GCM, it
Authentication Backend Unavailable error: Is there anything unusual about your auth setup?
Not at all: running on GKE with workload identity, using a custom service account with appropriate permissions.
If there were retries on Unavailable or Deadline Exceeded, this would not be an issue of course.
Could this be an oversight that retries were not enabled in the metrics client libraries when they were in Logging and Tracing?
No. This was very intentional. It was always wrong to enable retry_on_failure for metrics when using the GCP exporter, and it resulted in many complaints about log spam, since a request that failed once nearly always fails again when retried.
For example, the error Too many concurrent edits to the project configuration. Please try again always happens when a counter is used for the first time in a project, or when a new attribute is added - it seems that GCM cannot cope with a CreateTimeSeries that updates a metric.
The Too many concurrent edits to the project configuration error actually comes from CreateMetricDescriptor, and that call will be retried the next time a metric with that name is exported. It does not affect the delivery of timeseries data, and is only needed to populate the unit and description.
Use
exporters:
  googlecloud:
    timeout: 45s
Sorry, it looks like that option isn't documented. We use the standard TimeoutSettings: https://github.com/open-telemetry/opentelemetry-collector/blob/f5a7315cf88e10c0bce0166b35d9227727deaa61/exporter/exporterhelper/timeout_sender.go#L13 in the exporter.
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.
This issue has been closed as inactive because it has been stale for 120 days with no activity.
Hi @dashpole (Created separate issue as well)
I am experiencing transient otel-collector failures when exporting trace batches, e.g.:
I have:
traces/2:
  receivers: [ otlp ]
  processors: [ tail_sampling, batch ]
  exporters: [ googlecloud ]
I have tried increasing the timeout to 45s, as described here, and I have tried decreasing the batch size from 200 to 100, as suggested here. Neither approach has produced any statistically significant improvement.
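For reference, the relevant parts of the config after those two changes look roughly like this (a sketch reconstructed from the description above, not the exact file; the rest of the pipeline is unchanged):
processors:
  batch:
    send_batch_size: 100  # reduced from 200
exporters:
  googlecloud:
    timeout: 45s          # increased from the default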
Stacktrace:
"caller":"exporterhelper/queue_sender.go:101", "data_type":"traces", "dropped_items":200, "error":"context deadline exceeded", "kind":"exporter", "level":"error", "msg":"Exporting failed. Dropping data.", "name":"googlecloud", "stacktrace":"go.opentelemetry.io/collector/exporter/exporterhelper.newQueueSender.func1
go.opentelemetry.io/collector/exporter@v0.102.0/exporterhelper/queue_sender.go:101
go.opentelemetry.io/collector/exporter/internal/queue.(*boundedMemoryQueue[...]).Consume
go.opentelemetry.io/collector/exporter@v0.102.0/internal/queue/bounded_memory_queue.go:52
go.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start.func1
Any ideas on what to do?
Component(s)
exporter/googlecloud
What happened?
Description
When the Google Cloud Monitoring exporter fails to export metrics, it drops the data. This occurs even for transient errors where the attempt should be retried.
Steps to Reproduce
Configure the collector and export demo metrics.
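A minimal configuration along these lines is enough to reproduce it (a sketch, not the exact config used; any OTLP metrics source such as the demo app will do):
receivers:
  otlp:
    protocols:
      grpc:
exporters:
  googlecloud:
service:
  pipelines:
    metrics:
      receivers: [ otlp ]
      exporters: [ googlecloud ]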
Expected Result
Metrics are reliably exported to Google Cloud Monitoring
Actual Result
Metrics are dropped for transient errors (such as "Authentication Backend Unavailable", which occurs when the auth credentials expire and need to be refreshed).
Collector version
0.93.0
Environment information
Environment
GKE
OpenTelemetry Collector configuration
Log output
Additional context
No response