@jvoravong can you please take a look and see if it makes sense to update the timeout value?
Here are the timeout and retry configurations (see the config sketch after this list):
- timeout (default = 5s): Time to wait per individual attempt to send data to a backend.
- retry_on_failure
  - enabled (default = true)
  - initial_interval (default = 5s): Time to wait after the first failure before retrying
  - max_interval (default = 30s): Upper bound on backoff
  - max_elapsed_time (default = 300s): Maximum amount of time spent trying to send a batch
- sending_queue
  - enabled (default = true)
  - num_consumers (default = 10): Number of consumers that dequeue batches; ignored if enabled is false
  - queue_size (default = 5000): Maximum number of batches kept in memory before dropping. Users should calculate this as num_seconds * requests_per_second, where num_seconds is the number of seconds to buffer in case of a backend outage and requests_per_second is the average number of requests per second.
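As a point of reference, here's a minimal sketch of how these knobs sit on an exporter in the collector config. The splunk_hec exporter name is just a stand-in, the values shown are the defaults listed above, and the queue_size comment walks through the calculation; tune everything to your own throughput.

```yaml
exporters:
  splunk_hec:
    # Per-attempt send timeout (default 5s).
    timeout: 5s
    retry_on_failure:
      enabled: true
      initial_interval: 5s    # wait after the first failure before retrying
      max_interval: 30s       # upper bound on backoff
      max_elapsed_time: 300s  # give up on a batch after this long
    sending_queue:
      enabled: true
      num_consumers: 10
      # queue_size ~= num_seconds * requests_per_second, e.g. buffering a
      # 500s outage at 10 requests/second needs a queue_size of 5000.
      queue_size: 5000
```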
When we start dropping metrics: a batch is dropped once retry_on_failure.max_elapsed_time is exhausted, and new batches are dropped when the sending_queue is already full.
It's reasonable to increase the connection timeout to a value between 20s and 60s with no harmful effects.
Our users facing this issue could really use better default values, or at least the ability to configure retry_on_failure.max_elapsed_time and sending_queue.queue_size. The size of the sending queue directly affects our memory usage, and retry_on_failure.max_elapsed_time affects how fast the sending queue fills when the backend is under pressure. Without some sort of performance metrics or testing, I'm unsure what default values would be good for the retry_on_failure and sending_queue configs.
Action Items I'd recommend:
Thanks Josh!
A 10s HTTP timeout is pretty high already. If an HTTP connection cannot be established within 10s, something must be wrong with the backend or somewhere in the middle. I don't think tuning the timeout setting will help in these cases, and I don't think we should expose this setting. Better to get https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/6803 merged and expose the full set of HTTP settings for those who really need them.
I believe it is either the timeout or the default maxconnections setting that causes this. Either way, the collector should be configured not to overwhelm the backend, so if we need testing to help validate, I'd suggest the team get with cloud engineering and have a proper look. I have multiple customers who hit this...
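For context, if the linked PR lands and splunk_hec exposes the collector's shared HTTP client settings, the connection-related knobs would look roughly like the sketch below. The field names come from the collector's confighttp settings; whether they are actually available on splunk_hec depends on that PR being merged, and the values shown are illustrative assumptions, not recommendations.

```yaml
exporters:
  splunk_hec:
    # Standard confighttp client settings (availability on splunk_hec
    # depends on the PR above being merged); values are placeholders.
    timeout: 10s                  # per-request timeout
    max_idle_conns: 200           # total idle connections kept in the pool
    max_idle_conns_per_host: 200  # idle connections kept per backend host
    max_conns_per_host: 200       # hard cap on connections per backend host
    idle_conn_timeout: 10s        # how long idle connections stay open
```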
Noticed the Splunk HEC exporter defaults for max_content_length_logs and max_content_length_metrics are set to 2MiB.
Some of the Splunk Cloud docs say we should avoid having max content lengths above 1MB.
> Splunk Cloud Platform service limits and constraints
>
> HEC maximum content length size limit = 1 MB
>
> There is a recommended limit to the HEC payload size in Splunk Cloud Platform to ensure data balance and ingestion fidelity. A HEC request can have one or more Splunk events batched into it but the payload size should be no larger than this limit. If you exceed this limit, you may experience performance issues related to data balance and ingestion fidelity.
For the Splunk Cloud HEC exporters, I propose we set max_content_length_logs and max_content_length_metrics to 1MB.
Yeah, that's a "soft" limit. The Enterprise default max is 5MB. Keeping under 1MB is probably wise.
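A rough sketch of what that proposal could look like in the exporter config. The splunk_hec/platform_logs and splunk_hec/platform_metrics exporter names follow the chart's convention but are assumptions here, as is expressing the limit in bytes; whether "1MB" should be read as 1,000,000 bytes or 1 MiB (1,048,576) is noted in the comment.

```yaml
exporters:
  splunk_hec/platform_logs:
    # Proposed cap, down from the current 2 MiB default. 1 MiB shown here;
    # a strict 1 MB reading would be 1000000 bytes.
    max_content_length_logs: 1048576
  splunk_hec/platform_metrics:
    max_content_length_metrics: 1048576
```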
I have created a PR (https://github.com/signalfx/splunk-otel-collector-chart/pull/460) to expose the batch, retry_on_failure, and sending_queue configs in the values.yaml file. It will not fix this issue, but users can update these configs according to their needs.
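For anyone who needs this before that change is available, here is a hedged sketch of overriding these settings from values.yaml. It assumes the chart merges agent.config into the generated collector configuration and that the exporter is named splunk_hec/platform_logs; the values are purely illustrative, so verify the exact keys (and whatever the PR ends up exposing) against the chart's README.

```yaml
# values.yaml (sketch; assumes agent.config is merged into the generated
# collector configuration and the exporter name below matches your chart)
agent:
  config:
    exporters:
      splunk_hec/platform_logs:
        retry_on_failure:
          max_elapsed_time: 60s  # illustrative: give up on a batch sooner
        sending_queue:
          queue_size: 10000      # illustrative: buffer more during back pressure
```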
@harshit-splunk thanks. I am going to mark this issue as resolved, now that those settings are exposed. Matthew, please make sure to open a new issue if you see more timeout errors.
Hi team,
Have seen frequent reports from users of exporting failing due to `context deadline exceeded (Client.Timeout exceeded while awaiting headers)`. Wondering if this is an opportunity to tune the chart timeout values to be more resilient when sending to Splunk Cloud or customer-managed Splunk, especially in the event of any back pressure? https://github.com/signalfx/splunk-otel-collector-chart/blob/97dc6dca6f04a764a1fe48658453de72ed747a35/helm-charts/splunk-otel-collector/values.yaml#L47-L48
Here's an example of my collector sending from k8s to Splunk Cloud and hitting timeouts and eventually dropping metrics/logs
Eventually it continues to retry until: