@jvoravong can you please take a look and see if it makes sense to update the timeout value?
Here are the timeout and retry configurations (see the config sketch after this list):
- timeout (default = 5s): Time to wait per individual attempt to send data to a backend.
- retry_on_failure
  - enabled (default = true)
  - initial_interval (default = 5s): Time to wait after the first failure before retrying
  - max_interval (default = 30s): Upper bound on backoff
  - max_elapsed_time (default = 300s): Maximum amount of time spent trying to send a batch
- sending_queue
  - enabled (default = true)
  - num_consumers (default = 10): Number of consumers that dequeue batches; ignored if enabled is false
  - queue_size (default = 5000): Maximum number of batches kept in memory before dropping. Users should calculate this as num_seconds * requests_per_second, where num_seconds is the number of seconds to buffer in case of a backend outage and requests_per_second is the average number of requests per second.
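As a point of reference, here's a minimal sketch of how these knobs sit on an exporter in the collector config. The splunk_hec exporter name is just a stand-in, the values shown are the defaults listed above, and the queue_size comment walks through the calculation; tune everything to your own throughput.

```yaml
exporters:
  splunk_hec:
    # Per-attempt send timeout (default 5s).
    timeout: 5s
    retry_on_failure:
      enabled: true
      initial_interval: 5s    # wait after the first failure before retrying
      max_interval: 30s       # upper bound on backoff
      max_elapsed_time: 300s  # give up on a batch after this long
    sending_queue:
      enabled: true
      num_consumers: 10
      # queue_size ~= num_seconds * requests_per_second, e.g. buffering a
      # 500s outage at 10 requests/second needs a queue_size of 5000.
      queue_size: 5000
```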
When we start dropping metrics: a batch is dropped once retry_on_failure.max_elapsed_time is exhausted, and new batches are dropped when the sending_queue is already full.
It's reasonable to increase the connection timeout to a value between 20s and 60s with no harmful effects.
Our users facing this issue could really use better default values, or at least the ability to configure retry_on_failure.max_elapsed_time and sending_queue.queue_size. The size of the sending queue directly affects our memory usage, and retry_on_failure.max_elapsed_time affects how fast the sending queue fills when the backend is under pressure. Without some sort of performance metrics or testing, I'm unsure what default values would be good for the retry_on_failure and sending_queue configs.
Action Items I'd recommend:
Thanks Josh!
A 10s HTTP timeout is pretty high already. If an HTTP connection cannot be established within 10s, something must be wrong with the backend or somewhere in the middle. I don't think tuning the timeout setting will help in these cases, and I don't think we should expose this setting. Better to get https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/6803 merged and expose the full set of HTTP settings for those who really need them.
I believe it is either the timeout or the default maxconnections setting that causes this. Either way, the collector should be configured not to overwhelm the backend, so if we need testing to help validate, I'd suggest the team get with cloud engineering and have a proper look. I have multiple customers who hit this...
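For context, if the linked PR lands and splunk_hec exposes the collector's shared HTTP client settings, the connection-related knobs would look roughly like the sketch below. The field names come from the collector's confighttp settings; whether they are actually available on splunk_hec depends on that PR being merged, and the values shown are illustrative assumptions, not recommendations.

```yaml
exporters:
  splunk_hec:
    # Standard confighttp client settings (availability on splunk_hec
    # depends on the PR above being merged); values are placeholders.
    timeout: 10s                  # per-request timeout
    max_idle_conns: 200           # total idle connections kept in the pool
    max_idle_conns_per_host: 200  # idle connections kept per backend host
    max_conns_per_host: 200       # hard cap on connections per backend host
    idle_conn_timeout: 10s        # how long idle connections stay open
```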
Noticed the Splunk HEC exporter defaults for max_content_length_logs and max_content_length_metrics are set to 2MiB.
Some of the Splunk Cloud docs say we should avoid having max content lengths above 1MB.
> Splunk Cloud Platform service limits and constraints
>
> HEC maximum content length size limit = 1 MB
>
> There is a recommended limit to the HEC payload size in Splunk Cloud Platform to ensure data balance and ingestion fidelity. A HEC request can have one or more Splunk events batched into it but the payload size should be no larger than this limit. If you exceed this limit, you may experience performance issues related to data balance and ingestion fidelity.
For the Splunk Cloud HEC exporters, I propose we set max_content_length_logs and max_content_length_metrics to 1MB.
Yeah, that's a "soft" limit. The Enterprise default max is 5MB. Keeping under 1MB is probably wise.
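A rough sketch of what that proposal could look like in the exporter config. The splunk_hec/platform_logs and splunk_hec/platform_metrics exporter names follow the chart's convention but are assumptions here, as is expressing the limit in bytes; whether "1MB" should be read as 1,000,000 bytes or 1 MiB (1,048,576) is noted in the comment.

```yaml
exporters:
  splunk_hec/platform_logs:
    # Proposed cap, down from the current 2 MiB default. 1 MiB shown here;
    # a strict 1 MB reading would be 1000000 bytes.
    max_content_length_logs: 1048576
  splunk_hec/platform_metrics:
    max_content_length_metrics: 1048576
```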
I have created a PR (https://github.com/signalfx/splunk-otel-collector-chart/pull/460) to expose the batch, retry_on_failure, and sending_queue configs in the values.yaml file. It will not fix this issue, but users can update these configs according to their needs.
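For anyone who needs this before that change is available, here is a hedged sketch of overriding these settings from values.yaml. It assumes the chart merges agent.config into the generated collector configuration and that the exporter is named splunk_hec/platform_logs; the values are purely illustrative, so verify the exact keys (and whatever the PR ends up exposing) against the chart's README.

```yaml
# values.yaml (sketch; assumes agent.config is merged into the generated
# collector configuration and the exporter name below matches your chart)
agent:
  config:
    exporters:
      splunk_hec/platform_logs:
        retry_on_failure:
          max_elapsed_time: 60s  # illustrative: give up on a batch sooner
        sending_queue:
          queue_size: 10000      # illustrative: buffer more during back pressure
```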
@harshit-splunk thanks. I am going to mark this issue as resolved, now that those settings are exposed. Matthew, please make sure to open a new issue if you see more timeout errors.
Hi team,
Have seen frequent reports from users of exporting failing due to `context deadline exceeded (Client.Timeout exceeded while awaiting headers)`. Wondering if this is an opportunity to tune the chart timeout values to be more resilient when sending to Splunk Cloud or customer-managed Splunk, especially in the event of any back pressure? https://github.com/signalfx/splunk-otel-collector-chart/blob/97dc6dca6f04a764a1fe48658453de72ed747a35/helm-charts/splunk-otel-collector/values.yaml#L47-L48
Here's an example of my collector sending from k8s to Splunk Cloud and hitting timeouts and eventually dropping metrics/logs
Eventually it continues to retry until: