Rommmmm opened this issue 3 months ago
Pinging code owners for exporter/datadog: @mx-psi @dineshg13 @liustanley @songy23 @mackjmr @ankitpatel96. See Adding Labels via Comments if you do not have permissions to add labels yourself.
@Rommmmm could you try upgrading to v0.102.0 and see if the issue persists? This should have been fixed in https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/33291.
Hi, I'm seeing the same issue. Updating to v0.102 doesn't help either; we are still losing metrics.
@songy23 Sorry for taking so long, but unfortunately upgrading didn't help.
@Rommmmm Does the collector not gracefully shut down at all, or is it being killed before it can shut down gracefully?
The mention of terminationGracePeriodSeconds makes it sound like the latter, which may be user error (i.e. a process can't finish its exit routine if it is forcefully interrupted and killed in the middle of it).
It's not gracefully shut down.
@Rommmmm is it being killed or terminated? Processes being killed is not a graceful shutdown scenario AFAIK.
What I'm guessing is happening is that your terminationGracePeriodSeconds is too short, so while the process is shutting down gracefully (e.g. flushing queued data to a vendor backend), the control plane simply kills it since it took too long.
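For reference, a minimal sketch of raising the grace period directly on a Kubernetes Deployment; the resource names and the 120-second value are illustrative assumptions, not values taken from this issue:

```yaml
# Illustrative sketch only: extend the pod's termination grace period so the
# collector has time to flush queued metrics/traces after receiving SIGTERM.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector              # hypothetical name, not from this issue
spec:
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      # Kubernetes defaults to 30s; 120 here is only an example value.
      terminationGracePeriodSeconds: 120
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.102.0
```

If the pod is still being killed before the flush completes, the grace period (or the exporter's shutdown work) is the place to look first.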
Component(s)
datadogexporter
What happened?
Description
We are currently experiencing an issue with the OpenTelemetry Collector running in our Kubernetes cluster, whose nodes are managed by Karpenter. Our setup uses spot instances, and we've noticed that when Karpenter terminates these instances, the OpenTelemetry Collector does not shut down gracefully. Consequently, we are losing metrics and traces that are presumably still being processed or exported.
Steps to Reproduce
Expected Result
The OpenTelemetry Collector should flush all pending metrics and traces before shutting down to ensure no data is lost during spot instance termination.
Actual Result
During a spot termination event triggered by Karpenter, the OpenTelemetry Collector shuts down without flushing all the data, causing loss of metrics and traces.
Collector version
0.95.0
Environment information
Environment
Kubernetes Version: 1.27
Karpenter Version: 0.35.2
Cloud Provider: AWS
OpenTelemetry Collector configuration
Log output
No response
Additional context
I noticed that there is a terminationGracePeriodSeconds setting in the Kubernetes Deployment spec that can give workloads more time to shut down. However, this option does not seem to be exposed in the OpenTelemetry Collector Helm chart.
I would like to suggest the following enhancement: expose terminationGracePeriodSeconds as a configurable value in the OpenTelemetry Collector Helm chart, so the collector can be given enough time to flush pending metrics and traces before termination.
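As a rough illustration (the key name below is hypothetical, since the chart does not seem to expose such a value today), the chart's values file could accept something like:

```yaml
# Hypothetical sketch of values.yaml for the opentelemetry-collector Helm chart.
# "terminationGracePeriodSeconds" is the suggested key, not an existing chart value;
# the idea is that the chart would render it into the generated pod spec.
mode: deployment
terminationGracePeriodSeconds: 120   # example value; the Kubernetes default is 30
```

Until something like this exists, the grace period can only be set by patching the rendered Deployment manifest directly, as in the sketch earlier in this thread.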