Closed: DavidS-ovm closed this issue 7 months ago.
I've not experienced this type of growth with the k8s agent but have seen it with the collector and other k8s deployments. I've never seen this pattern actually hit the limits of the deployment though.
How does the growth compare to its requests/limits? Is it ever crashing?
> I've not experienced this type of growth with the k8s agent but have seen it with the collector and other k8s deployments. I've never seen this pattern actually hit the limits of the deployment though.
Given that a freshly started agent needs hardly any CPU (and given the overall nature of what the agent does), this behaviour is suspect irrespective of any limits set.
> How does the growth compare to its requests/limits? Is it ever crashing?
As explained in the description, this agent is deployed using all the defaults from the helm chart, whatever they might be. Assuming the pod would be recycled on a crash, it doesn't appear to have crashed in the last 28 days.
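For completeness, here is a quick way to check restart counts and effective requests/limits from the cluster side; this is only a sketch, and the namespace and label selector are assumptions rather than values taken from the chart:

```shell
# List the agent pods and their restart counts (RESTARTS column).
kubectl get pods -n honeycomb -l app.kubernetes.io/name=honeycomb

# Show the requests/limits that actually ended up on the pod spec.
kubectl describe pod -n honeycomb -l app.kubernetes.io/name=honeycomb | grep -A 3 -E 'Limits|Requests'

# Compare live usage against those values (requires metrics-server).
kubectl top pod -n honeycomb
```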
The other thing I'm noticing is that there are bursts of extremely high latency reported for the kubernetes-logs dataset:
I should also have noted in the original report that this cluster is running on EKS with relatively low overall load. Here's the overall node CPU usage for the cluster:
The correlation between the agent CPU use, the moment the second node (and hence the second agent) came online, and the moment I restarted it is clear to see.
@DavidS-ovm I will keep investigating this, but note that the helm chart sets no default requests/limits: https://github.com/honeycombio/helm-charts/blob/0cba10473077edc7fbf56d3259ac6d135a67e4cb/charts/honeycomb/values.yaml#LL92-L98C19. Best practice is to set those values.
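A minimal sketch of such an override is below; the top-level `resources` key placement and the numbers are assumptions, so check the chart's values.yaml for the exact key and pick sizes appropriate for your cluster:

```shell
# Write a values override with requests/limits for the agent, then apply it.
cat <<'EOF' > honeycomb-values.yaml
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 256Mi
EOF

# Re-deploy the chart with the override (release/chart names assumed).
helm upgrade --install honeycomb honeycomb/honeycomb -n honeycomb -f honeycomb-values.yaml
```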
@TylerHelmuth I can see how that would keep this from impacting the rest of my cluster, but since k8s only throttles (rather than kills) pods that hit their CPU limit, that'll likely just mean I'm not going to get (timely) logs and metrics from the agent.
That reminded me to check the event latency of the metrics dataset, and something curious is happening there:
Since restarting the pod last week, latency for the metrics dataset has also gone through the roof:
Since we stopped having a bunch of crash-looping pods in our test systems, the CPU and memory usage of the agent has remained flat at a very low level:
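For anyone hitting the same pattern, a quick (and purely illustrative) way to spot crash-looping pods that could be generating extra log and event volume for the agent:

```shell
# List pods across all namespaces that are currently in CrashLoopBackOff.
kubectl get pods --all-namespaces | grep -i crashloopbackoff
```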
Looks like this was resolved.
Versions
Steps to reproduce
1. Deploy the agent through the helm chart (a typical install command is sketched below these steps):
2. Wait.
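A sketch of the deployment step, assuming the repo URL and chart name from the honeycombio/helm-charts repository; the API key is a placeholder:

```shell
# Add the Honeycomb chart repo and install the chart with default values.
helm repo add honeycomb https://honeycombio.github.io/helm-charts
helm repo update
helm install honeycomb honeycomb/honeycomb --set honeycomb.apiKey=$HONEYCOMB_API_KEY
```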
Additional context
Here you can see the CPU usage of the two honeycomb agents (one per node) over the last 28 days; the CPU usage is steadily increasing:
whereas the actual collected logs are not growing in volume:
Deleting the agent's pod and letting k8s redeploy a new one restarts the growth at 0.
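For reference, a sketch of that workaround; the namespace and label selector are assumptions:

```shell
# Delete the agent pod and let its controller recreate it, resetting the growth.
kubectl delete pod -n honeycomb -l app.kubernetes.io/name=honeycomb
```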
There is nothing obvious in the logs.
This is a development cluster that is not seeing a lot of log traffic, but a lot of k8s traffic (i.e. pods getting replaced, see deployment markers).