honeycombio / honeycomb-kubernetes-agent

Application visibility for Kubernetes.
https://honeycomb.io
Apache License 2.0

Runaway CPU usage #352

Closed DavidS-ovm closed 7 months ago

DavidS-ovm commented 1 year ago

Versions

Image:          honeycombio/honeycomb-kubernetes-agent:2.6.0
Image ID:       docker.io/honeycombio/honeycomb-kubernetes-agent@sha256:1f68553ba8db5c86a48355f288a97485905d75bf81c919064c6fc864316ba182                                                                                                                                                                                                             

Steps to reproduce

  1. Deploy the agent through the Helm chart:

    resource "helm_release" "honeycomb" {
      name       = "honeycomb"
      repository = "https://honeycombio.github.io/helm-charts"
      chart      = "honeycomb"
      timeout    = 60
      values = [yamlencode({
        honeycomb = { apiKey = var.honeycomb_api_key },
        metrics = {
          enabled      = true
          interval     = 1 * 60 * 1000 * 1000 * 1000 # nanoseconds (60s)
          clusterName  = "k8s-${var.terraform_env_name}"
          metricGroups = ["node", "pod", "volume"]
        },
        watchers = [ /* 10 watchers */ ]
      })]
    }

  2. Wait.

Additional context

Here you can see the CPU usage of the two Honeycomb agents (one per node) over the last 28 days. The CPU usage is steadily increasing: [screenshot: agent CPU usage over the last 28 days]

whereas the actual collected logs are not growing in volume: [screenshot: collected log volume over the same period]

Deleting the agent's pod and letting k8s deploy a new one restarts the growth from zero.

There is nothing obvious in the logs.

This is a development cluster that is not seeing a lot of log traffic, but a lot of k8s traffic (i.e. pods getting replaced, see deployment markers).

TylerHelmuth commented 1 year ago

I've not experienced this type of growth with the k8s agent but have seen it with the collector and other k8s deployments. I've never seen this pattern actually hit the limits of the deployment though.

How does the growth compare to its requests/limits? Is it ever crashing?

DavidS-ovm commented 1 year ago

> I've not experienced this type of growth with the k8s agent but have seen it with the collector and other k8s deployments. I've never seen this pattern actually hit the limits of the deployment though.

Given that the agent needs essentially no CPU in its freshly started state (and given the overall nature of what the agent does), this behaviour is suspect irrespective of any limits set.

> How does the growth compare to its requests/limits? Is it ever crashing?

As explained in the description, this agent is deployed using all the defaults from the helm chart, whatever they might be. Assuming that the pod would recycle on a crash, it doesn't appear to have crashed in the last 28 days.

The other thing I'm noticing is that there are bursts of extremely high latency reported for the kubernetes-logs dataset:

[screenshot: bursts of high event latency for the kubernetes-logs dataset]

I should also have noted in the original report that this cluster is running on EKS with relatively low overall load. Here's the overall node CPU usage for the cluster:

[screenshot: overall node CPU usage for the cluster]

The correlation between the agent's CPU use, the point when the second node (and hence the second agent) came online, and the point when I restarted the agent is clear to see.

TylerHelmuth commented 1 year ago

@DavidS-ovm I will keep investigating this, but note that the helm chart sets no default requests/limits: https://github.com/honeycombio/helm-charts/blob/0cba10473077edc7fbf56d3259ac6d135a67e4cb/charts/honeycomb/values.yaml#LL92-L98C19. Best practice is to set those values.
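For reference, a minimal sketch of what pinning requests/limits could look like in the same Terraform helm_release shown above, assuming the chart exposes a top-level resources block as in the linked values.yaml; the figures below are illustrative assumptions, not recommended defaults:

    resource "helm_release" "honeycomb" {
      name       = "honeycomb"
      repository = "https://honeycombio.github.io/helm-charts"
      chart      = "honeycomb"

      values = [yamlencode({
        honeycomb = { apiKey = var.honeycomb_api_key },
        # Illustrative figures only; tune for your own cluster.
        resources = {
          requests = { cpu = "100m", memory = "128Mi" }
          limits   = { cpu = "500m", memory = "256Mi" }
        }
      })]
    }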

DavidS-ovm commented 1 year ago

@TylerHelmuth I can see how that would keep this from impacting the rest of my cluster, but assuming that k8s only throttles (rather than kills) pods that hit their CPU limits, that'll likely just mean I won't get (timely) logs and metrics from the agent.

That reminded me to check the event latency of the metrics dataset, and something curious is happening there:

[screenshot: event latency for the metrics dataset]

Since restarting the pod last week, latency for metrics has also gone through the roof.

DavidS-ovm commented 1 year ago

Since we stopped having a bunch of crash-looping pods in our test systems, the CPU and memory usage of the agent has remained flat at a very low level:

[screenshot: agent CPU and memory usage, flat at a low level]

MikeGoldsmith commented 7 months ago

Looks like this was resolved.