google / cadvisor

Analyzes resource usage and performance characteristics of running containers.

Spike in container_network_receive_bytes_total Metric Usage in a Kubernetes Cluster #3394

Open AnonC0DER opened 1 year ago

AnonC0DER commented 1 year ago

Summary:

I'm using cadvisor with Prometheus in multiple Kubernetes (k8s) clusters to monitor network traffic usage. I utilize the container_network_receive_bytes_total metric in a query to calculate the total network traffic usage. However, I'm encountering an unusual issue in one of the clusters.

Problem:

In one of my clusters, I have a non-production database that has been running smoothly for 20 days. However, the container_network_receive_bytes_total metric has shown a significant spike in usage, even though I am certain there is no increase in load. This issue is not isolated. I have encountered similar occurrences multiple times, and they all seem to happen in this particular cluster. I attempted numerous approaches to reproduce it, but I was unable to do so.

This is the query I'm using:

(
    sum (
        increase (
            container_network_transmit_bytes_total{namespace="TEST"}[2d]
        )
    ) by (node, cluster, namespace, pod)
) / 1000000000
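One thing worth checking with `increase()` over a long window: cAdvisor's network metrics are counters, and Prometheus treats any drop in a counter's value as a reset, counting the post-drop value in full. A container restart (which resets the counter to 0) inside the `[2d]` window is therefore reflected in the result, which can look like a traffic spike even with no change in load. A minimal Python sketch of this reset-handling logic (simplified; real Prometheus also extrapolates to the window boundaries):

```python
def increase(samples):
    """Approximate Prometheus-style increase(): sum of positive deltas,
    treating any drop as a counter reset (post-reset value counted in full)."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            # Counter dropped: assume it was reset to 0 and grew back to `cur`
            total += cur
    return total

steady = [0, 100, 200, 300]          # 300 bytes of real traffic
with_reset = [0, 100, 200, 50, 150]  # counter reset between samples 3 and 4
print(increase(steady))      # 300.0
print(increase(with_reset))  # 350.0
```

Correlating the spike's timestamp with container restarts (e.g. `kube_pod_container_status_restarts_total`, if kube-state-metrics is deployed) can rule this cause in or out.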

And this is the spike: [screenshot from 2023-09-13 22-58-01]

I believe the root cause of this issue lies within this cluster, but I am seeking guidance or clues on how to troubleshoot and resolve it.

janotav commented 1 week ago

@AnonC0DER have you figured out the problem?

I am seeing something similar. It appears to impact only one particular pod (from a 2-replica deployment). The main difference I see is that the affected pod shows significant traffic on 3 different interfaces (cni0, ens3, flannel.1), while the remaining pod(s) show metrics only for interface eth0. While I do not manage the underlying infrastructure, I believe the networking configuration is the same on all nodes.
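Since cAdvisor's network counters carry an `interface` label, a pod that unexpectedly reports host-side interfaces (cni0, flannel.1) will have its traffic multiply counted when the query sums over all series. A sketch of a query to break the traffic down per interface and see which one contributes the spike (label names assume standard cAdvisor metrics; adjust the namespace filter to your environment):

```promql
sum by (pod, interface) (
    rate(container_network_receive_bytes_total{namespace="TEST"}[5m])
)
```

If the host interfaces dominate, filtering the original query to the pod's own interface (e.g. `interface="eth0"`) should remove the double counting.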