open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector

[receiver/kubeletstats] k8s.node.network.io metric is missing #33993

Open alita1991 opened 1 month ago

alita1991 commented 1 month ago

Component(s)

receiver/kubeletstats

What happened?

Description

The k8s.node.network.io metric is not collected, while others are (k8s.node.memory.*, k8s.node.filesystem.*, etc.)

Steps to Reproduce

Provision the collector using the provided config in a K3s or OpenShift environment, with a ClusterRole granting full RBAC access

Expected Result

k8s.node.network.io metric should be collected

Actual Result

k8s.node.network.io metric not found

Collector version

0.102.1

Environment information

3x AWS EC2 VMs + K3S (3 masters + 3 workers)
3x AWS EC2 VMs + OpenShift (3 masters + 3 workers)

OpenTelemetry Collector configuration

receivers:
  kubeletstats:
    templateEnabled: '{{ index .Values "mimir-distributed" "enabled" }}'
    collection_interval: 30s
    auth_type: "serviceAccount"
    endpoint: "${env:KUBELETSTATS_ENDPOINT}"
    extra_metadata_labels:
    - k8s.volume.type
    insecure_skip_verify: true
    metric_groups:
    - container
    - pod
    - volume
    - node
processors:
  batch:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 50
    spike_limit_percentage: 10
  k8sattributes:
    auth_type: 'serviceAccount'
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.pod.start_time
        - k8s.pod.uid
        - k8s.deployment.name
        - k8s.node.name
  resourcedetection/env:
    detectors:
    - env
  resource/remove_container_id:
    attributes:
    - action: delete
      key: container.id
    - action: delete
      key: container_id
exporters:
  logging:
    verbosity: detailed
  otlp:
    endpoint: '{{ template "central.collector.address" $ }}'
    tls:
      insecure: true
service:
  telemetry:
    metrics:
      address: "0.0.0.0:8888"
      level: detailed
  pipelines:
    metrics/kubeletstats:
      templateEnabled: '{{ index .Values "mimir-distributed" "enabled" }}'
      receivers: [kubeletstats]
      processors: [k8sattributes, resourcedetection/env, resource/remove_container_id, memory_limiter, batch]
      exporters: [otlp]

Log output

No errors were found in the log

Additional context

Before opening the ticket, I did some debugging, but I could not find any relevant information in debug mode. I'm trying to understand why this specific metric is not collected and what I can do to investigate the problem further.

It is important to mention that the k8s_pod_network_io_bytes_total metric was collected by the receiver.

github-actions[bot] commented 1 month ago

Pinging code owners:

ChrsMark commented 1 month ago

Hey @alita1991! I tried to reproduce this but I wasn't able to on GKE or EKS.

I'm using the following Helm chart values:

mode: daemonset
presets:
  kubeletMetrics:
    enabled: true

config:
  exporters:
    debug:
      verbosity: normal
  receivers:
    kubeletstats:
      collection_interval: 10s
      auth_type: 'serviceAccount'
      endpoint: '${env:K8S_NODE_NAME}:10250'
      insecure_skip_verify: true
      metrics:
        k8s.node.network.io:
          enabled: true

  service:
    pipelines:
      metrics:
        receivers: [kubeletstats]
        processors: [batch]
        exporters: [debug]

And I deploy the Collector with:

helm install daemonset open-telemetry/opentelemetry-collector --set image.repository="otel/opentelemetry-collector-k8s" --set image.tag="0.104.0" --values ds_k8s_metrics.yaml

GKE

v1.29.4-gke.1043004

> k logs -f daemonset-opentelemetry-collector-agent-24x6f | grep k8s.node.network.io
k8s.node.network.io{interface=eth0,direction=receive} 2508490408
k8s.node.network.io{interface=eth0,direction=transmit} 1329730075
k8s.node.network.io{interface=eth0,direction=receive} 2541570721
k8s.node.network.io{interface=eth0,direction=transmit} 1330038333
k8s.node.network.io{interface=eth0,direction=receive} 2541728902
k8s.node.network.io{interface=eth0,direction=transmit} 1330216803
k8s.node.network.io{interface=eth0,direction=receive} 2541792120
k8s.node.network.io{interface=eth0,direction=transmit} 1330323914
k8s.node.network.io{interface=eth0,direction=receive} 2541974411
k8s.node.network.io{interface=eth0,direction=transmit} 1330557979

EKS

v1.30.0-eks-036c24b

> k logs -f daemonset-opentelemetry-collector-agent-58csx | grep k8s.node.network.io
k8s.node.network.io{interface=eth0,direction=receive} 7511134123
k8s.node.network.io{interface=eth0,direction=transmit} 21146466749
k8s.node.network.io{interface=eth0,direction=receive} 7545084343
k8s.node.network.io{interface=eth0,direction=transmit} 21146550460
k8s.node.network.io{interface=eth0,direction=receive} 7545094892
k8s.node.network.io{interface=eth0,direction=transmit} 21146552331

I suggest you verify what the /stats/summary endpoint provides. I suspect it returns no values for this metric, or something else odd is going on. You can run the following debug Pod to get this info. Note that you need to use the same service account that the Collector uses (if the Collector is already running) in order to get access to this endpoint (in my case it was named daemonset-opentelemetry-collector):

kubectl run my-shell --rm -i --tty --image=ubuntu --overrides='{ "apiVersion": "v1", "spec": { "serviceAccountName": "daemonset-opentelemetry-collector", "hostNetwork": true }  }' -- bash
apt update
apt-get install curl jq
export token=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token) && curl -H "Authorization: Bearer $token" https://$HOSTNAME:10250/stats/summary --insecure
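
If the full summary is too verbose, you can narrow it to the node-level network section by appending | jq '.node.network' to the curl command above (jq is installed in the previous step).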

In my case it gave:

{
  "time": "2024-07-10T09:34:01Z",
  "name": "eth0",
  "rxBytes": 3234464903,
  "rxErrors": 0,
  "txBytes": 1197870852,
  "txErrors": 0,
  "interfaces": [
    {
      "name": "eth0",
      "rxBytes": 3234464903,
      "rxErrors": 0,
      "txBytes": 1197870852,
      "txErrors": 0
    }
  ]
}
alita1991 commented 1 month ago

Hi,

I tested using your config and got 0 data points for k8s.node.network.io. What could it be? I don't see any RBAC-related issues in the logs.

kubectl logs daemonset-opentelemetry-collector-agent-dw7hf | grep k8s.node.network.io | wc -l
0

k8s.pod.network.io is working as expected:

kubectl logs daemonset-opentelemetry-collector-agent-dw7hf | grep k8s.pod.network.io | wc -l
2546

I also tested the scrape via curl; here is the result for one of the nodes:

"node":{
"network":{
"time":"2024-07-10T12:42:59Z",
"name":"",
"interfaces":[
{
"name":"ens5",
"rxBytes":481114242884,
"rxErrors":0,
"txBytes":715126064226,
"txErrors":0
},
{
"name":"ovs-system",
"rxBytes":0,
"rxErrors":0,
"txBytes":0,
"txErrors":0
},
{
"name":"ovn-k8s-mp0",
"rxBytes":5821168746,
"rxErrors":0,
"txBytes":47539598446,
"txErrors":0
},
{
"name":"genev_sys_6081",
"rxBytes":265742543652,
"rxErrors":0,
"txBytes":370984422928,
"txErrors":0
},
ChrsMark commented 1 month ago

Thanks @alita1991 for checking this!

It seems that in your case the top-level info is missing compared to what I see:

{
  "time": "2024-07-10T09:34:01Z",
  "name": "eth0",
  "rxBytes": 3234464903,
  "rxErrors": 0,
  "txBytes": 1197870852,
  "txErrors": 0,
  "interfaces": [
    {
      "name": "eth0",
      "rxBytes": 3234464903,
      "rxErrors": 0,
      "txBytes": 1197870852,
      "txErrors": 0
    }
  ]
}

Also, removing these lines from the testing sample at https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/8183bd9d091a032f7e9ba24f9adcaa4774d1ff1b/receiver/kubeletstatsreceiver/testdata/stats-summary.json#L78-L81 makes the unit tests fail.

The missing information is about the default interface according to https://pkg.go.dev/k8s.io/kubelet@v0.29.3/pkg/apis/stats/v1alpha1#NetworkStats.

Indeed, checking the code, it seems that we only extract the top-level tx/rx metrics: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/v0.104.0/receiver/kubeletstatsreceiver/internal/kubelet/network.go#L24-L42.

So the question here is whether we should consider this a bug/limitation and expand the receiver to collect metrics for all of the interfaces instead of just the default one. Note that the Interfaces list includes the default interface, so by iterating over it we would still get the default interface's metrics.
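
To make the difference concrete, here is a minimal, self-contained Go sketch (an illustration only, not the receiver's actual code) that mirrors the shape of the kubelet NetworkStats payload. Reading only the top-level (default-interface) counters emits nothing for a payload like the one posted above, while iterating the interfaces list covers every NIC, including the default one:

package main

import (
	"encoding/json"
	"fmt"
)

// Minimal stand-ins for the kubelet stats v1alpha1 types referenced above
// (error counters and timestamps omitted for brevity).
type InterfaceStats struct {
	Name    string  `json:"name"`
	RxBytes *uint64 `json:"rxBytes,omitempty"`
	TxBytes *uint64 `json:"txBytes,omitempty"`
}

type NetworkStats struct {
	InterfaceStats                  // top-level ("default" interface) counters
	Interfaces     []InterfaceStats `json:"interfaces,omitempty"`
}

// recordDefaultOnly mimics the current behavior described above: it reads only
// the top-level counters. With an empty name and no top-level rxBytes/txBytes,
// as in the payload posted in this issue, it emits nothing.
func recordDefaultOnly(s NetworkStats) {
	if s.RxBytes != nil {
		fmt.Printf("k8s.node.network.io{interface=%s,direction=receive} %d\n", s.Name, *s.RxBytes)
	}
	if s.TxBytes != nil {
		fmt.Printf("k8s.node.network.io{interface=%s,direction=transmit} %d\n", s.Name, *s.TxBytes)
	}
}

// recordAllInterfaces sketches the proposed alternative: iterate Interfaces,
// which also contains the default interface, so nothing is lost.
func recordAllInterfaces(s NetworkStats) {
	for _, iface := range s.Interfaces {
		recordDefaultOnly(NetworkStats{InterfaceStats: iface})
	}
}

func main() {
	// Trimmed-down version of the node payload posted earlier in this issue.
	payload := []byte(`{
		"name": "",
		"interfaces": [
			{"name": "ens5", "rxBytes": 481114242884, "txBytes": 715126064226},
			{"name": "ovn-k8s-mp0", "rxBytes": 5821168746, "txBytes": 47539598446}
		]
	}`)

	var s NetworkStats
	if err := json.Unmarshal(payload, &s); err != nil {
		panic(err)
	}

	fmt.Println("top-level only (current behavior):")
	recordDefaultOnly(s) // prints nothing for this payload

	fmt.Println("iterating interfaces (proposed):")
	recordAllInterfaces(s)
}

For the trimmed payload embedded in main, the "top-level only" branch prints nothing, while the "iterating interfaces" branch prints receive/transmit data points for ens5 and ovn-k8s-mp0.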

I would like to hear what @TylerHelmuth and @dmitryax think here.

Update: I see this was already reported for pod metrics at https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/30196