microsoft / retina

eBPF distributed networking observability tool for Kubernetes
https://retina.sh
MIT License
2.74k stars 211 forks source link

Crash: inconsistent label cardinality #838

Open matmerr opened 1 month ago

matmerr commented 1 month ago

Describe the bug

Retina experienced crash on start with image: mcr.microsoft.com/containernetworking/retina-agent:v0.0.15

To Reproduce Steps to reproduce the behavior:

Defaulted container "retina" out of: retina, init-retina (init)
Starting Retina Agent
starting Retina daemon with legacy control plane v0.0.15
init client-go
api server:  ****
init logger
ts=2024-08-28T16:33:20.448Z level=info caller=metrics/metrics.go:169 msg="Metrics initialized"
ts=2024-08-28T16:33:20.448Z level=info caller=legacy/daemon.go:137 msg="{data aggregation level 15 0 low <nil>}"
ts=2024-08-28T16:33:20.449Z level=info caller=legacy/daemon.go:173 msg="Remote context is disabled, only pods deployed on the same node as retina-agent will be monitored"
ts=2024-08-28T16:33:20.449Z level=info caller=legacy/daemon.go:194 msg="pod selector when remote context is disabled{pod selector 15 0 spec.nodeName=aks-nodepool1-47685273-vmss000001,status.podIP!=10.0.0.4 <nil>}"
ts=2024-08-28T16:33:20.478Z level=info caller=legacy/daemon.go:226 msg="Kubernetes server version: v1.27.16"
ts=2024-08-28T16:33:20.478Z level=info caller=legacy/daemon.go:248 msg="Initializing Pod controller"
ts=2024-08-28T16:33:20.478Z level=info caller=pod/controller.go:89 msg="Setting up Pod controller"
ts=2024-08-28T16:33:20.478Z level=info caller=legacy/daemon.go:264 msg="Initializing Node controller"
ts=2024-08-28T16:33:20.478Z level=info caller=node/controller.go:89 msg="Setting up Node controller"
ts=2024-08-28T16:33:20.479Z level=info caller=legacy/daemon.go:270 msg="Initializing Service controller"
ts=2024-08-28T16:33:20.479Z level=info caller=service/controller.go:95 msg="Setting up Service controller"
ts=2024-08-28T16:33:20.479Z level=info caller=legacy/daemon.go:277 msg="Initializing MetricsConfig namespaceController"
ts=2024-08-28T16:33:20.479Z level=info caller=namespace/namespace_controller.go:105 msg="Setting up Namespace controller"
ts=2024-08-28T16:33:20.479Z level=info caller=pluginmanager/pluginmanager.go:67 msg="plugin manager has pod level enabled"
ts=2024-08-28T16:33:20.479Z level=info caller=controllermanager/controllermanager.go:79 msg="Initializing controller manager ..."
ts=2024-08-28T16:33:20.479Z level=info caller=servermanager/servermanager.go:33 msg="Initializing HTTP server ..."
ts=2024-08-28T16:33:20.479Z level=info caller=server/server.go:42 msg="Setting up handlers"
ts=2024-08-28T16:33:20.479Z level=info caller=server/server.go:57 msg="Completed handler setup"
ts=2024-08-28T16:33:20.479Z level=info caller=servermanager/servermanager.go:37 msg="HTTP server initialized..."
ts=2024-08-28T16:33:20.479Z level=info caller=legacy/daemon.go:309 msg="Started controller manager"
ts=2024-08-28T16:33:20.479Z level=info caller=servermanager/servermanager.go:42 msg="Starting HTTP server ..." host=0.0.0.0 port=10093
ts=2024-08-28T16:33:20.479Z level=info caller=server/server.go:208 msg="Starting metrics server"
ts=2024-08-28T16:33:20.480Z level=info caller=server/server.go:69 msg="starting HTTP server... on " addr=0.0.0.0:10093
ts=2024-08-28T16:33:20.480Z level=info caller=pluginmanager/pluginmanager.go:138 msg="Starting plugin manager ..."
ts=2024-08-28T16:33:20.480Z level=info caller=pluginmanager/pluginmanager.go:150 msg="starting watchers"
ts=2024-08-28T16:33:20.480Z level=info caller=watchermanager/watchermanager.go:44 msg="watcher started" watcher_type=*endpoint.EndpointWatcher
ts=2024-08-28T16:33:20.480Z level=info caller=watchermanager/watchermanager.go:44 msg="watcher started" watcher_type=*apiserver.ApiServerWatcher
ts=2024-08-28T16:33:20.481Z level=info caller=manager/server.go:83 msg="starting server" name="health probe" addr=[::]:18081
ts=2024-08-28T16:33:20.481Z level=info caller=controller/controller.go:173 msg="Starting EventSource" controller=pod controllerGroup= controllerKind=Pod source="kind source: *v1.Pod"
ts=2024-08-28T16:33:20.481Z level=info caller=controller/controller.go:173 msg="Starting EventSource" controller=service controllerGroup= controllerKind=Service source="kind source: *v1.Service"
ts=2024-08-28T16:33:20.481Z level=info caller=controller/controller.go:173 msg="Starting EventSource" controller=node controllerGroup= controllerKind=Node source="kind source: *v1.Node"
ts=2024-08-28T16:33:20.481Z level=info caller=controller/controller.go:181 msg="Starting Controller" controller=pod controllerGroup= controllerKind=Pod
ts=2024-08-28T16:33:20.481Z level=info caller=controller/controller.go:173 msg="Starting EventSource" controller=namespace controllerGroup= controllerKind=Namespace source="kind source: *v1.Namespace"
ts=2024-08-28T16:33:20.482Z level=info caller=controller/controller.go:181 msg="Starting Controller" controller=node controllerGroup= controllerKind=Node
ts=2024-08-28T16:33:20.481Z level=info caller=server/server.go:247 msg="Serving metrics server" bindAddress=:18080 secure=false
ts=2024-08-28T16:33:20.482Z level=info caller=controller/controller.go:181 msg="Starting Controller" controller=service controllerGroup= controllerKind=Service
ts=2024-08-28T16:33:20.482Z level=info caller=controller/controller.go:181 msg="Starting Controller" controller=namespace controllerGroup= controllerKind=Namespace
ts=2024-08-28T16:33:20.582Z level=info caller=controller/controller.go:215 msg="Starting workers" controller=node controllerGroup= controllerKind=Node worker count=1
ts=2024-08-28T16:33:20.583Z level=info caller=controller/controller.go:215 msg="Starting workers" controller=service controllerGroup= controllerKind=Service worker count=1
ts=2024-08-28T16:33:20.583Z level=info caller=controller/controller.go:215 msg="Starting workers" controller=pod controllerGroup= controllerKind=Pod worker count=1
ts=2024-08-28T16:33:20.583Z level=info caller=controller/controller.go:215 msg="Starting workers" controller=namespace controllerGroup= controllerKind=Namespace worker count=1
ts=2024-08-28T16:33:20.583Z level=info caller=pod/controller.go:39 msg="Reconciling Pod" Pod=kube-system/konnectivity-agent-644d74dc99-n42g2
ts=2024-08-28T16:33:20.584Z level=info caller=pod/controller.go:39 msg="Reconciling Pod" Pod=kube-system/coredns-autoscaler-7bf9ddd894-xfsp9
ts=2024-08-28T16:33:20.584Z level=info caller=pod/controller.go:39 msg="Reconciling Pod" Pod=kube-system/coredns-d696dc46-flbv7
ts=2024-08-28T16:33:20.585Z level=info caller=pod/controller.go:39 msg="Reconciling Pod" Pod=kube-system/coredns-d696dc46-n8nqk
ts=2024-08-28T16:33:21.480Z level=info caller=metrics/metrics_module.go:157 msg="Reconciling metric module" spec= specError="unsupported value type"
ts=2024-08-28T16:33:21.480Z level=info caller=metrics/forward.go:41 msg="Creating forward count metrics" options= optionsError="unsupported value type"
ts=2024-08-28T16:33:21.480Z level=info caller=metrics/forward.go:41 msg="Creating forward count metrics" options= optionsError="unsupported value type"
ts=2024-08-28T16:33:21.480Z level=info caller=metrics/drops.go:38 msg="Creating drop count metrics" options= optionsError="unsupported value type"
ts=2024-08-28T16:33:21.480Z level=info caller=metrics/drops.go:38 msg="Creating drop count metrics" options= optionsError="unsupported value type"
ts=2024-08-28T16:33:21.480Z level=info caller=metrics/tcpflags.go:37 msg="Creating TCP Flags count metrics" options= optionsError="unsupported value type"
ts=2024-08-28T16:33:21.480Z level=info caller=metrics/tcpretrans.go:37 msg="Creating TCP retransmit count metrics" options= optionsError="unsupported value type"
ts=2024-08-28T16:33:21.481Z level=info caller=metrics/dns.go:44 msg="Creating DNS count metrics" options= optionsError="unsupported value type"
ts=2024-08-28T16:33:21.481Z level=info caller=metrics/dns.go:44 msg="Creating DNS count metrics" options= optionsError="unsupported value type"
ts=2024-08-28T16:33:21.481Z level=info caller=metrics/drops.go:72 msg="src labels" labels=reason,direction,ip,namespace,podname,workload_kind,workload_name
ts=2024-08-28T16:33:21.481Z level=info caller=metrics/dns.go:76 msg="src labels" labels=query_type,query,ip,namespace,podname,workload_kind,workload_name
ts=2024-08-28T16:33:21.481Z level=info caller=metrics/dns.go:91 msg="src labels" labels=return_code,query_type,query,response,num_response,ip,namespace,podname,workload_kind,workload_name
ts=2024-08-28T16:33:21.481Z level=info caller=metrics/forward.go:78 msg="src labels" labels=direction,ip,namespace,podname,workload_kind,workload_name
ts=2024-08-28T16:33:21.481Z level=info caller=metrics/drops.go:72 msg="src labels" labels=reason,direction,ip,namespace,podname,workload_kind,workload_name
ts=2024-08-28T16:33:21.481Z level=info caller=metrics/tcpretrans.go:57 msg="src labels" labels=direction,ip,namespace,podname,workload_kind,workload_name
ts=2024-08-28T16:33:21.481Z level=info caller=metrics/forward.go:78 msg="src labels" labels=direction,ip,namespace,podname,workload_kind,workload_name
ts=2024-08-28T16:33:21.883Z level=info caller=dropreason/dropreason_linux.go:115 msg="DropReason metric compiled"
ts=2024-08-28T16:33:22.793Z level=info caller=common/common_linux.go:74 msg="perf reader created" Map=PerfEventArray(dropreason_events)#14 PageSize=4096 BufferSize=65536
ts=2024-08-28T16:33:23.648Z level=info caller=pluginmanager/pluginmanager.go:122 msg="Reconciled plugin" name=dropreason
ts=2024-08-28T16:33:23.648Z level=info caller=pluginmanager/pluginmanager.go:173 msg="starting plugin dropreason"
ts=2024-08-28T16:33:23.648Z level=info caller=dropreason/dropreason_linux.go:241 msg="Start listening for drop reason events..."
ts=2024-08-28T16:33:23.648Z level=info caller=dropreason/dropreason_linux.go:247 msg="setting up enricher since pod level is enabled"
ts=2024-08-28T16:33:25.649Z level=info caller=packetforward/packetforward_linux.go:102 msg="Packet forwarding metric header generated"
ts=2024-08-28T16:33:26.547Z level=info caller=packetforward/packetforward_linux.go:130 msg="Packet forwarding metric compiled"
ts=2024-08-28T16:33:26.549Z level=info caller=packetforward/packetforward_linux.go:176 msg="Packet forwarding metric initialized"
ts=2024-08-28T16:33:26.549Z level=info caller=pluginmanager/pluginmanager.go:122 msg="Reconciled plugin" name=packetforward
ts=2024-08-28T16:33:26.549Z level=info caller=pluginmanager/pluginmanager.go:173 msg="starting plugin packetforward"
ts=2024-08-28T16:33:26.549Z level=info caller=packetforward/packetforward_linux.go:181 msg="Start collecting packet forward metrics"
ts=2024-08-28T16:33:28.550Z level=info caller=linuxutil/linuxutil_linux.go:42 msg="Initializing linuxutil plugin..."
ts=2024-08-28T16:33:28.550Z level=info caller=pluginmanager/pluginmanager.go:122 msg="Reconciled plugin" name=linuxutil
ts=2024-08-28T16:33:28.550Z level=info caller=pluginmanager/pluginmanager.go:173 msg="starting plugin linuxutil"
ts=2024-08-28T16:33:28.550Z level=info caller=linuxutil/linuxutil_linux.go:57 msg="Running linuxutil plugin..."
ts=2024-08-28T16:33:30.550Z level=info caller=dns/dns_linux.go:90 msg="Stopped dns plugin"
ts=2024-08-28T16:33:30.586Z level=info caller=dns/dns_linux.go:64 msg="Initialized dns plugin"
ts=2024-08-28T16:33:30.587Z level=info caller=pluginmanager/pluginmanager.go:122 msg="Reconciled plugin" name=dns
ts=2024-08-28T16:33:30.587Z level=info caller=pluginmanager/pluginmanager.go:173 msg="starting plugin dns"
ts=2024-08-28T16:33:32.587Z level=info caller=packetparser/packetparser_linux.go:82 msg="PacketParser header generated at" path=/go/src/github.com/microsoft/retina/pkg/plugin/packetparser/_cprog/dynamic.h
ts=2024-08-28T16:33:33.575Z level=info caller=packetparser/packetparser_linux.go:108 msg="PacketParser metric compiled"
ts=2024-08-28T16:33:33.575Z level=info caller=packetparser/packetparser_linux.go:241 msg="Stopping packet parser"
ts=2024-08-28T16:33:33.575Z level=info caller=packetparser/packetparser_linux.go:283 msg="Stopped packet parser"
ts=2024-08-28T16:33:34.050Z level=info caller=common/common_linux.go:74 msg="perf reader created" Map=PerfEventArray(packetparser_events)#67 PageSize=4096 BufferSize=131072
ts=2024-08-28T16:33:34.050Z level=info caller=pluginmanager/pluginmanager.go:122 msg="Reconciled plugin" name=packetparser
ts=2024-08-28T16:33:34.050Z level=info caller=pluginmanager/pluginmanager.go:173 msg="starting plugin packetparser"
ts=2024-08-28T16:33:34.050Z level=info caller=packetparser/packetparser_linux.go:190 msg="Starting packet parser"
ts=2024-08-28T16:33:34.050Z level=info caller=packetparser/packetparser_linux.go:192 msg="setting up enricher since pod level is enabled"
ts=2024-08-28T16:33:34.051Z level=info caller=packetparser/packetparser_linux.go:212 msg="Attaching bpf program to default interface of k8s Node in node namespace"
ts=2024-08-28T16:33:34.051Z level=info caller=packetparser/packetparser_linux.go:223 msg="Attaching Packetparser" outgoingLink.Index=2 outgoingLink.Name=eth0 outgoingLink.HardwareAddr=00:0d:3a:f7:c4:96
ts=2024-08-28T16:33:34.115Z level=info caller=packetparser/packetparser_linux.go:513 msg="Started packet parser"
ts=2024-08-28T16:33:36.051Z level=info caller=pluginmanager/pluginmanager.go:181 msg="successfully started pluginmanager"
ts=2024-08-28T16:33:50.484Z level=info caller=apiserver/apiserver.go:118 msg="New Apiserver IPs:" ip=4.154.192.90
panic: inconsistent label cardinality: expected 0 label values but got 1 in []string{"no_response"}

goroutine 449 [running]:
github.com/prometheus/client_golang/prometheus.(*CounterVec).WithLabelValues(0x31de654?, {0xc0008c7690?, 0x2faff00?, 0xc00374b840?})
        /go/pkg/mod/github.com/prometheus/client_golang@v1.20.2/prometheus/counter.go:284 +0x34
github.com/microsoft/retina/pkg/module/metrics.(*LatencyMetrics).Init.func1({0xc0011a27d0?, 0x4499bd?}, 0x23e69c2?, 0x2962500?)
        /go/src/github.com/microsoft/retina/pkg/module/metrics/latency.go:127 +0x19b
github.com/jellydator/ttlcache/v3.(*Cache[...]).OnEviction.func1.1()
        /go/pkg/mod/github.com/jellydator/ttlcache/v3@v3.3.0/cache.go:699 +0x31
created by github.com/jellydator/ttlcache/v3.(*Cache[...]).OnEviction.func1 in goroutine 271
        /go/pkg/mod/github.com/jellydator/ttlcache/v3@v3.3.0/cache.go:698 +0xef

Expected behavior A clear and concise description of what you expected to happen.

Screenshots If applicable, add screenshots to help explain your problem.

Platform (please complete the following information):

NAME                                STATUS                     ROLES   AGE   VERSION    INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                         KERNEL-VERSION      CONTAINER-RUNTIME
aks-nodepool1-47685273-vmss000000   Ready                      agent   47d   v1.27.16   10.0.0.5      <none>        Ubuntu 22.04.4 LTS               5.15.0-1070-azure   containerd://1.7.20-1
aks-nodepool1-47685273-vmss000001   Ready                      agent   47d   v1.27.16   10.0.0.4      <none>        Ubuntu 22.04.4 LTS               5.15.0-1070-azure   containerd://1.7.20-1

Additional context Add any other context about the problem here.

nddq commented 1 month ago

should be fixed in 0.0.16 https://github.com/microsoft/retina/pull/687