nirmata / kube-netc

A Kubernetes eBPF network monitor
Apache License 2.0

kube-netc restarts several times #7

Open JimBugwadia opened 4 years ago

JimBugwadia commented 4 years ago

From a test system:

ubuntu@ip-10-10-129-78:~$ kubectl get pods -A | grep kube-netc
default                         kube-netc-5hrkd                                              1/1     Running       25         3d7h
ubuntu@ip-10-10-129-78:~$
ubuntu@ip-10-10-129-78:~$ kubectl describe pod kube-netc-5hrkd
Name:           kube-netc-5hrkd
Namespace:      default
Priority:       0
Node:           ip-10-10-129-78.us-west-1.compute.internal/10.10.128.238
Start Time:     Thu, 11 Jun 2020 16:09:21 +0000
Labels:         controller-revision-hash=9f5c6789c
                name=kube-netc
                pod-template-generation=1
Annotations:    <none>
Status:         Running
IP:             10.244.0.138
IPs:            <none>
Controlled By:  DaemonSet/kube-netc
Containers:
  kube-netc:
    Container ID:   docker://2650292745d9226bc5fb185c03cdae315d3a72cfe5551a41d981bc1c696fee33
    Image:          nirmata/kube-netc
    Image ID:       docker-pullable://nirmata/kube-netc@sha256:6b754b4a759d7ca992a3b464d96613f1db3851d679fbbd63b63c23ced8714e21
    Port:           2112/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Sun, 14 Jun 2020 21:07:34 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Sun, 14 Jun 2020 19:07:26 +0000
      Finished:     Sun, 14 Jun 2020 21:07:32 +0000
    Ready:          True
    Restart Count:  25
    Environment:    <none>
    Mounts:
      /sys/fs/bpf from bpf (rw)
      /sys/fs/cgroup from cgroup (rw)
      /sys/kernel/debug from debug (rw)
      /tmp from tmp (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-6dwlr (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          True
  PodScheduled   True
Volumes:
  bpf:
    Type:          HostPath (bare host directory volume)
    Path:          /sys/fs/bpf
    HostPathType:
  cgroup:
    Type:          HostPath (bare host directory volume)
    Path:          /sys/fs/cgroup
    HostPathType:
  debug:
    Type:          HostPath (bare host directory volume)
    Path:          /sys/kernel/debug
    HostPathType:
  tmp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  default-token-6dwlr:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-6dwlr
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node-role.kubernetes.io/master:NoSchedule
                 node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/unreachable:NoExecute
                 node.kubernetes.io/unschedulable:NoSchedule
Events:          <none>
drewrip commented 4 years ago

I haven't been able to reproduce these restarts myself; my kube-netc pod hasn't restarted. Could you post the pod logs?

JimBugwadia commented 4 years ago

Here is the pod data from a test system that shows multiple restarts:

ubuntu@ip-10-10-129-78:~$ kubectl describe pod kube-netc-5hrkd
Name:           kube-netc-5hrkd
Namespace:      default
Priority:       0
Node:           ip-10-10-129-78.us-west-1.compute.internal/10.10.128.238
Start Time:     Thu, 11 Jun 2020 16:09:21 +0000
Labels:         controller-revision-hash=9f5c6789c
                name=kube-netc
                pod-template-generation=1
Annotations:    <none>
Status:         Running
IP:             10.244.0.138
IPs:            <none>
Controlled By:  DaemonSet/kube-netc
Containers:
  kube-netc:
    Container ID:   docker://5f118d6cf3dc2d4a9837aabddbd7f8445b4ff8572408d23059f2f4ba68565926
    Image:          nirmata/kube-netc
    Image ID:       docker-pullable://nirmata/kube-netc@sha256:6b754b4a759d7ca992a3b464d96613f1db3851d679fbbd63b63c23ced8714e21
    Port:           2112/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Fri, 19 Jun 2020 11:29:04 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Fri, 19 Jun 2020 07:57:23 +0000
      Finished:     Fri, 19 Jun 2020 11:29:02 +0000
    Ready:          True
    Restart Count:  81
    Environment:    <none>
    Mounts:
      /sys/fs/bpf from bpf (rw)
      /sys/fs/cgroup from cgroup (rw)
      /sys/kernel/debug from debug (rw)
      /tmp from tmp (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-6dwlr (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          True
  PodScheduled   True
Volumes:
  bpf:
    Type:          HostPath (bare host directory volume)
    Path:          /sys/fs/bpf
    HostPathType:
  cgroup:
    Type:          HostPath (bare host directory volume)
    Path:          /sys/fs/cgroup
    HostPathType:
  debug:
    Type:          HostPath (bare host directory volume)
    Path:          /sys/kernel/debug
    HostPathType:
  tmp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  default-token-6dwlr:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-6dwlr
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node-role.kubernetes.io/master:NoSchedule
                 node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/unreachable:NoExecute
                 node.kubernetes.io/unschedulable:NoSchedule
Events:          <none>
JimBugwadia commented 4 years ago

Here are some more details from a test system:

[UPDATE] Pod nirmata-cni-installer-rppg9
        Labels: map[controller-revision-hash:3974475895 k8s-app:nirmata-cni-installer pod-template-generation:1 tier:node]
        IP: 10.10.128.238
Pod Update: 10.10.128.238 -> nirmata-cni-installer-rppg9
[UPDATE] Pod metrics-server-8457dfc7f5-lt7sd
        Labels: map[k8s-app:metrics-server pod-template-hash:4013897391]
        IP: 10.244.0.163
Pod Update: 10.244.0.163 -> metrics-server-8457dfc7f5-lt7sd
[UPDATE] Pod aws-node-ppsft
        Labels: map[controller-revision-hash:4159804558 k8s-app:aws-node pod-template-generation:1]
        IP: 10.10.128.238
fatal error: concurrent map read and map write

goroutine 12 [running]:
runtime.throw(0x186b967, 0x21)
        /usr/local/go/src/runtime/panic.go:1116 +0x72 fp=0xc000072c38 sp=0xc000072c08 pc=0x434c92
runtime.mapaccess2_faststr(0x16258e0, 0xc00012a3c0, 0xc00fcbfd30, 0xc, 0xc000136658, 0xc00f760201)
        /usr/local/go/src/runtime/map_faststr.go:116 +0x47c fp=0xc000072ca8 sp=0xc000072c38 pc=0x41413c
github.com/nirmata/kube-netc/pkg/collector.StartCollector(0x26730e0, 0xc0000b1300)
        /build/pkg/collector/collector.go:29 +0x427 fp=0xc000072fd0 sp=0xc000072ca8 pc=0x14ba1d7
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1373 +0x1 fp=0xc000072fd8 sp=0xc000072fd0 pc=0x464c71
created by main.main
        /build/main.go:28 +0xfa

goroutine 1 [IO wait, 60 minutes]:
internal/poll.runtime_pollWait(0x7f65bad4efc0, 0x72, 0x0)
        /usr/local/go/src/runtime/netpoll.go:203 +0x55
internal/poll.(*pollDesc).wait(0xc000122098, 0x72, 0x0, 0x0, 0x183b6ec)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:87 +0x45
internal/poll.(*pollDesc).waitRead(...)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:92
internal/poll.(*FD).Accept(0xc000122080, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
        /usr/local/go/src/internal/poll/fd_unix.go:384 +0x1d4
net.(*netFD).accept(0xc000122080, 0xc000107d20, 0x26f3ee0, 0x7f65e1a56108)
        /usr/local/go/src/net/fd_unix.go:238 +0x42
net.(*TCPListener).accept(0xc0000bef20, 0xc000107d60, 0x40dde8, 0x30)
        /usr/local/go/src/net/tcpsock_posix.go:139 +0x32
net.(*TCPListener).Accept(0xc0000bef20, 0x1744e40, 0xc00012b4d0, 0x16158e0, 0x2667000)
        /usr/local/go/src/net/tcpsock.go:261 +0x64
net/http.(*Server).Serve(0xc00012c0e0, 0x1add8a0, 0xc0000bef20, 0x0, 0x0)
        /usr/local/go/src/net/http/server.go:2901 +0x25d
net/http.(*Server).ListenAndServe(0xc00012c0e0, 0xc00012c0e0, 0xc000107f58)
        /usr/local/go/src/net/http/server.go:2830 +0xb7
net/http.ListenAndServe(...)
        /usr/local/go/src/net/http/server.go:3086
main.main()
        /build/main.go:32 +0x1c7

goroutine 19 [chan receive]:
k8s.io/klog.(*loggingT).flushDaemon(0x26f2d80)
        /go/pkg/mod/k8s.io/klog@v1.0.0/klog.go:1010 +0x8b
created by k8s.io/klog.init.0
        /go/pkg/mod/k8s.io/klog@v1.0.0/klog.go:411 +0xd6

github.com/patrickmn/go-cache.(*janitor).Run(0xc0002575e0, 0xc0002a7c00)
        /go/pkg/mod/github.com/patrickmn/go-cache@v2.1.0+incompatible/cache.go:1079 +0xd5
created by github.com/patrickmn/go-cache.runJanitor
        /go/pkg/mod/github.com/patrickmn/go-cache@v2.1.0+incompatible/cache.go:1099 +0xa9

goroutine 21 [sync.Cond.Wait, 60 minutes]:
runtime.goparkunlock(...)
        /usr/local/go/src/runtime/proc.go:310
sync.runtime_notifyListWait(0xc00032e250, 0x0)
        /usr/local/go/src/runtime/sema.go:513 +0xf8
sync.(*Cond).Wait(0xc00032e240)
        /usr/local/go/src/sync/cond.go:56 +0x9d
github.com/cihub/seelog.(*asyncLoopLogger).processItem(0xc000120900, 0x0)
        /go/pkg/mod/github.com/cihub/seelog@v0.0.0-20151216151435-d2c6e5aa9fbf/behavior_asynclooplogger.go:50 +0x8a
github.com/cihub/seelog.(*asyncLoopLogger).processQueue(0xc000120900)
        /go/pkg/mod/github.com/cihub/seelog@v0.0.0-20151216151435-d2c6e5aa9fbf/behavior_asynclooplogger.go:63 +0x42
created by github.com/cihub/seelog.NewAsyncLoopLogger
        /go/pkg/mod/github.com/cihub/seelog@v0.0.0-20151216151435-d2c6e5aa9fbf/behavior_asynclooplogger.go:40 +0x9c

goroutine 22 [sync.Cond.Wait, 60 minutes]:
runtime.goparkunlock(...)
        /usr/local/go/src/runtime/proc.go:310
sync.runtime_notifyListWait(0xc00032e3d0, 0x0)
        /usr/local/go/src/runtime/sema.go:513 +0xf8
sync.(*Cond).Wait(0xc00032e3c0)
        /usr/local/go/src/sync/cond.go:56 +0x9d
github.com/cihub/seelog.(*asyncLoopLogger).processItem(0xc000120a20, 0x0)
        /go/pkg/mod/github.com/cihub/seelog@v0.0.0-20151216151435-d2c6e5aa9fbf/behavior_asynclooplogger.go:50 +0x8a
github.com/cihub/seelog.(*asyncLoopLogger).processQueue(0xc000120a20)
        /go/pkg/mod/github.com/cihub/seelog@v0.0.0-20151216151435-d2c6e5aa9fbf/behavior_asynclooplogger.go:63 +0x42

goroutine 10 [chan send]:
github.com/nirmata/kube-netc/pkg/tracker.(*Tracker).run(0x26730e0, 0x0, 0x464c76)
        /build/pkg/tracker/tracker.go:236 +0xad5
github.com/nirmata/kube-netc/pkg/tracker.(*Tracker).StartTracker(0x26730e0)
        /build/pkg/tracker/tracker.go:139 +0x35
created by main.main
        /build/main.go:23 +0x45

goroutine 11 [sync.Cond.Wait]:
runtime.goparkunlock(...)
        /usr/local/go/src/runtime/proc.go:310
sync.runtime_notifyListWait(0xc0007b0028, 0x1db)
        /usr/local/go/src/runtime/sema.go:513 +0xf8
sync.(*Cond).Wait(0xc0007b0018)
        /usr/local/go/src/sync/cond.go:56 +0x9d
k8s.io/client-go/tools/cache.(*DeltaFIFO).Pop(0xc0007b0000, 0xc00034f1d0, 0x0, 0x0, 0x0, 0x0)
        /go/pkg/mod/k8s.io/client-go@v0.18.3/tools/cache/delta_fifo.go:479 +0xa0
k8s.io/client-go/tools/cache.(*controller).processLoop(0xc00007aa00)
        /go/pkg/mod/k8s.io/client-go@v0.18.3/tools/cache/controller.go:173 +0x40
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc00001ddd0)
        /go/pkg/mod/k8s.io/apimachinery@v0.18.3/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0004e5dd0, 0x1aa2dc0, 0xc000027b60, 0xc000044001, 0xc000492000)
        /go/pkg/mod/k8s.io/apimachinery@v0.18.3/pkg/util/wait/wait.go:156 +0xa3
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00001ddd0, 0x3b9aca00, 0x0, 0xc000490801, 0xc000492000)
        /go/pkg/mod/k8s.io/apimachinery@v0.18.3/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.Until(...)
        /go/pkg/mod/k8s.io/apimachinery@v0.18.3/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/cache.(*controller).Run(0xc00007aa00, 0xc000492000)
        /go/pkg/mod/k8s.io/client-go@v0.18.3/tools/cache/controller.go:145 +0x2c4
k8s.io/client-go/tools/cache.(*sharedIndexInformer).Run(0xc000121b00, 0xc000492000)
        /go/pkg/mod/k8s.io/client-go@v0.18.3/tools/cache/shared_informer.go:377 +0x401
github.com/nirmata/kube-netc/pkg/cluster.(*ClusterInfo).Run(0xc0000b1300)
        /build/pkg/cluster/cluster.go:157 +0x29e
created by main.main
        /build/main.go:26 +0xcc

goroutine 132 [select, 1 minutes]:
github.com/DataDog/datadog-agent/pkg/process/util.(*LogLimit).resetLoop(0xc0004f0340)
        /go/pkg/mod/github.com/!data!dog/datadog-agent@v0.0.0-20200605170216-0b30b9a1e3ac/pkg/process/util/log_limit.go:54 +0xbc
created by github.com/DataDog/datadog-agent/pkg/process/util.NewLogLimit
        /go/pkg/mod/github.com/!data!dog/datadog-agent@v0.0.0-20200605170216-0b30b9a1e3ac/pkg/process/util/log_limit.go:31 +0xb5

goroutine 131 [IO wait, 1 minutes, locked to thread]:
internal/poll.runtime_pollWait(0x7f65bad4ed20, 0x72, 0xc0013cbd98)
        /usr/local/go/src/runtime/netpoll.go:203 +0x55
internal/poll.(*pollDesc).wait(0xc0000cccd8, 0x72, 0x418801, 0x0, 0x0)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:87 +0x45
internal/poll.(*pollDesc).waitRead(...)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:92
internal/poll.(*FD).RawRead(0xc0000cccc0, 0xc001c5fb00, 0x0, 0x0)
        /usr/local/go/src/internal/poll/fd_unix.go:534 +0xf6
os.(*rawConn).Read(0xc00055c0a0, 0xc001c5fb00, 0x7f65ba203c38, 0xc00045ec98)
        /usr/local/go/src/os/rawconn.go:31 +0x65
github.com/DataDog/datadog-agent/pkg/network/netlink.(*Socket).recvmsg(0xc0000e4b40, 0xc000b12000, 0x8000, 0x8000, 0x0, 0xc000aa0000, 0x461620, 0xc000603380)
        /go/pkg/mod/github.com/!data!dog/datadog-agent@v0.0.0-20200605170216-0b30b9a1e3ac/pkg/network/netlink/socket.go:223 +0xe0
github.com/DataDog/datadog-agent/pkg/network/netlink.(*Socket).ReceiveInto(0xc0000e4b40, 0xc000aa0000, 0x1000, 0x1000, 0xff0, 0x0, 0x0, 0x0, 0x0)
        /go/pkg/mod/github.com/!data!dog/datadog-agent@v0.0.0-20200605170216-0b30b9a1e3ac/pkg/network/netlink/socket.go:117 +0x67
github.com/DataDog/datadog-agent/pkg/network/netlink.(*Consumer).receive(0xc000124070, 0xc00033a540)
        /go/pkg/mod/github.com/!data!dog/datadog-agent@v0.0.0-20200605170216-0b30b9a1e3ac/pkg/network/netlink/consumer.go:270 +0x231
github.com/DataDog/datadog-agent/pkg/network/netlink.(*Consumer).Events.func1()
        /go/pkg/mod/github.com/!data!dog/datadog-agent@v0.0.0-20200605170216-0b30b9a1e3ac/pkg/network/netlink/consumer.go:131 +0x86
github.com/DataDog/datadog-agent/pkg/network/netlink.(*Consumer).initWorker.func1.1()
        /go/pkg/mod/github.com/!data!dog/datadog-agent@v0.0.0-20200605170216-0b30b9a1e3ac/pkg/network/netlink/consumer.go:196 +0x2d
github.com/DataDog/datadog-agent/pkg/process/util.WithRootNS(0x183855b, 0x5, 0xc0007b87a8, 0x0, 0x0)
        /go/pkg/mod/github.com/!data!dog/datadog-agent@v0.0.0-20200605170216-0b30b9a1e3ac/pkg/process/util/netns.go:30 +0x1cc
github.com/DataDog/datadog-agent/pkg/network/netlink.(*Consumer).initWorker.func1(0x183855b, 0x5, 0xc000124070)
        /go/pkg/mod/github.com/!data!dog/datadog-agent@v0.0.0-20200605170216-0b30b9a1e3ac/pkg/network/netlink/consumer.go:190 +0x5d
created by github.com/DataDog/datadog-agent/pkg/network/netlink.(*Consumer).initWorker
        /go/pkg/mod/github.com/!data!dog/datadog-agent@v0.0.0-20200605170216-0b30b9a1e3ac/pkg/network/netlink/consumer.go:189 +0x53

goroutine 136 [select]:
k8s.io/client-go/tools/cache.(*Reflector).ListAndWatch.func2(0xc000484340, 0xc000492000, 0xc0000a6cc0, 0xc00033b3e0)
        /go/pkg/mod/k8s.io/client-go@v0.18.3/tools/cache/reflector.go:328 +0x14a
created by k8s.io/client-go/tools/cache.(*Reflector).ListAndWatch
        /go/pkg/mod/k8s.io/client-go@v0.18.3/tools/cache/reflector.go:322 +0x276
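
The fatal error above is Go's runtime detecting unsynchronized map access: goroutine 12 reads a shared map in collector.StartCollector (collector.go:29) while another goroutine writes to it. A minimal standalone reproduction of the hazard (illustrative code, not kube-netc's):

package main

// Reproduces "fatal error: concurrent map read and map write":
// one goroutine writes a plain map while another reads it, with no locking.
func main() {
	m := make(map[string]string)

	go func() {
		for {
			m["pod-ip"] = "10.244.0.138" // unsynchronized write
		}
	}()

	for {
		_ = m["pod-ip"] // unsynchronized read; the runtime eventually throws
	}
}
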
drewrip commented 4 years ago

I added a mutex around the ObjectIPMap in the cluster package to make access to it thread-safe (8b9b402). Hopefully that solves the restart problem.

I won't close the issue just yet, until we know this fixed it, but hopefully this is what does it.
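
The commit itself isn't shown here, but the usual shape of this fix is to pair the map with a sync.RWMutex and route every reader and writer through locked accessors. A minimal sketch, assuming ObjectIPMap maps IPs to object names (field and method names here are illustrative, not necessarily the commit's):

package cluster

import "sync"

// ObjectIPMap guards the IP-to-object lookup with an RWMutex so the
// collector goroutine can read while the informer goroutine writes.
type ObjectIPMap struct {
	mu sync.RWMutex
	m  map[string]string
}

func NewObjectIPMap() *ObjectIPMap {
	return &ObjectIPMap{m: make(map[string]string)}
}

// Get returns the object name for an IP, reporting whether it exists.
func (o *ObjectIPMap) Get(ip string) (string, bool) {
	o.mu.RLock()
	defer o.mu.RUnlock()
	name, ok := o.m[ip]
	return name, ok
}

// Set records or updates the object name for an IP.
func (o *ObjectIPMap) Set(ip, name string) {
	o.mu.Lock()
	defer o.mu.Unlock()
	o.m[ip] = name
}

sync.Map would also work here, but an RWMutex keeps the read path cheap and the accessor API typed.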

JimBugwadia commented 4 years ago

I am still seeing this:

ubuntu@ip-10-10-129-44:~$ kubectl get pods -A | grep netc
kube-system       kube-netc-nm5j7                                1/1     Running   62         6d20h

Nothing unusual in the logs:

{"level":"warn","ts":"2020-07-23T07:22:13.735Z","msg":"extreme transfer rate","package":"tracker","source":"10.10.129.239:2181","direction":"sent","bps":70695954795}
{"level":"warn","ts":"2020-07-23T07:22:14.731Z","msg":"extreme transfer rate","package":"tracker","source":"10.10.131.251:44006","direction":"sent","bps":53136722965}
{"level":"warn","ts":"2020-07-23T07:22:14.734Z","msg":"extreme transfer rate","package":"tracker","source":"10.10.129.239:2181","direction":"sent","bps":82850415419}
{"level":"warn","ts":"2020-07-23T07:22:15.729Z","msg":"extreme transfer rate","package":"tracker","source":"10.10.129.239:2181","direction":"sent","bps":237007221801}
{"level":"warn","ts":"2020-07-23T07:22:15.735Z","msg":"extreme transfer rate","package":"tracker","source":"10.10.131.251:44006","direction":"recv","bps":66751592189}
{"level":"warn","ts":"2020-07-23T07:22:16.730Z","msg":"extreme transfer rate","package":"tracker","source":"10.10.129.239:2181","direction":"sent","bps":244325343580}

Occasionally these messages appear in the logs:

E0723 07:22:13.699329       6 reflector.go:178] pkg/mod/k8s.io/client-go@v0.18.3/tools/cache/reflector.go:125: Failed to list *v1.Node: 

E0723 07:22:30.491520       6 reflector.go:178] pkg/mod/k8s.io/client-go@v0.18.3/tools/cache/reflector.go:125: Failed to list *v1.Service: services is forbidden: User "system:serviceaccount:kube-system:default" cannot list resource "services" in API group "" at the cluster scope

E0723 07:22:37.956938       6 reflector.go:178] pkg/mod/k8s.io/client-go@v0.18.3/tools/cache/reflector.go:125: Failed to list *v1.Service: services is forbidden: User "system:serviceaccount:kube-system:default" cannot list resource "services" in API group "" at the cluster scope
JimBugwadia commented 4 years ago

Found this in the logs:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x14dc502]

goroutine 66 [running]:
github.com/nirmata/kube-netc/pkg/collector.generateLabels(0xc000f02c60, 0xc, 0x20fb, 0xc000f02c50, 0xc, 0x0, 0x0, 0x0, 0x0, 0x1, ...)
        /build/pkg/collector/collector.go:54 +0x572
github.com/nirmata/kube-netc/pkg/collector.StartCollector(0x26c6960, 0xc00000ee40, 0xc00050e0b0)
        /build/pkg/collector/collector.go:76 +0x19c
created by main.main
        /build/main.go:67 +0x2c1
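
This second crash looks different from the map race: the SIGSEGV in generateLabels (collector.go:54) points at dereferencing a lookup result without checking that it exists, for example when a connection's IP has no matching pod or service yet. A hedged sketch of the defensive pattern, reusing the illustrative ObjectIPMap above (not the actual kube-netc code):

// connectionLabels builds metric labels for a connection, checking the
// lookup result before using it instead of assuming the entry exists.
func connectionLabels(ips *ObjectIPMap, srcIP string) map[string]string {
	labels := map[string]string{"source_ip": srcIP}
	name, ok := ips.Get(srcIP)
	if !ok {
		// No pod/service known for this IP yet; return partial labels
		// rather than dereferencing a missing entry.
		return labels
	}
	labels["pod_name"] = name
	return labels
}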