soluble-ai / kubetap

Kubectl plugin to interactively proxy Kubernetes Services with ease
https://soluble-ai.github.io/kubetap/
Apache License 2.0

All pods in NS crashed with Tap #12

Open kferrone opened 3 years ago

kferrone commented 3 years ago

Description

All pods in the namespace of the pod I tapped started tripping out. Some time after I ran the command to tap my pod, random pods in the same namespace started failing and restarting. It didn't happen right away; it started an hour or so after I left the tap on, i.e. after I was done sniffing some headers, I didn't run kubectl tap off my-service. Not only did pods start failing, but entire nodes started getting tainted with NoSchedule, which in turn caused the cluster autoscaler to overwork itself replacing failed nodes over and over.

Kubectl commands to create reproducible environment / deployment

First off, when I ran the initialize command, it would always complain that the tap took too long, and it didn't immediately port-forward on its own. Here is what I ran.

kubectl tap on -n my-ns -p 4000 my-service --port-forward

Then, because the port-forward didn't activate due to the timeout, I ran:

kubectl port-forward svc/my-service 2244:2244 -n my-ns

Then I did my sniffing and killed the port-forward, but did not turn off the tap. Leaving that extra container in one pod seemed to cause all hell to break loose in the namespace. As soon as I turned it off, everything went back to normal.
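
For reference, turning the tap off afterwards is just the matching off command (same namespace flag as above); this is what I eventually ran to recover:

kubectl tap off -n my-ns my-service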

Screenshots or other information

Kubernetes client version: 1.17
Kubernetes server version: 1.17
Cloud: AWS EKS

One thing to note is that we have Appmesh auto-inject active on the namespace. Not all pods in the NS are injected with Appmesh; however, the pod I injected with the tap was also injected with Appmesh. This means the pod already had an X-Ray sidecar and an Envoy sidecar present when I injected the tap. Maybe this was part of the issue?
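
In case it helps with reproduction, the auto-injection on our namespace is driven by a label roughly like the one below. The label key is from memory, so treat this as an approximation of our setup rather than an exact manifest:

apiVersion: v1
kind: Namespace
metadata:
  name: my-ns
  labels:
    # the Appmesh sidecar injector watches for this label and adds the
    # Envoy (and X-Ray) sidecars to pods created in this namespace
    appmesh.k8s.aws/sidecarInjectorWebhook: enabled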

Eriner commented 3 years ago

Hi @kferrone, sorry you encountered this issue. Have you been able to reproduce the issue, by chance? Do you have a set of manifests I could apply to a local cluster to reproduce on my end?

It's possible that kubetap's interaction with the other sidecars is causing the problem. Kubetap deploys the mitmproxy sidecar and then essentially sed's the Service port, replacing the target port with the mitmproxy sidecar port. The mitmproxy sidecar then forwards the traffic to the original port. It stores the original port value as an annotation. It is therefore very possible that there is an unfavorable interaction with X-Ray/Envoy.
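
As a rough sketch of that rewrite (the annotation key and proxy port below are illustrative, not necessarily the exact values kubetap uses), a Service that originally looked like:

apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  ports:
  - port: 4000
    targetPort: 4000

ends up looking roughly like:

apiVersion: v1
kind: Service
metadata:
  name: my-service
  annotations:
    # the original target port is stored so that 'tap off' can restore it
    kubetap.io/original-port: "4000"
spec:
  ports:
  - port: 4000
    # now points at the mitmproxy sidecar, which forwards traffic on to 4000
    targetPort: 7777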

If you could provide instructions to reproduce this issue, I'd be happy to take a look.

If you're interested in debugging this on your own, I suggest looking at the tap.go file here.

Thanks for filing this issue!

Eriner commented 3 years ago

First off, when I ran the initialize command, it would always complain that the tap took too long, and it didn't immediately port-forward on its own.

Just to comment on this: the timeout can occur if the Deployment is taking a while to initialize. That is to say, if the node needs to download the image and spin up the container before it can run, this can sometimes cause the timeout to be reached, particularly if the image is large.
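
If you hit that timeout again, one option is to wait for the tapped Deployment to finish rolling out and then start the port-forward manually, as you did. Something like the following, where my-deployment stands in for whichever Deployment backs my-service:

kubectl rollout status -n my-ns deployment/my-deployment
kubectl port-forward -n my-ns svc/my-service 2244:2244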