redpanda-data / redpanda

Redpanda is a streaming data platform for developers. Kafka API compatible. 10x faster. No ZooKeeper. No JVM!
https://redpanda.com

cannot kill Pod because of PID1 issue with container when redpanda exits unexpectedly #5152

Open ottoyiu opened 2 years ago

ottoyiu commented 2 years ago

Version & Environment

Redpanda version: (use rpk version): latest (rev 11e1979)

What went wrong?

libc++abi: terminating with uncaught exception of type std::bad_alloc: std::bad_alloc
Segmentation fault on shard 0.
Backtrace:
  0x3d7c428
  0x291a486e33bf
  /opt/redpanda/lib/libc.so.6+0x22940
  0x291a485dfeb5
  /opt/redpanda/lib/libc++abi.so.1+0xf8fe
  0x291a485df082
  0x291a485df027
  0x4073a6a
  0x34081b1
  0x3408f1e
  0x3d2b98f
  0x3d2ede7
  0x3d2c215
  0x3c7bc3c
  0x3c79920
  0x1398c24
  0x4073e1c
  /opt/redpanda/lib/libc.so.6+0x240b2
  0x1395dad

We observe that when redpanda segfaults, there is no longer a PID 1 in the container, yet the container stays around in a hung state.

This causes the following error:

Events:
  Type     Reason         Age                From     Message
  ----     ------         ----               ----     -------
  Warning  FailedKillPod  11m                kubelet  error killing pod: failed to "KillContainer" for "redpanda-cluster" with KillContainerError: "rpc error: code = Unknown desc = Error response from daemon: cannot stop container: c735e2cf774dd16fa62b6b0b7fc3de7d15a534b7f3804e8c869b3503d1088da8: container c735e2cf774d PID 11749 is zombie and can not be killed. Use the --init option when creating containers to run an init inside the container that forwards signals and reaps processes"
  Normal   Killing        11m (x2 over 12m)  kubelet  Stopping container redpanda-cluster

and prevents runc from terminating properly:

root     35697 90.9  0.0      0     0 ?        Zl   21:27  97:51 [containerd-shim] <defunct>
root     46857  0.0  0.0      0     0 ?        Z    22:40   0:00 [runc] <defunct>

This puts the kubelet in a very bad state and prevents proper pod cleanup due to an unexpected CRI invariant: https://github.com/kubernetes/kubernetes/commit/3eadd1a9ead7a009a9abfbd603a5efd0560473cc https://github.com/kubernetes/kubernetes/blob/b56e432f2191419647a6a13b9f5867801850f969/pkg/kubelet/kubelet.go#L1814-L1818

This is the process tree of a healthy PID 1, as seen from the host, after an initial redpanda process crash:

~# pstree 75975
rpk-+-supervisord-+-coproc.sh---inotifywait
    |             |-rpk-status.sh---sleep
    |             `-9*[{supervisord}]
    `-20*[{rpk}]

which is then renamed back to:

~# pstree 75975
redpanda-+-supervisord-+-coproc.sh---inotifywait
         |             |-rpk-status.sh---sleep
         |             `-9*[{supervisord}]
         `-63*[{redpanda}]

The command that is invoked on the container is:

  containers:
  - args:
    - |
      redpanda start --smp=32 --memory=90G --reserve-memory=0M --advertise-kafka-addr=$(POD_IP):9092 --kafka-addr=$(POD_IP):9092 --rpc-addr=$(POD_IP):33145 --advertise-rpc-addr=$(POD_IP):33145 --default-log-level=error

However, somehow, the process name changes from redpanda to rpk when redpanda crashes for the first time, and then rpk is renamed back to redpanda. I'm guessing this is some magic done by the rpk supervisor.

I see nothing in the codebase that attempts to change the process name via argv[0], the prctl syscall, or pthread_setname_np.
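
For reference, this is roughly what an in-process rename via prctl(PR_SET_NAME) would look like, as a minimal Go sketch (Linux only, using golang.org/x/sys/unix; this is not code from the redpanda/rpk codebase):

package main

import (
	"unsafe"

	"golang.org/x/sys/unix"
)

// setProcessName changes the calling thread's comm name (the name shown by
// tools like pstree). PR_SET_NAME truncates names longer than 15 bytes.
func setProcessName(name string) error {
	buf := make([]byte, 16) // zero-filled, so the name stays NUL-terminated
	copy(buf[:15], name)
	return unix.Prctl(unix.PR_SET_NAME, uintptr(unsafe.Pointer(&buf[0])), 0, 0, 0)
}

func main() {
	_ = setProcessName("redpanda")
	select {} // keep running so the new name can be observed from the host
}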

Is there any documentation that describes the forking model of redpanda? Does rpk double-fork redpanda and somehow detach the process from the parent?

What should have happened instead?

  1. There should always be a PID 1 inside the container that reaps the child processes it originally created.
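
For illustration, here is a minimal Go sketch of such an init-style PID 1, which forwards termination signals to its child and reaps anything that exits so no zombie is left behind (the child command and paths are assumptions, not redpanda/rpk code):

package main

import (
	"os"
	"os/exec"
	"os/signal"
	"syscall"
)

func main() {
	// Start the real workload as a child (path is illustrative).
	cmd := exec.Command("/usr/bin/rpk", os.Args[1:]...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Start(); err != nil {
		os.Exit(1)
	}

	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT, syscall.SIGCHLD)
	for sig := range sigs {
		if sig == syscall.SIGCHLD {
			// Reap every exited child without blocking.
			for {
				var ws syscall.WaitStatus
				pid, err := syscall.Wait4(-1, &ws, syscall.WNOHANG, nil)
				if pid <= 0 || err != nil {
					break
				}
				if pid == cmd.Process.Pid {
					// Main workload is gone: exit so the container stops.
					if ws.Exited() {
						os.Exit(ws.ExitStatus())
					}
					os.Exit(1)
				}
			}
		} else {
			// Forward SIGTERM/SIGINT to the main workload.
			_ = cmd.Process.Signal(sig)
		}
	}
}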

How to reproduce the issue?

  1. run redpanda in Kubernetes using the command listed above
  2. crash the redpanda process, and let rpk spin up a new redpanda process
  3. crash redpanda again (ideally with tons and tons of open file descriptors)
  4. observe the defunct process and the hung container that containerd/runc cannot clean up

Additional information


JIRA Link: CORE-949

travisdowns commented 2 years ago

Thanks for this comprehensive report.

To clarify one thing: rpk is not a supervisor process here; rather, it replaces its own image with that of redpanda using execve(2), so redpanda is not a child of rpk but rather replaces it (maintaining the same PID, children, etc.). This explains why the process name changes as you observed.
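
In syscall terms that is roughly the following pattern (a minimal Go sketch, not rpk's actual code; the redpanda binary path is an assumption):

package main

import (
	"os"
	"syscall"
)

func main() {
	binary := "/opt/redpanda/bin/redpanda" // illustrative path
	args := append([]string{binary}, os.Args[1:]...)

	// syscall.Exec wraps execve(2): on success it never returns, and the
	// current process, keeping its PID, parent, and children, is now redpanda.
	if err := syscall.Exec(binary, args, os.Environ()); err != nil {
		os.Exit(1)
	}
}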

So redpanda is PID 1, and if it crashes (or stops for any reason) I would expect the pod to terminate. If I understand you correctly, you are saying that rpk/redpanda actually restart after the first crash, in the same pod? That is quite unexpected, and I'm not sure what mechanism could cause this.

The zombie error message mentions PID 11749: it would be very nice to know which process this is. Is it one of the children of redpanda, like the stuff under supervisord, or is it runc or containerd-shim?

ottoyiu commented 2 years ago

11749 was PID 1 in that instance, which was redpanda.

So redpanda is PID 1, and if it crashes (or stops for any reason) I would expect the pod to terminate. If I understand you correctly, you are saying that rpk/redpanda actually restart after the first crash, in the same pod? That is quite unexpected, and I'm not sure what mechanism could cause this.

AFAIK, the process would somehow restart after the crash in the same container. I'm going to validate this. You're right, I can't replicate the claim. It seems like the container dies as soon as redpanda dies and a new container gets spun up to take its place.

To clarify one thing: rpk is not a supervisor process here; rather, it replaces its own image with that of redpanda using execve(2), so redpanda is not a child of rpk but rather replaces it (maintaining the same PID, children, etc.). This explains why the process name changes as you observed.

That makes sense. So there should never be a case where redpanda isn't PID 1 inside the container.

BenPope commented 2 years ago

11e1979 isn't attached to a commit on the repo. Which version are you running?

It was my understanding that the operator overrides the default entrypoint that runs supervisord, so I'm surprised to see it in pstree. The operator also tends to supply a non-zero value for --reserve-memory on recent versions, so I'm not expecting to see 0M.

Can you describe your environment a bit more? I'm confused.

BenPope commented 2 years ago

I've been informed of the situation!

travisdowns commented 2 years ago

It was my understanding that the operator overrides the default entrypoint that runs supervisord, so I'm surprised to see it in pstree. The operator also tends to supply a non-zero value for --reserve-memory on recent versions, so I'm not expecting to see 0M.

This is using a Helm chart, not the operator. It uses our default image, e.g., as it appears on Docker Hub. The entrypoint for that image launches supervisord as a daemon before exec'ing rpk in order to start redpanda.
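
A minimal sketch of that sequence, written in Go purely for illustration (the real entrypoint is a shell script, and the paths are assumptions). Because exec keeps the same PID and existing children, supervisord ends up as a child of what eventually becomes the redpanda process, which matches the pstree output earlier in this issue:

package main

import (
	"os"
	"os/exec"
	"syscall"
)

func main() {
	// Launch supervisord in the background; it survives the exec below
	// and stays a child of this PID.
	super := exec.Command("/usr/bin/supervisord") // illustrative path
	if err := super.Start(); err != nil {
		os.Exit(1)
	}

	// Replace this process (same PID 1) with rpk, which in turn exec's redpanda.
	args := append([]string{"/usr/bin/rpk"}, os.Args[1:]...)
	if err := syscall.Exec("/usr/bin/rpk", args, os.Environ()); err != nil {
		os.Exit(1)
	}
}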

I think one question is whether it makes sense to have supervisord as a child of redpanda, or whether both should be children of some other process (which can be PID 1).

travisdowns commented 2 years ago

11749 was PID 1 in that instance, which was redpanda.

That's interesting. So if redpanda was a zombie, it implies that redpanda did die but the parent isn't reaping it. In this case the parent is runc, I think? So I don't think redpanda would prevent runc from terminating correctly; it's the other way around.

Could this all just be caused by the whole container/k8s malfunctioning when the node-wide fd limit is hit?

ottoyiu commented 2 years ago

This is definitely a weird one :(

We only saw one case of redpanda becoming a zombie process with runc wedged trying to reap it. In other cases, the process just refuses to die. When we try to kill the container, we see that SIGKILL was sent to PID 1 (redpanda), but the process remains. Could this be due to FDs?

We validated that the node-wide fd limit was not hit in either scenario.

BenPope commented 2 years ago

Try this instead:

  containers:
  - command:
    - /usr/bin/rpk
    args:
    - |
      redpanda start --smp=32 --memory=90G --reserve-memory=6G --advertise-kafka-addr=$(POD_IP):9092 --kafka-addr=$(POD_IP):9092 --rpc-addr=$(POD_IP):33145 --advertise-rpc-addr=$(POD_IP):33145 --default-log-level=error

The two changes: the container command is set explicitly to /usr/bin/rpk, bypassing the image's default entrypoint (and supervisord), and --reserve-memory is given a non-zero value (6G instead of 0M).

travisdowns commented 2 years ago

About the process not dying, I wonder if it could be related to https://github.com/scylladb/seastar/commit/443e6a9b77b1807a7a5c8ad41cd14d8998d48e27? Though it seems this only applies to a few signals like SIGABRT or SIGSEGV and likely not SIGKILL.

Otto, are any of the problematic behaviors consistently reproducible, or have they been seen only occasionally?

jcsp commented 1 year ago

@travisdowns do you think we have more investigation to do on this ticket, or should we close?