ottoyiu opened this issue 2 years ago
Thanks for this comprehensive report.
To clarify one thing: rpk is not a supervisor process here; rather, it replaces its own image with that of redpanda using execve(2). So redpanda is not a child of rpk but rather replaces it (maintaining the same PID, children, etc.). This explains why the process name changes as you observed.
So redpanda is PID 1, and if it crashes (or stops for any reason) I would expect the pod to terminate. If I understand you correctly, you are saying that rpk/redpanda actually restart after the first crash, in the same pod? That is quite unexpected, and I'm not sure what mechanism could cause this.
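For what it's worth, here is a minimal Go sketch of that exec-style handoff (hypothetical code, not rpk's actual implementation; the binary path and arguments are made up). The point is that execve(2) swaps the process image in place, so the PID, children, and place in the process tree are preserved while the name changes:

```go
package main

import (
	"os"
	"syscall"
)

func main() {
	// Hypothetical launcher: replace this process image with redpanda.
	// syscall.Exec wraps execve(2); on success it never returns, and the
	// current process keeps its PID and children but runs the new binary,
	// which is why the name in pstree flips from rpk to redpanda.
	argv := []string{"redpanda", "start"}
	if err := syscall.Exec("/usr/bin/redpanda", argv, os.Environ()); err != nil {
		// Only reached if the exec itself failed (e.g. binary not found).
		panic(err)
	}
}
```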
The zombie error message mentions PID 11749: it would be very nice to know which process this is. Is it one of the children of redpanda, like the stuff under supervisord, or is it runc or containerd-shim?
11749 was the PID 1 in that instance, which was redpanda.
So redpanda is PID 1, and if it crashes (or stops for any reason) I would expect the pod to terminate. If I understand you correctly, you are saying that rpk/redpanda actually restart after the first crash, in the same pod? That is quite unexpected, and I'm not sure what mechanism could cause this.
AFAIK, the process would somehow restart after the crash in the same container. I'm going to validate this.
You're right, I can't replicate the claim. It seems like the container dies as soon as redpanda dies, and a new container gets spun up to take its place.
To clarify one thing: rpk is not a supervisor process here; rather, it replaces its own image with that of redpanda using execve(2). So redpanda is not a child of rpk but rather replaces it (maintaining the same PID, children, etc.). This explains why the process name changes as you observed.
That makes sense. So there should never be a case where redpanda isn't PID 1 inside the container.
11e1979 doesn't correspond to a commit in the repo. Which version are you running?
It was my understanding that the operator overrides the default entrypoint that runs supervisord, so I'm surprised to see it in pstree. The operator also tends to supply a non-zero value for --reserve-memory on recent versions, so I'm not expecting to see 0M.
Can you describe your environment a bit more? I'm confused.
I've been informed of the situation!
It was my understanding that the operator overrides the default entrypoint that runs supervisord, so I'm surprised to see it in pstree. The operator also tends to supply a non-zero value for --reserve-memory on recent versions, so I'm not expecting to see 0M.
This is using a Helm chart, not the operator. It uses our default package, e.g., as it appears on Docker Hub. The entrypoint for that image launches supervisord as a daemon before exec'ing rpk in order to start redpanda.
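Just to make the sequence concrete, here is a rough Go rendering of what that entrypoint does (the real entrypoint is presumably a shell script; the paths, flags, and config file here are assumptions, not the image's actual contents):

```go
package main

import (
	"os"
	"os/exec"
	"syscall"
)

func main() {
	// Start supervisord in the background (hypothetical config path).
	super := exec.Command("/usr/bin/supervisord", "-c", "/etc/supervisord.conf")
	if err := super.Start(); err != nil {
		panic(err)
	}

	// Then replace this process image with rpk via execve(2). The PID is kept,
	// so this process stays PID 1 of the container, and supervisord (whether it
	// remains a direct child or daemonizes and is reparented to PID 1) shows up
	// under redpanda in pstree once rpk execs redpanda.
	argv := []string{"rpk", "redpanda", "start"}
	if err := syscall.Exec("/usr/bin/rpk", argv, os.Environ()); err != nil {
		panic(err)
	}
}
```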
I think one question is if it makes sense to have supervisord as a child of redpanda, or if these should both be children of some other process (which can be PID 1).
11749 was the PID 1 in that instance, which was redpanda.
That's interesting. So if redpanda was a zombie, it implies that redpanda did die, but the parent isn't reaping it. In this case the parent is runc, I think? So I don't think redpanda would prevent runc from terminating correctly, but rather the other way around.
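To illustrate the reaping point with a generic sketch (nothing runc-specific, just ordinary process semantics): a zombie is an exited child whose parent has not yet called wait(2)/wait4(2) on it, so a zombie redpanda means the process already died and its parent got stuck before reaping it.

```go
package main

import (
	"fmt"
	"os/exec"
	"time"
)

func main() {
	// Start a child that exits immediately.
	child := exec.Command("/bin/true")
	if err := child.Start(); err != nil {
		panic(err)
	}
	fmt.Printf("started child pid %d; deliberately not calling Wait()\n",
		child.Process.Pid)

	// Because the parent never calls Wait() (i.e. wait4(2)), the kernel keeps
	// the exited child's process-table entry around and `ps` reports it as
	// <defunct> (a zombie) until the parent reaps it or exits.
	time.Sleep(60 * time.Second)
}
```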
Could this all just be caused by the whole container/k8s stack malfunctioning when the node-wide fd limit is hit?
This is definitely a weird one :(
We only saw one case of redpanda becoming a zombie process where runc was wedged trying to reap redpanda. In other cases, the process just refuses to die. When we try to kill the container, we see that SIGKILL was sent to PID 1 (redpanda), but the process remains. Could this be due to FDs?
We validated that the node-wide fd limit was not hit in either scenario.
Try this instead:
```yaml
containers:
- command:
  - /usr/bin/rpk
  args: ["redpanda", "start", "--smp=32", "--memory=90G", "--reserve-memory=6G",
         "--advertise-kafka-addr=$(POD_IP):9092", "--kafka-addr=$(POD_IP):9092",
         "--rpc-addr=$(POD_IP):33145", "--advertise-rpc-addr=$(POD_IP):33145",
         "--default-log-level=error"]
```
The two changes:
- entrypoint to avoid supervisord (I'm assuming coproc isn't in use, but if it is, it would be better off in its own deployment)
- --reserve-memory to leave room for kernel data structures in the cgroup (this is to avoid OOMKill)

About the process not dying, I wonder if it could be related to https://github.com/scylladb/seastar/commit/443e6a9b77b1807a7a5c8ad41cd14d8998d48e27? Though it seems this only applies to a few signals like SIGABRT or SIGSEGV, and not SIGKILL, which cannot be caught at all.
Otto, are any of the problematic behaviors consistently reproducible, or have they been seen only occasionally?
@travisdowns do you think we have more investigation to do on this ticket, or should we close?
Version & Environment
- Redpanda version (rpk version): latest (rev 11e1979)
- Operating system (/etc/os-release): Debian 11.2
- Docker version (docker info): 20.10.11
- Kubernetes version (kubectl version): v1.22.7

What went wrong?
We observe that when redpanda segfaults, there is no longer a "PID1" in the container, yet the container stays around in a hung state.
This causes the following error:
and prevents runc from terminating properly:
This puts the kubelet in a very bad state and prevents proper pod cleanup due to an unexpected CRI invariant: https://github.com/kubernetes/kubernetes/commit/3eadd1a9ead7a009a9abfbd603a5efd0560473cc https://github.com/kubernetes/kubernetes/blob/b56e432f2191419647a6a13b9f5867801850f969/pkg/kubelet/kubelet.go#L1814-L1818
This is the process tree of a healthy "PID1" process from the vantage point of the host, after an initial redpanda process crash:
The command that is invoked in the container is:
However, somehow, the process name changes from redpanda to rpk when redpanda crashes for the first time, and then rpk gets renamed back to redpanda. I'm guessing this is magic done by the rpk supervisor.
I see no indication in the codebase of an attempt to change the process name using argv[0], the prctl syscall, or pthread_setname_np.
Is there any documentation that describes the forking model of redpanda? Does rpk double-fork redpanda and somehow detach the process from the parent?
What should have happened instead?
How to reproduce the issue?
Additional information
Please attach any relevant logs, backtraces, or metric charts.
JIRA Link: CORE-949