Closed t3hmrman closed 6 years ago
I do want to note that the exact same configuration worked on the same exact machine when using containerd -- I do think it has something to do with rkt, probably intersecting with the fact that `--cap mon 'allow *'` is being used...
@t3hmrman If you run the ceph-authtool command in the operator pod directly, do you get the same error with the core dump?
ceph-authtool --create-keyring /var/lib/rook/rook/mon.keyring --gen-key -n mon. --cap mon 'allow *'
@t3hmrman `--cap mon 'allow *'` is a flag of the ceph-authtool command, not of the container or anything else.
The flags in this case create a "role" which has the capability to access everything of the mons (`mon allow *`) in the Ceph context. This Ceph "role" is used by the mons to be able to access each other.
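For context, the keyring file that command writes is a small INI-style file, roughly like this (the key below is a made-up placeholder, not a real secret):

```ini
[mon.]
	key = AQAexampleplaceholderkeyonly==
	caps mon = "allow *"
```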
The actual error message (which could be caused by ceph-authtool) is:
signal: bad system call (core dumped)
Could you check if there is a core dump on the node the rook-operator Pod runs on (command: coredumpctl)?
Thanks for the feedback @travisn and @galexrt -- my next step will be to increase the logging on the operator and see if I can get more output (and try to run it from a shell inside the pod). @galexrt thanks for clarifying the meaning of --cap; my hunch was that it was a kernel-related option and maybe that's why the system call was core dumping.
I will find out what's really going on by using coredumpctl, try running the command directly, and report back here.
[EDIT] - Bad news, coredumpctl on the node isn't showing anything.
[EDIT2] - Well, this just got more confusing -- the command doesn't error when run directly from a shell inside the container:
$ k exec -it rook-operator-6bd9c49744-xcnsg -n rook -- /bin/bash
root@rook-operator-6bd9c49744-xcnsg:/#
root@rook-operator-6bd9c49744-xcnsg:/# ceph-authtool --create-keyring /var/lib/rook/rook/mon.keyring --gen-key -n mon. --cap mon 'allow *'
creating /var/lib/rook/rook/mon.keyring
[EDIT3] - The folder it was supposed to create on the host disk is empty, but I just realized that the keyring is inside the container @ /var/lib/rook, looking into my mounts now to see if I somehow removed a mount that was supposed to be hostPath'ed.
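For anyone comparing: a minimal sketch of the hostPath mount I'd expect on the operator deployment (the volume name here is my own placeholder, not necessarily what the rook manifests use):

```yaml
spec:
  containers:
    - name: rook-operator
      volumeMounts:
        - name: rook-config        # placeholder name
          mountPath: /var/lib/rook
  volumes:
    - name: rook-config
      hostPath:
        path: /var/lib/rook        # persists the keyring on the node, not just inside the container
```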
[EDIT4] - My spider sense (deja vu?) is tingling... I think I've literally run into this exact same problem before -- I feel like I've had a case where I had to use a hostPath on rkt before specifically for this issue...
Also, installing coredumpctl (it isn't present by default in Ubuntu 16.04, which the container is built on) and running it inside the container doesn't yield anything, despite seeing that the process is core dumping... Maybe I could strace?
[EDIT5] - After some careful reading, I think I understand the problem more clearly -- the ceph-authtool command isn't failing, it's the cluster startup that's failing -- I'm going to delete the cluster and see if the operator and agent will start without errors.
[EDIT6] - Yep, suspicion confirmed, the issue is closer to cluster creation; if I delete the cluster I had and restart the operator, it is error-free:
2018-05-25 04:10:37.944513 I | rook: starting Rook v0.7.1 with arguments '/usr/local/bin/rook operator'
2018-05-25 04:10:37.944593 I | rook: flag values: --help=false, --log-level=INFO, --mon-healthcheck-interval=45s, --mon-out-timeout=5m0s
2018-05-25 04:10:37.946328 I | rook: starting operator
2018-05-25 04:10:40.624968 I | op-k8sutil: cluster role rook-agent already exists. Updating if needed.
2018-05-25 04:10:40.655405 I | op-agent: getting flexvolume dir path from FLEXVOLUME_DIR_PATH env var
2018-05-25 04:10:40.655427 I | op-agent: discovered flexvolume dir path from source env var. value: /var/lib/kubelet/volumeplugins
2018-05-25 04:10:40.671551 I | op-agent: rook-agent daemonset already exists
2018-05-25 04:10:40.673200 I | operator: rook-provisioner started
2018-05-25 04:10:40.673220 I | op-cluster: start watching clusters in all namespaces
The errors I posted about earlier show up when I try to start a new cluster, I'll check and make sure my cluster resource config matches what's in the examples...
[EDIT7] - I've forgotten whether the drives needed to be formatted before startup (I'm pretty sure they didn't, and rook would do it for me), so what I'm going to try is formatting the second drive, mounting it to a folder on the host, and seeing if I can get a filestore cluster to work; the current cluster is bluestore pointed @ /dev/sdb.
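For reference, the knob I'm flipping is the storeConfig on the Cluster resource -- roughly like this in the v0.7-era examples (exact field names are from memory, so treat them as an assumption):

```yaml
apiVersion: rook.io/v1alpha1
kind: Cluster
metadata:
  name: rook
  namespace: rook
spec:
  storage:
    useAllNodes: true
    useAllDevices: false
    storeConfig:
      storeType: filestore   # was bluestore, pointed at /dev/sdb
```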
[EDIT8] - Hmnn, switching to filestore also doesn't help, and re-checking the documentation I could confirm that the drives will be formatted @ startup, so I should be OK there. I'm a bit perplexed by the error messages coming out of the operator, because I see the cluster getting created...
[EDIT9] - Maybe my issue is related to this other ticket about systemd?
Here's what the process hierarchy looks like inside the container:
root@rook-operator-5c6496b6c5-tzsdl:/# ps --forest aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 89 0.0 0.0 18228 3284 ? S 04:34 0:00 /bin/bash
root 99 0.0 0.0 34424 2964 ? R+ 04:38 0:00 \_ ps --forest aux
root 1 0.0 0.0 62348 7420 ? Ss 04:10 0:00 /usr/lib/systemd/systemd --default-standard-output=tty --log-target=null --show-status=0
root 3 0.0 0.0 66408 8836 ? Ss 04:10 0:00 /usr/lib/systemd/systemd-journald
root 6 0.0 0.0 4364 644 ? Ss 04:10 0:00 /tini -- /usr/local/bin/rook operator
root 11 0.0 0.1 40608 25272 ? Sl 04:10 0:00 \_ /usr/local/bin/rook operator
[EDIT10] - When I strace the parent, I get a repeating error:
rt_sigtimedwait(~[ILL TRAP ABRT BUS FPE SEGV TTIN TTOU SYS RTMIN RT_1], 0x7ffd9b24c7e0, {1, 0}, 8) = -1 EAGAIN (Resource temporarily unavailable)
wait4(-1, 0x7ffd9b24c84c, WNOHANG, NULL) = 0
In the child I see:
futex(0xc420032f10, FUTEX_WAKE, 1) = 1
futex(0x1be3970, FUTEX_WAIT, 0, NULL) = 0
epoll_wait(4, [], 128, 0) = 0
epoll_wait(4, [{EPOLLOUT, {u32=2543135856, u64=139820908489840}}], 128, -1) = 1
epoll_wait(4, [{EPOLLOUT, {u32=2543135664, u64=139820908489648}}], 128, -1) = 1
epoll_wait(4, [{EPOLLHUP, {u32=2543136240, u64=139820908490224}}, {EPOLLHUP, {u32=2543136048, u64=139820908490032}}], 128, -1) = 2
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=403, si_uid=0, si_status=SIGSYS, si_utime=0, si_stime=0} ---
rt_sigreturn({mask=[]}) = 2
futex(0xc420ac0110, FUTEX_WAKE, 1) = 1
read(8, "", 512) = 0
epoll_ctl(4, EPOLL_CTL_DEL, 8, 0xc420803df4) = 0
close(8) = 0
futex(0xc420ac0110, FUTEX_WAKE, 1) = 1
write(2, "2018-05-25 04:59:37.899010 E | o"..., 302) = 302
futex(0x1be3970, FUTEX_WAIT, 0, NULL) = 0
futex(0x1be3970, FUTEX_WAIT, 0, NULL) = 0
pselect6(0, NULL, NULL, NULL, {0, 3000}, NULL) = 0 (Timeout)
futex(0x1be3970, FUTEX_WAIT, 0, NULL) = 0
futex(0x1be3970, FUTEX_WAIT, 0, NULL) = 0
futex(0x1be3970, FUTEX_WAIT, 0, NULL) = 0
pselect6(0, NULL, NULL, NULL, {0, 3000}, NULL) = 0 (Timeout)
futex(0x1be3970, FUTEX_WAIT, 0, NULL) = 0
futex(0x1be3970, FUTEX_WAIT, 0, NULL
Which is basically just polling (I assume a golang select over the result of a goroutine which started the child process? I haven't looked at the code), but the child is dying, so the child process is getting killed...?
Hey, I figured it out -- I noticed the PodSecurityPolicy in the helm chart and wondered if now (or always??) Rook required the operator to be run as a privileged pod. I could have sworn it didn't before, but for my setup that's what fixes it.
This explains the bad system calls -- the system calls allowed by rkt versus containerd might be different, but in the end adding `securityContext: privileged: true` to the operator deployment fixed it.
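Concretely, the change was just this on the operator deployment's container spec:

```yaml
spec:
  template:
    spec:
      containers:
        - name: rook-operator
          securityContext:
            privileged: true
```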
I spoke too fast -- it looks like the mon created by the operator is also running into the same problem -- is every container run by rook supposed to be privileged? I don't think this was a requirement before, but looking at the newer helm charts it looks like the PodSecurityContext is ensuring that all pods started by rook-ceph-operator are privileged?
I'm going to assume that this is a requirement now and keep going, but I would love an answer; it feels like I'm doing something super wrong.
I think I've narrowed it down a little more... Looking at the error:
Failed to complete gen secret: signal: bad system call (core dumped)
If I look at the code that's running gen secret, I assume it's trying to setuid to root, and that's why it's having a hard time -- I think this behavior is rkt-specific.
[EDIT] - Yup, I'm almost positive this is due to a difference between the capabilities provided by rkt vs other runtimes -- maybe it's not reasonable for rook to handle this, if it's only focusing on CRI-compliant runtimes.
Re-closing, this is an issue with rkt: https://github.com/rkt/rkt/issues/3823 and https://github.com/rkt/rkt/issues/3368 are the relevant issues there.
Is this a bug report or feature request?
Bug Report
What happened:
When starting rook, the rook-agent starts, but rook-operator ends up error-looping (so no mon or osd pods are created). Here's what the logs say:
It looks like the failure was during the running of ceph-authtool, but it's not really clear why it failed. At first I was getting a warning about not running as PID 1, so I set the TINI_SUBREAPER env variable and the warning went away. It looks like the problem is deeper than that.
What you expected to happen:
The rook-operator to spawn all necessary pods.
How to reproduce it (minimal and precise):
- rkt
- rook resources (namespace, CRDs, deployment, etc)
Environment:
- OS (e.g. from /etc/os-release):
- Kernel (e.g. uname -a): Linux localhost 4.14.42-coreos #1 SMP Wed May 23 08:32:56 UTC 2018 x86_64
- Cloud provider or hardware configuration: Baremetal
- Rook version (use rook version inside of a Rook Pod): v0.7.1
- Kubernetes version (use kubectl version):
- Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): Baremetal
- Ceph status (use ceph health in the Rook toolbox): N/A, cluster isn't up