rook / rook

Storage Orchestration for Kubernetes
https://rook.io
Apache License 2.0

ceph-authtool command fails in operator after install on baremetal rktnetes machine #1739

Closed t3hmrman closed 6 years ago

t3hmrman commented 6 years ago

Is this a bug report or feature request?

Bug Report

What happened:

When starting rook, the rook-agent starts, but the rook-operator ends up in an error loop (so no mon or OSD pods are created).

Here's what the logs say:

2018-05-24 20:07:13.401024 I | op-mon: start running mons
2018-05-24 20:07:13.402051 I | exec: Running command: ceph-authtool --create-keyring /var/lib/rook/rook/mon.keyring --gen-key -n mon. --cap mon 'allow *'
2018-05-24 20:07:13.406401 E | op-cluster: failed to create cluster in namespace rook. failed to start the mons. failed to initialize ceph cluster info. failed to get cluster info. failed to create mon secrets. failed to gen secret. Failed to complete gen secret: signal: bad system call (core dumped)

It looks like the failure happened while running ceph-authtool, but it's not really clear why. At first I was getting a warning about not running as PID 1, so I set the TINI_SUBREAPER env variable and the warning went away, but it looks like the problem is deeper than that.
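
For reference, a sketch of one way to set that env var (assuming the operator Deployment is named rook-operator in the rook namespace; any non-empty value should do):

kubectl -n rook set env deployment/rook-operator TINI_SUBREAPER=1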

What you expected to happen:

The rook-operator should spawn all necessary pods.

How to reproduce it (minimal and precise):

  1. Set up a cluster with rkt
  2. Create rook resources (namespace, CRDs, deployment, etc.) -- see the sketch below
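
A sketch of step 2, assuming the stock example manifests that ship with the Rook release in use (file names may differ between versions):

# hypothetical file names; substitute the example manifests for your Rook version
kubectl create -f rook-operator.yaml
kubectl create -f rook-cluster.yaml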

Environment:

N/A, cluster isn't up

t3hmrman commented 6 years ago

I do want to note that the exact same configuration worked on the exact same machine when using containerd -- I do think it has something to do with rkt, probably interacting with the fact that --cap mon 'allow *' is being used...

travisn commented 6 years ago

@t3hmrman If you run the ceph-authtool command in the operator pod directly, do you get the same error with the core dump?

ceph-authtool --create-keyring /var/lib/rook/rook/mon.keyring --gen-key -n mon. --cap mon 'allow *'
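
A sketch of one way to do that, assuming <operator-pod> is the name of the running rook-operator pod in the rook namespace:

kubectl -n rook exec -it <operator-pod> -- \
  ceph-authtool --create-keyring /var/lib/rook/rook/mon.keyring --gen-key -n mon. --cap mon 'allow *'
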
galexrt commented 6 years ago

@t3hmrman --cap mon 'allow *' is a flag for the ceph-authtool command, not for the container or anything like that. In this case the flags create a Ceph "role" which has the capability to access everything on the mons (mon 'allow *'). This Ceph "role" is what the mons use to access each other.
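
For illustration, the keyring that command generates ends up looking roughly like this (the key below is a placeholder):

cat /var/lib/rook/rook/mon.keyring
[mon.]
        key = <generated-base64-key>
        caps mon = "allow *"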

The actual error message (which could be caused by ceph-authtool) is:

signal: bad system call (core dumped)

Could you check if there is a core dump on the node the rook-operator Pod runs on (command: coredumpctl)?
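
A sketch of what to run on that node (assuming systemd-coredump / coredumpctl is available there):

coredumpctl list                # recent core dumps on the node
coredumpctl info ceph-authtool  # details for the latest ceph-authtool dump, if any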

t3hmrman commented 6 years ago

Thanks for the feedback @travisn and @galexrt -- my next step will be to increase the logging on the operator and see if I can get more output (and try to run it from a shell inside the pod). @galexrt thanks for clarifying the meaning of --cap; my hunch was that it was a kernel-related option and maybe that's why the system call was core dumping.

I will find out what's really going on using coredumpctl, try running the command directly, and report back here.

[EDIT] - Bad news, coredumpctl on the node isn't showing anything.

[EDIT2] - Well, this just got more confusing -- the command doesn't error when run directly from a shell inside the container:

$ k exec -it rook-operator-6bd9c49744-xcnsg -n rook -- /bin/bash
root@rook-operator-6bd9c49744-xcnsg:/#
root@rook-operator-6bd9c49744-xcnsg:/# ceph-authtool --create-keyring /var/lib/rook/rook/mon.keyring --gen-key -n mon. --cap mon 'allow *'
creating /var/lib/rook/rook/mon.keyring

[EDIT3] - The folder it was supposed to create on the host disk is empty, but I just realized that the keyring is inside the container at /var/lib/rook. Looking into my mounts now to see if I somehow removed a mount that was supposed to be hostPathed.
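
A quick way to inspect the pod's volumes and mounts from outside (a sketch, using the pod name above):

kubectl -n rook get pod rook-operator-6bd9c49744-xcnsg \
  -o jsonpath='{.spec.volumes}{"\n"}{.spec.containers[0].volumeMounts}{"\n"}'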

[EDIT4] - My spider sense (deja vu?) is tingling... I think I've literally run into this exact same problem before -- I feel like I've had a case where I had to use a hostPath on rkt specifically because of this issue...

Also, installing coredumpctl (it isn't present by default in Ubuntu 16.04, which the container is built on) and running it inside the container doesn't yield anything, despite seeing that the process is core dumping... Maybe I could strace?
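
A sketch of how that could be done (replace <pid> with the operator's PID; strace needs to be installed and may require extra privileges to attach):

# follow child processes and narrow output to signal- and process-related syscalls
strace -f -p <pid> -e trace=signal,process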

[EDIT5] - After some careful reading, I think I more clearly understand the problem -- the ceph-authtool command isn't failing, it's the cluster startup that's failing -- I'm going to see if I can delete the cluster and see if the operator and agent will start without errors.

[EDIT6] - Yep, suspicion confirmed: the issue is closer to cluster creation. If I delete the cluster I had and restart the operator, it is error-free:

2018-05-25 04:10:37.944513 I | rook: starting Rook v0.7.1 with arguments '/usr/local/bin/rook operator'
2018-05-25 04:10:37.944593 I | rook: flag values: --help=false, --log-level=INFO, --mon-healthcheck-interval=45s, --mon-out-timeout=5m0s
2018-05-25 04:10:37.946328 I | rook: starting operator
2018-05-25 04:10:40.624968 I | op-k8sutil: cluster role rook-agent already exists. Updating if needed.
2018-05-25 04:10:40.655405 I | op-agent: getting flexvolume dir path from FLEXVOLUME_DIR_PATH env var
2018-05-25 04:10:40.655427 I | op-agent: discovered flexvolume dir path from source env var. value: /var/lib/kubelet/volumeplugins
2018-05-25 04:10:40.671551 I | op-agent: rook-agent daemonset already exists
2018-05-25 04:10:40.673200 I | operator: rook-provisioner started
2018-05-25 04:10:40.673220 I | op-cluster: start watching clusters in all namespaces

The errors I posted about earlier show up when I try to start a new cluster, I'll check and make sure my cluster resource config matches what's in the examples...

[EDIT7] - I've forgotten whether the drives needed to be formatted before startup (I'm pretty sure they didn't and rook would do it for me), so what I'm going to try is formatting the second drive, mounting it to a folder on the host, and seeing if I can get a filestore cluster to work; the current cluster is bluestore pointed at /dev/sdb.

[EDIT8] - Hmm, switching to filestore also doesn't help, and re-checking the documentation confirmed that the drives will be formatted at startup, so I should be OK there. I'm a bit perplexed by the error messages coming out of the operator, because I see the cluster getting created...

[EDIT9] - Maybe my issue is related to this other ticket about systemd?

Here's what the process hierarchy looks like inside the container:

root@rook-operator-5c6496b6c5-tzsdl:/# ps --forest aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root        89  0.0  0.0  18228  3284 ?        S    04:34   0:00 /bin/bash
root        99  0.0  0.0  34424  2964 ?        R+   04:38   0:00  \_ ps --forest aux
root         1  0.0  0.0  62348  7420 ?        Ss   04:10   0:00 /usr/lib/systemd/systemd --default-standard-output=tty --log-target=null --show-status=0
root         3  0.0  0.0  66408  8836 ?        Ss   04:10   0:00 /usr/lib/systemd/systemd-journald
root         6  0.0  0.0   4364   644 ?        Ss   04:10   0:00 /tini -- /usr/local/bin/rook operator
root        11  0.0  0.1  40608 25272 ?        Sl   04:10   0:00  \_ /usr/local/bin/rook operator

[EDIT10]

When I strace the parent, I get a repeating error:

rt_sigtimedwait(~[ILL TRAP ABRT BUS FPE SEGV TTIN TTOU SYS RTMIN RT_1], 0x7ffd9b24c7e0, {1, 0}, 8) = -1 EAGAIN (Resource temporarily unavailable)
wait4(-1, 0x7ffd9b24c84c, WNOHANG, NULL) = 0

In the child I see:

futex(0xc420032f10, FUTEX_WAKE, 1)      = 1
futex(0x1be3970, FUTEX_WAIT, 0, NULL)   = 0
epoll_wait(4, [], 128, 0)               = 0
epoll_wait(4, [{EPOLLOUT, {u32=2543135856, u64=139820908489840}}], 128, -1) = 1
epoll_wait(4, [{EPOLLOUT, {u32=2543135664, u64=139820908489648}}], 128, -1) = 1
epoll_wait(4, [{EPOLLHUP, {u32=2543136240, u64=139820908490224}}, {EPOLLHUP, {u32=2543136048, u64=139820908490032}}], 128, -1) = 2
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=403, si_uid=0, si_status=SIGSYS, si_utime=0, si_stime=0} ---
rt_sigreturn({mask=[]})                 = 2
futex(0xc420ac0110, FUTEX_WAKE, 1)      = 1
read(8, "", 512)                        = 0
epoll_ctl(4, EPOLL_CTL_DEL, 8, 0xc420803df4) = 0
close(8)                                = 0
futex(0xc420ac0110, FUTEX_WAKE, 1)      = 1
write(2, "2018-05-25 04:59:37.899010 E | o"..., 302) = 302
futex(0x1be3970, FUTEX_WAIT, 0, NULL)   = 0
futex(0x1be3970, FUTEX_WAIT, 0, NULL)   = 0
pselect6(0, NULL, NULL, NULL, {0, 3000}, NULL) = 0 (Timeout)
futex(0x1be3970, FUTEX_WAIT, 0, NULL)   = 0
futex(0x1be3970, FUTEX_WAIT, 0, NULL)   = 0
futex(0x1be3970, FUTEX_WAIT, 0, NULL)   = 0
pselect6(0, NULL, NULL, NULL, {0, 3000}, NULL) = 0 (Timeout)
futex(0x1be3970, FUTEX_WAIT, 0, NULL)   = 0
futex(0x1be3970, FUTEX_WAIT, 0, NULL

Which is basically just polling (I assume a golang select over the result of a goroutine which started the child process? I haven't looked at the code), but the child is dying, so the child process is getting killed...?

t3hmrman commented 6 years ago

Hey, I figured it out -- I noticed the PodSecurityPolicy in the helm chart and wondered if Rook now (or always??) requires the operator to be run as a privileged pod. I could have sworn it didn't before, but for my setup that's what fixes it.

That explains the bad system calls -- the system calls allowed by rkt versus containerd might be different, but in the end adding securityContext: privileged: true to the operator deployment fixed it.
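
For reference, a sketch of one way to apply that to an existing deployment (assuming the Deployment is named rook-operator and has a single container):

kubectl -n rook patch deployment rook-operator --type=json \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/securityContext", "value": {"privileged": true}}]'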

t3hmrman commented 6 years ago

I spoke too soon -- it looks like the mon created by the operator is also running into the same problem -- is every container run by rook supposed to be privileged? I don't think this was a requirement before, but looking at the newer helm charts it seems the PodSecurityPolicy is what lets all pods started by rook-ceph-operator run privileged?

I'm going to assume that this is a requirement now and keep going, but I would love an answer -- it feels like I'm doing something super wrong.

t3hmrman commented 6 years ago

I think I've narrowed it down a little more... Looking at the error:

Failed to complete gen secret: signal: bad system call (core dumped)

If I look at the code that's running gen secret, I assume it's trying to setuid to root, and that's why it's having a hard time -- I think this behavior is rkt-specific.

[EDIT] - Yup, I'm almost positive this is due to a difference between the capabilities provided by rkt vs other runtimes -- maybe it's not reasonable for rook to handle this, if it's only focusing on CRI-compliant runtimes.
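
Since "bad system call" is SIGSYS, which is typically what a seccomp filter delivers when it kills a process, one quick check from inside the container is the Seccomp field in /proc (a sketch; 0 = disabled, 1 = strict, 2 = filter):

grep Seccomp /proc/self/status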

t3hmrman commented 6 years ago

Re-closing -- this is an issue with rkt: https://github.com/rkt/rkt/issues/3823 and https://github.com/rkt/rkt/issues/3368 are the relevant issues there.