Just realized that this change in rook v1.1.6 might have avoided the part where it clobbered our existing cluster. Assuming that with that change it wouldn't have tried to orchestrate when it did, or when the rook operator restarted.
In testing v1.1.6, it does not appear that #4252 would have prevented the operator from clobbering a healthy ceph cluster.
It's expected that the operator will start a new orchestration when it restarts, and also in several other scenarios such as when the cluster CR is updated. This is to ensure the operator will maintain the desired state in the cluster.
The key lines in the log are here where we see the cluster get reset:
2019-11-04 00:03:14.024417 I | op-cluster: Detected ceph image version: 14.2.4 nautilus
2019-11-04 00:03:14.027408 E | cephconfig: clusterInfo: <nil>
2019-11-04 00:03:14.027439 I | op-cluster: CephCluster rook-ceph status: Creating.
2019-11-04 00:03:14.053487 I | op-mon: start running mons
2019-11-04 00:03:14.059503 I | exec: Running command: ceph-authtool --create-keyring /var/lib/rook/rook-ceph/mon.keyring --gen-key -n mon. --cap mon 'allow *'
2019-11-04 00:03:14.117428 I | exec: Running command: ceph-authtool --create-keyring /var/lib/rook/rook-ceph/client.admin.keyring --gen-key -n client.admin --cap mon 'allow *' --cap osd 'allow *' --cap mgr 'allow *' --cap mds 'allow'
2019-11-04 00:03:14.187273 I | op-mon: creating mon secrets for a new cluster
Walking through this code path, the k8s API must have returned NotFound
for the secret (see here) where the basic cluster info is stored. If the secret is not found, Rook will assume that it's a new cluster and create new creds here.
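For readers unfamiliar with that path, here is a minimal sketch of the decision point being described (not Rook's actual code; the secret name, return shape, and plain client-go usage are illustrative): only a genuine NotFound should mean "new cluster", and that is exactly the branch a flaky API server can push the operator into.

```go
package example

import (
	"context"
	"fmt"

	kerrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// loadClusterInfo sketches the decision point: only a genuine NotFound from a
// healthy API server should be treated as "no existing cluster".
func loadClusterInfo(ctx context.Context, cs kubernetes.Interface, ns string) (map[string][]byte, error) {
	secret, err := cs.CoreV1().Secrets(ns).Get(ctx, "rook-ceph-mon", metav1.GetOptions{})
	if kerrors.IsNotFound(err) {
		// Dangerous branch: if an unstable API server spuriously reports NotFound,
		// the operator generates brand-new mon/admin keys and clobbers the cluster.
		return nil, nil // caller treats nil as "create new credentials"
	}
	if err != nil {
		// Any other error should abort the orchestration instead.
		return nil, fmt.Errorf("failed to get mon secret: %w", err)
	}
	return secret.Data, nil
}
```

The design question this thread keeps circling back to is whether anything short of a confirmed NotFound should ever be allowed to trigger new credential generation.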
The killer here is that if the k8s API is returning invalid responses, the operator will do the wrong thing. If K8s returns invalid responses to the caller instead of failure codes, we really need a fix from K8s, or else a way to know when the K8s API is unstable.
Thanks, I was trying to figure out exactly where it got onto another path but was never fully sure based on my limited understanding.
Curious about two things:
I ask because those two behaviors combined put rook into orchestration mode during various periods of instability (control plane issues, operator host issues, network partitions, etc.). It seems to me that if one or the other behaved differently, the risk of what happened to us would drop substantially.
The leader election is an internal detail of the controllers in the operator. When they get reset, they might cause the operator to think it's time to start a new orchestration, as if it had just restarted. Are you actually seeing the operator pod restarted? If so I'd like to see the previous logs from when the operator shut down. Or if the operator is just kicking off an orchestration, then it's a normal event.
The way the operator manages state is by re-applying the state. These actions are idempotent and are intended to have no side effects if the state is already applied. But if there is an upgrade, there could be a change to a pod spec and the ceph daemon pods would be restarted. This is necessary for upgrades. This is how operators are designed... They ensure "desired" state is applied rather than performing imperative tasks once.
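As a rough illustration of that "re-apply desired state" model, here is a sketch using controller-runtime's CreateOrUpdate helper (not Rook's actual code; names and fields are illustrative):

```go
package example

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// ensureDaemonDeployment re-applies the desired state for one daemon.
// Running it again with the same inputs is a no-op; only a real difference
// (e.g. a new image during an upgrade) results in an Update and a pod restart.
func ensureDaemonDeployment(ctx context.Context, c client.Client, ns, name, image string) error {
	labels := map[string]string{"app": name}
	d := &appsv1.Deployment{ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: ns}}
	_, err := controllerutil.CreateOrUpdate(ctx, c, d, func() error {
		// Mutate the object to the desired state, whether or not it exists yet.
		d.Spec.Selector = &metav1.LabelSelector{MatchLabels: labels}
		d.Spec.Template.ObjectMeta.Labels = labels
		d.Spec.Template.Spec.Containers = []corev1.Container{{Name: name, Image: image}}
		return nil
	})
	return err
}
```

Called twice with the same image, the second call changes nothing; called after bumping the image, it updates the deployment and the pods roll, which matches the upgrade behavior described above.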
Kubernetes as a distributed application platform must be able to survive network partitions and other temporary or catastrophic events. It's based on etcd for its config store for this reason. If there is ever a loss of quorum, etcd will halt and the cluster should stop working. Similarly, Ceph is also designed to halt if the cluster is too unhealthy, rather than continuing and corrupting things.
Is there a possibility that your K8s automation is resetting K8s in some way that would be causing this? I haven't heard of others experiencing this issue. This corruption is completely unexpected from K8s. Otherwise, Rook or other applications can't rely on it as a distributed platform.
I'm having the same problem on my kubernetes cluster with rook-ceph v1.1.
csi-cephfsplugin-provisioner-75c965db4f-7g8sd 4/4 Running 15 99m
csi-cephfsplugin-provisioner-75c965db4f-k7tlb 4/4 Running 7 99m
csi-cephfsplugin-spj6n 3/3 Running 0 99m
csi-rbdplugin-provisioner-56cbc4d585-jpt98 5/5 Running 10 99m
csi-rbdplugin-provisioner-56cbc4d585-rfndh 5/5 Running 21 99m
csi-rbdplugin-vr6hm 3/3 Running 0 99m
rook-ceph-mgr-a-799cd96bdd-vc8dj 1/1 Running 9 98m
rook-ceph-mon-a-5fc6c66b55-b8g7v 1/1 Running 0 98m
rook-ceph-operator-d6d6b84dc-krt29 1/1 Running 11 103m
rook-ceph-osd-0-6496c4d87c-rhxpz 1/1 Running 0 97m
rook-ceph-osd-prepare-faust-fb4pc 0/1 Completed 0 4m27s
rook-discover-nhqm2 1/1 Running 0 103m
These pods have been constantly crashing and restarting due to:
I1117 05:48:54.274816 6 leaderelection.go:263] failed to renew lease rook-ceph/rook.io-block: failed to tryAcquireOrRenew context deadline exceeded
F1117 05:48:54.274890 6 controller.go:847] leaderelection lost
I'm running on a single node cluster (k8s v1.16). The machine specs are:
Intel® 2x E5-2660
16C/32T - 2.2/3.0 GHz
256 GB DDR3 RAM
This seems related to #4158
Update: I completely reinstalled ubuntu on my dedicated machine (+ reinstalled everything) and the problem seems to be absent now. I suspect this problem might have been due to a hardware reboot leading to corruption somewhere?
Something else I noticed was that coredns and tiller were also crashing and restarting due to liveness probes failing randomly.
Are you actually seeing the operator pod restarted?
Correct, it restarted with a new container hash. In the ticket above I gave the tail end of the crashed container logs and then the new container's logs after that. FWIW rook operator had been running without issue under that same hash for over a month since we last updated to 1.1.2.
This is what stood out in the old container's logs. F is for fatal?
F1104 00:00:13.648350 8 controller.go:847] leaderelection lost
Same behavior in the previous comment and this ticket.
Is there a possibility that your K8s automation is reseting K8s in some way that would be causing this?
There isn't any external automation that would be resetting stuff like rook. What affected us was a default fstrim timer that comes with the distro we hosted our etcd servers on. We know that increased the disk io and caused rook to originally lose its leader lease. Given the issue with the control plane it was expected that rook was unable to renew (write) its lease to the k8s API when it did. We can see why it would have had that issue in the apiserver and etcd logs.
The surprising part came after it was unable to renew that lease. The operator restarted and began its full orchestration cycle due to the same event that caused it to lose the leader election in the first place.
I'm having trouble repro-ing this reliably. fstrim isn't causing failures in my environment, and I've tried repeatedly deleting kubeadm's etcd-k8s-master-0, kube-apiserver-k8s-master-0, and kube-controller-manager-k8s-master-0 pods without luck.
fstrim isn't the interesting bit here. It happened to be what caused the kubernetes api to become unavailable. The concern is that momentary loss of kubernetes api should not result in rook thinking it needs to create a new cluster.
I am aware that it's not interesting, but I do need a way to reliably repro the issue, and one poster mentioned fstrim caused the problem in their environment. Any help in finding a reliable way to repro would be appreciated.
In this case you just need to create enough disk io until etcd starts timing out and logging things like read-only range request. You can use anything (fio, dd, hdparm, etc.) to increase disk io on the etcd host. Rook will "wake up" by crashing when trying to update its leader lease.
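If it helps, a throwaway stress program along these lines (path, block size, and duration are arbitrary; fio or dd on the etcd data filesystem does the same job) should be enough to push etcd's fsync latency up until lease renewals start failing:

```go
package main

import (
	"log"
	"os"
	"time"
)

// Crude disk-stress loop: write and fsync large blocks on the filesystem that
// backs etcd until lease renewals and etcd range requests start timing out.
func main() {
	f, err := os.Create("/var/lib/etcd/stress.tmp")
	if err != nil {
		log.Fatal(err)
	}
	defer os.Remove(f.Name())
	defer f.Close()

	buf := make([]byte, 4<<20) // 4 MiB per write
	deadline := time.Now().Add(2 * time.Minute)
	for time.Now().Before(deadline) {
		if _, err := f.Write(buf); err != nil {
			log.Fatal(err)
		}
		if err := f.Sync(); err != nil { // the fsyncs are what really hurt etcd latency
			log.Fatal(err)
		}
		if fi, err := f.Stat(); err == nil && fi.Size() > 1<<30 {
			f.Seek(0, 0) // cap the file at ~1 GiB by wrapping around
		}
	}
}
```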
@mkhpalm can you rerun and get a new log with the log level set to "DEBUG"?
I've been trying to test this with the latest master, and I can't repro this despite disk IO causing pretty regular leader election failures.
I think there are 2 problems going on here from what we've seen:
Did you notice the container restarting when it lost leader election? When it does crash it comes back in a blank state and starts trying to do everything any time the control plane has an issue. I assume that behavior is undesirable, right? Losing a leader election shouldn't result in a crash. (examples above, and see the sketch after this list)
The second is maybe more of a safety precaution or hardening issue, which would be less of a concern if the point above didn't happen. I think there is a linchpin @travisn pointed out here: https://github.com/rook/rook/issues/4274#issuecomment-553153087. Despite rook seeing the mons earlier on, it got an empty response and overwrote an existing cluster due to a bad response from the k8s api.
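On the first point, the crash is the standard client-go leader-election pattern used by those controllers: when the lease can't be renewed within the renew deadline (for example because the API server or etcd is slow), OnStoppedLeading fires and the process exits fatally, which is the "leaderelection lost" line in the logs above. A minimal sketch, with illustrative lock name, namespace, and timings:

```go
package example

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	"k8s.io/klog/v2"
)

// runWithLeaderElection blocks while holding the lease and calls run() as leader.
// If the lease cannot be renewed in time, OnStoppedLeading fires and the whole
// process exits, which is why the operator pod restarts and re-orchestrates.
func runWithLeaderElection(ctx context.Context, cs kubernetes.Interface, id string, run func(context.Context)) {
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "rook.io-block", Namespace: "rook-ceph"},
		Client:     cs.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: run,
			// A fatal exit here restarts the whole operator; rejoining the
			// election instead would avoid a full re-orchestration.
			OnStoppedLeading: func() { klog.Fatalf("leaderelection lost") },
		},
	})
}
```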
I believe if either of those two things behaved differently, similar stories to ours wouldn't be popping up out there in the k8s world:
https://medium.com/flant-com/rook-cluster-recovery-580efcd275db
I didn't see any root cause in that blog post, but I'm fairly certain their cluster disappeared due to a similar situation with their control plane. We've seen it happen twice now in 2 different clusters. The other cluster lost one of its etcd hosts and the same thing happened as in this ticket. I should also mention that the other cluster is intentionally chaotic for other purposes, and before the one time it overwrote its ceph cluster, it had survived etcd hosts going down many times. The risk here is that losing the leader election crashes the operator, causing it to run a complete orchestration loop any time there are control plane issues. If it didn't crash/restart, it would be MUCH less likely to run into the situation where it gets a bad response looking up the secret to see if a cluster already exists.
With the rook 1.1.7 release, I've seen rook-ceph-operator restart as well due to "leaderelection lost", and it mostly correlates with the k8s api server not responding.
In my case, the cluster is not disappearing. But some osds were restarted by the operator orchestrating.
2020-02-04 11:08:33.964808 I | op-osd: deployment for osd 1 already exists. updating if needed
2020-02-04 11:08:33.970133 I | op-mon: this is not an upgrade, not performing upgrade checks
2020-02-04 11:08:33.970158 I | op-k8sutil: updating deployment rook-ceph-osd-1
2020-02-04 11:08:40.499416 E | op-osd: failed to update osd deployment 1. failed to get deployment rook-ceph-osd-1. rpc error: code = Unavailable desc = etcdserver: leader changed
2020-02-04 11:08:40.499452 I | op-osd: started deployment for osd 1 (dir=false, type=bluestore)
I0204 11:08:47.152367 6 leaderelection.go:263] failed to renew lease rook-ceph/rook.io-block: failed to tryAcquireOrRenew context deadline exceeded
F0204 11:08:47.152412 6 controller.go:847] leaderelection lost
Is there a config parameter so that the operator can be more tolerant of temporary k8s API failures?
We need a method of artificially creating (reproducing) the API-server-not-responding scenario for developing and testing a solution to this. Does anyone have a way to reliably repro this that I can follow? I haven't had success with kubeadm clusters trying to stop the API server pods themselves.
Since last week we're facing the same problem. After an OSD crashed, the operator dies on the leader election and the pod is constantly restarted. Logs: https://gist.github.com/aberfeldy/98c5a1c38eb3485eea1925597c6c4bd0
@aberfeldy Are you seeing the rook resources (ie. pods) being deleted? Or the operator just keeps crashing? This original issue is around the operator unexpectedly removing resources, which may be different from what you are seeing.
I see some crashcollector pods being started on nodes where they shouldn't but no deletion so far. You're right, maybe I should move this to a separate issue.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.
@travisn @BlaineEXE
I wonder if this issue should stay open until some solution or workaround is found? Destroyed cluster is quite critical problem, in my opinion.
Yes, of course it is critical. Have you seen this issue or know how to repro? If the cluster ever gets destroyed unintentionally we need to get to the bottom of it, but it's still not clear if that is actually happening. Other comments in this issue are around the operator simply restarting, which isn't destructive.
@travisn
Have you seen this issue or know how to repro?
No, I came here after reading the Flant article mentioned in this commit.
It seems like as long as the hardware can handle the load, you won't be able to reproduce this. Maybe scheduling the api server pod to a resource-constrained VM running on a slow old host could help with reproducing this issue?
Unfortunately we also experienced this issue with an AKS cluster - after an AKS control plane outage, Rook recreated the cluster with new secrets, etc., and the OSDs were no longer able to join the cluster.
Not sure what triggered it exactly, but the AKS control plane was unavailable for some time - we didn't have the AKS Uptime SLA feature turned on at the time / or we might have hit some API limits.
We also saw this 'leader election lost' message when the Rook cluster fell apart and started recreating a new one:
@andrewsali Did you by chance capture the operator logs that show the new secrets being created? With the operator restarting frequently I assume they're gone, but the lack of operator logs has made it difficult to track this down.
@travisn I have exported the logs collected from the rook-ceph namespace for that period +/- 20 minutes - I can send them over by email if that works for you (don't want to post all of it here).
@andrewsali you can attach the log file to this issue.
I have updated the log files; I previously had a wrong join - this one now contains only the operator logs and should include all the operator logs from that period.
I have edited my previous post and uploaded the corrected log files (my initial export had the wrong output and didn't contain all log messages). If you need any logs other than the operator's, please let me know.
@andrewsali It appears that you are hitting #5869. To confirm if you hit this issue, you would see this message in the operator log:
op-mon: removing an extra mon. currently 2 are in quorum and only 0 are desired
I don't see that message, so perhaps it was in a log before the portion that you shared. Do you have earlier logs to search? The cluster already seems in a bad state from the logs you shared.
One of the side effects of that issue is that the nodes become unassigned from the mons, such as this. Existing mons should be mapped to a valid node name, rather than "null".
op-mon: saved mon endpoints to config map map[csi-cluster-config-json:[{"clusterID":"rook-ceph","monitors":["10.0.87.232:6789","10.0.231.111:6789"]}] data:a=10.0.87.232:6789,b=10.0.231.111:6789 mapping:{"node":{"a":null,"b":null}} maxMonId:1]
You also have only two mons so the cluster wasn't able to recover from this issue. If there had been three mons, at least two of them would have stayed in quorum. I've long thought we should prevent an even number of mons, but for now we just have a warning in the log. In the future, definitely configure your clusters with 3 mons.
mon count is even (given: 2), should be uneven, continuing
Thanks very much @travisn for taking a look. I searched earlier logs for 'op-mon', but only the following are available - I didn't see the one you mentioned unfortunately:
2020-08-10 03:29:18.935166 I | op-mon: mons running: [a b]
2020-08-10 03:29:19.519948 I | op-mon: Monitors in quorum: [a b]
2020-08-12 20:53:24.726015 W | op-mon: failed to get the list of monitor canary deployments. failed to list deployments with labelSelector app=rook-ceph-mon,mon_canary=true: etcdserver: request timed out
2020-08-13 17:15:04.239423 I | op-mon: parsing mon endpoints: a=10.0.212.95:6789,b=10.0.214.45:6789
2020-08-13 17:15:07.839596 I | op-mon: parsing mon endpoints: a=10.0.212.95:6789,b=10.0.214.45:6789
2020-08-13 17:15:27.494218 I | op-mon: parsing mon endpoints: a=10.0.212.95:6789,b=10.0.214.45:6789
2020-08-13 17:15:28.282746 I | op-mon: start running mons
2020-08-13 17:15:28.299843 I | op-mon: parsing mon endpoints: a=10.0.212.95:6789,b=10.0.214.45:6789
2020-08-13 17:15:28.330430 I | op-mon: saved mon endpoints to config map map[csi-cluster-config-json:[{"clusterID":"rook-ceph","monitors":["10.0.212.95:6789","10.0.214.45:6789"]}] data:a=10.0.212.95:6789,b=10.0.214.45:6789 mapping:{"node":{"a":null,"b":null}} maxMonId:1]
2020-08-13 17:15:29.171933 I | op-mon: targeting the mon count 2
For sure, we are going to use 3 monitors going forward - will report here if we see any similar problems.
Hi,
Seems I'm also affected by this - I had a disaster on Wednesday, yesterday, and just now. I'm running a (slightly changed: https://github.com/LittleFox94/rook/commit/b512059012b3f14522ea2fca6473781a68adfd86) 1.2.7 release, because I wasn't able to migrate my filestore OSDs yet.
Half an hour before it went down, the mgr was still running but responded only with 503 to the liveness check. The logs contained only the 503s from the liveness check mixed with the current cluster state (showing correct storage, so the OSDs and correct MONs were still up). Sadly no logs from that. I tried restarting the mgr by deleting the pod, and then this happened:
Logs until I scaled down the operator to manually recover my cluster: https://pastebin.com/YPNiXW4w, probably the interesting part:
I0829 13:49:59.309447 6 controller.go:818] Started provisioner controller rook.io/block_rook-ceph-operator-ff9b5947-h4fzv_7d8e6eb3-e9fe-11ea-9b88-0242c0a81005!
2020-08-29 13:50:21.755623 E | CmdReporter: continuing after failing delete job rook-ceph-detect-version; user may need to delete it manually. failed to remove previous provisioning job for node rook-ceph-detect-version. jobs.batch "rook-ceph-detect-version" not found
2020-08-29 13:50:21.764960 I | op-cluster: Detected ceph image version: "14.2.11-0 nautilus"
2020-08-29 13:50:21.775765 I | op-mon: parsing mon endpoints: ah=192.168.129.186:6789,af=192.168.130.70:6789,ae=192.168.134.165:6789
2020-08-29 13:50:21.775990 I | op-mon: loaded: maxMonID=33, mons=map[ae:0xc000c96be0 af:0xc000c96ba0 ah:0xc000c96b60], mapping=&{Node:map[ae:0xc000f51ce0 af:0xc000f51d10 ah:0xc000f51d40]}
2020-08-29 13:50:21.776556 I | cephconfig: writing config file /var/lib/rook/kube-system/kube-system.config
2020-08-29 13:50:21.776977 I | cephconfig: generated admin config in /var/lib/rook/kube-system
2020-08-29 13:50:21.777334 I | exec: Running command: ceph versions --connect-timeout=15 --cluster=kube-system --conf=/var/lib/rook/kube-system/kube-system.config --keyring=/var/lib/rook/kube-system/client.admin.keyring --format json --out-file /tmp/172316857
2020-08-29 13:50:22.293132 I | op-cluster: cluster "kube-system": version "14.2.11-0 nautilus" detected for image "ceph/ceph:v14.2.11"
2020-08-29 13:50:22.309413 I | op-cluster: CephCluster "kube-system" status: "Creating".
2020-08-29 13:50:22.336763 I | op-mon: start running mons
2020-08-29 13:50:22.356779 I | op-mon: saved mon endpoints to config map map[csi-cluster-config-json:[{"clusterID":"kube-system","monitors":[]}] data: mapping:{"node":{}} maxMonId:-1]
I can confirm the operator crashes when leader election fails; the pod restarted after that (and did so 4 times while I was doing my disaster recovery today):
2020-08-29 14:50:00.052851 I | op-k8sutil: updating deployment rook-ceph-rgw-external-s3-b
I0829 14:50:12.611893 6 leaderelection.go:263] failed to renew lease rook-system/ceph.rook.io-block: failed to tryAcquireOrRenew context deadline exceeded
F0829 14:50:12.611947 6 controller.go:847] leaderelection lost
I0829 14:50:12.617823 6 event.go:209] Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"rook-system", Name:"ceph.rook.io-block", UID:"b4d603f6-7376-11ea-b395-6ae4fb1356bb", APIVersion:"v1", ResourceVersion:"207481865", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' rook-ceph-operator-ff9b5947-7b6pm_63df7ed9-ea06-11ea-b024-0242c0a8100a stopped leading
My etcd cluster has some problems when I have high I/O, it's some VPS with hard disks and the same fs is used for system, docker, etcd and rook (with filestore OSDs). Not ideal, but best I currently have. I'm running 3 controller nodes with a fully replicated control plane (each having etcd, kube-apiserver, kube-scheduler and kube-controller-manager). My controller nodes are worker nodes, too (so kube-proxy and kubelet). Control plane custom deployed with ansible, running on debian 10. Currently k8s 1.19.0-rc.4, I think this problem started with upgrading to the rc release - but also have dualstack since then.
Please tell me what you may need. I will recover my cluster now and keep the operator scaled down for now; it does more harm than good currently ^^'
Disaster recovery
To recover I have to do the following steps:
data:
  csi-cluster-config-json: '[{"clusterID":"kube-system","monitors":["$monitor_ips:6789",...]}]'
  data: $monitor_name=$monitor_ip:6789,...
  mapping: '{"node":{"$monitor_name":{"Name":"$monitor_node","Hostname":"$monitor_node","Address":"$monitor_node_ip"},...}'
  maxMonId: "33" # (this is for "ah", use the highest ID in use in your cluster)
data:
  osd-dirs: '{"/var/lib/rook":$osd_id_on_this_node}' # might be filestore specific
kubectl delete deploy -l app=rook-ceph-osd
kubectl delete deploy -l app=rook-ceph-mon
kubectl delete job -l app=rook-ceph-osd-prepare
apiVersion: v1
kind: Service
metadata:
  labels:
    app: rook-ceph-mon
    ceph_daemon_id: $mon_id
    mon: $mon_id
    mon_cluster: kube-system
    rook_cluster: kube-system
  name: rook-ceph-mon-$mon_id
  namespace: kube-system
  ownerReferences:
  - apiVersion: ceph.rook.io/v1
    blockOwnerDeletion: true
    kind: CephCluster
    name: rook
    uid: $ceph_cluster_uuid # not sure how important that is
spec:
  ipFamily: IPv4
  clusterIP: $mon_ip
  ports:
  - name: msgr1
    port: 6789
    protocol: TCP
    targetPort: 6789
  - name: msgr2
    port: 3300
    protocol: TCP
    targetPort: 3300
  selector:
    app: rook-ceph-mon
    ceph_daemon_id: $mon_id
    mon: $mon_id
    mon_cluster: kube-system
    rook_cluster: kube-system
  sessionAffinity: None
  type: ClusterIP
Look at this closely; it was copy&pasted in a hurry and might have things in it specific to my environment. I just wanted to fix mine, maybe help someone, and continue watching Return of the King with my GF ^^'
Happened again: all deployments (osd, mgr, mon, rgw) and the services for the monitors were gone. The operator was not running; when starting it, it created a new cluster from scratch. Seems like the operator isn't the one removing the things. I had high load on my cluster due to a Ceph rebalance (migrated my first OSD to bluestore), so maybe etcd had a major hiccup from that.
The operator was not running; deployments, services and other resources were completely removed. I looked in the correct logs this time: the garbage collector removed them. Now tracking down why.
My problem seems to be the k8s garbage collector, probably matching issue: https://github.com/kubernetes/kubernetes/issues/88097
I disabled the GC for now and report back if the problem comes again
I ran into this issue a few weeks ago.
When the control plane becomes unresponsive for some time (can replicate via load testing with scheduleable masters...this was a test environment), Kubernetes will rebuild its internal object graph and objects like the Ceph mon, osd, mgr will get cleaned up by the GC due to the CephCluster uid (if I recall correctly) no longer matching what's given in the ownerRefs, which triggers the rebuild.
Since this environment was a bare metal platform with static addresses, I worked around it by removing the ownerRefs from the objects and setting the external toggle in CephCluster.
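For anyone wanting to apply the same workaround, here is a sketch of removing the ownerReferences with client-go (an equivalent kubectl JSON patch also works; the function name and namespace/name arguments are illustrative):

```go
package example

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// stripOwnerRefs removes the ownerReferences from a deployment so the k8s
// garbage collector no longer considers it a dependent of the CephCluster
// and will not delete it if the owner looks orphaned.
func stripOwnerRefs(ctx context.Context, cs kubernetes.Interface, ns, name string) error {
	// Note: a JSON patch "remove" fails if the object has no ownerReferences.
	patch := []byte(`[{"op":"remove","path":"/metadata/ownerReferences"}]`)
	_, err := cs.AppsV1().Deployments(ns).Patch(ctx, name, types.JSONPatchType, patch, metav1.PatchOptions{})
	return err
}
```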
I read somewhere that the controller, in this case Rook, would have to adopt the orphaned objects (before GC); but this is a bit racy as the operator could come up after objects have been GCed.
In my case the UID still matched, but with the apiserver and etcd being unstable, the GC didn't load the resources but also didn't bail out on error - at least it looked like that. Are you sure the UID didn't match, @d-luu?
@d-luu When the GC rebuilds its tree, the UUIDs will change every time and in that case the GC is expected to ignore the UUIDs that don't match. The GC will only work based on the other ownerReference attributes where the UUID changes, so there must be some other issue.
Aren't the UUIDs generated by the apiserver, and doesn't the GC only use them for building its internal tree?
I was testing in a cluster where the master node was in NotReady state for some time - I rebooted to bring it back, and also restarted the rook operator after that. It is hanging on the last log line:
2020-12-10 07:29:55.476626 I | rookcmd: starting Rook v1.5.1 with arguments '/usr/local/bin/rook ceph operator'
2020-12-10 07:29:55.476934 I | rookcmd: flag values: --add_dir_header=false, --alsologtostderr=false, --csi-cephfs-plugin-template-path=/etc/ceph-csi/cephfs/csi-cephfsplugin.yaml, --csi-cephfs-provisioner-dep-template-path=/etc/ceph-csi/cephfs/csi-cephfsplugin-provisioner-dep.yaml, --csi-rbd-plugin-template-path=/etc/ceph-csi/rbd/csi-rbdplugin.yaml, --csi-rbd-provisioner-dep-template-path=/etc/ceph-csi/rbd/csi-rbdplugin-provisioner-dep.yaml, --enable-discovery-daemon=false, --enable-flex-driver=false, --enable-machine-disruption-budget=false, --help=false, --kubeconfig=, --log-flush-frequency=5s, --log-level=INFO, --log_backtrace_at=:0, --log_dir=, --log_file=, --log_file_max_size=1800, --logtostderr=true, --master=, --mon-healthcheck-interval=45s, --mon-out-timeout=10m0s, --operator-image=, --service-account=, --skip_headers=false, --skip_log_headers=false, --stderrthreshold=2, --v=0, --vmodule=
2020-12-10 07:29:55.476947 I | cephcmd: starting Rook-Ceph operator
2020-12-10 07:29:55.687410 I | cephcmd: base ceph version inside the rook operator image is "ceph version 15.2.5 (2c93eff00150f0cc5f106a559557a58d3d7b6f1f) octopus (stable)"
2020-12-10 07:29:55.693283 I | operator: looking for secret "rook-ceph-admission-controller"
2020-12-10 07:29:55.696305 I | operator: secret "rook-ceph-admission-controller" not found. proceeding without the admission controller
2020-12-10 07:29:55.697893 I | operator: watching all namespaces for ceph cluster CRs
2020-12-10 07:29:55.700078 I | operator: setting up the controller-runtime manager
2020-12-10 07:29:55.702261 I | ceph-cluster-controller: ConfigMap "rook-ceph-operator-config" changes detected. Updating configurations
2020-12-10 07:29:56.260615 I | ceph-cluster-controller: successfully started
2020-12-10 07:29:56.266295 I | ceph-cluster-controller: enabling hotplug orchestration
2020-12-10 07:29:56.266409 I | ceph-crashcollector-controller: successfully started
2020-12-10 07:29:56.267027 I | ceph-block-pool-controller: successfully started
2020-12-10 07:29:56.267164 I | ceph-object-store-user-controller: successfully started
2020-12-10 07:29:56.267276 I | ceph-object-realm-controller: successfully started
2020-12-10 07:29:56.267399 I | ceph-object-zonegroup-controller: successfully started
2020-12-10 07:29:56.267495 I | ceph-object-zone-controller: successfully started
2020-12-10 07:29:56.267853 I | ceph-object-controller: successfully started
2020-12-10 07:29:56.267986 I | ceph-file-controller: successfully started
2020-12-10 07:29:56.268117 I | ceph-nfs-controller: successfully started
2020-12-10 07:29:56.268396 I | operator: starting the controller-runtime manager
2020-12-10 07:29:56.572527 I | ceph-cluster-controller: reconciling ceph cluster in namespace "rook-ceph"
2020-12-10 07:29:56.580249 I | op-k8sutil: ROOK_CSI_ENABLE_RBD="true" (configmap)
2020-12-10 07:29:56.589101 I | op-k8sutil: ROOK_CSI_ENABLE_CEPHFS="true" (configmap)
2020-12-10 07:29:56.594377 I | op-k8sutil: ROOK_CSI_ALLOW_UNSUPPORTED_VERSION="false" (configmap)
2020-12-10 07:29:56.596255 I | op-k8sutil: ROOK_CSI_ENABLE_GRPC_METRICS="true" (configmap)
2020-12-10 07:29:56.602286 I | op-k8sutil: ROOK_CSI_CEPH_IMAGE="quay.io/cephcsi/cephcsi:v3.1.2" (default)
2020-12-10 07:29:56.608561 I | op-k8sutil: ROOK_CSI_REGISTRAR_IMAGE="k8s.gcr.io/sig-storage/csi-node-driver-registrar:v2.0.1" (default)
2020-12-10 07:29:56.613899 I | op-k8sutil: ROOK_CSI_PROVISIONER_IMAGE="k8s.gcr.io/sig-storage/csi-provisioner:v2.0.0" (default)
2020-12-10 07:29:56.617426 I | op-k8sutil: ROOK_CSI_ATTACHER_IMAGE="k8s.gcr.io/sig-storage/csi-attacher:v3.0.0" (default)
2020-12-10 07:29:56.620742 I | op-k8sutil: ROOK_CSI_SNAPSHOTTER_IMAGE="k8s.gcr.io/sig-storage/csi-snapshotter:v3.0.0" (default)
2020-12-10 07:29:56.623337 I | op-k8sutil: ROOK_CSI_KUBELET_DIR_PATH="/var/lib/kubelet" (default)
2020-12-10 07:29:56.780412 I | op-k8sutil: ROOK_CSI_CEPHFS_POD_LABELS="" (default)
2020-12-10 07:29:56.979781 I | op-k8sutil: ROOK_CSI_RBD_POD_LABELS="" (default)
2020-12-10 07:29:57.386031 I | ceph-csi: successfully created csi config map "rook-ceph-csi-config"
2020-12-10 07:29:57.386324 I | ceph-csi: detecting the ceph csi image version for image "quay.io/cephcsi/cephcsi:v3.1.2"
2020-12-10 07:29:57.779476 I | op-k8sutil: CSI_PROVISIONER_TOLERATIONS="" (default)
2020-12-10 07:29:57.980047 I | op-mon: parsing mon endpoints: a=10.107.237.178:6789,d=10.105.199.73:6789,f=10.109.127.93:6789
2020-12-10 07:29:58.383299 I | ceph-cluster-controller: detecting the ceph image version for image ceph/ceph:v15.2.5...
2020-12-10 07:30:02.140520 I | ceph-csi: Detected ceph CSI image version: "v3.1.2"
2020-12-10 07:30:02.148123 I | op-k8sutil: CSI_FORCE_CEPHFS_KERNEL_CLIENT="true" (configmap)
2020-12-10 07:30:02.149853 I | op-k8sutil: CSI_CEPHFS_GRPC_METRICS_PORT="9091" (default)
2020-12-10 07:30:02.151772 I | op-k8sutil: CSI_CEPHFS_LIVENESS_METRICS_PORT="9081" (default)
2020-12-10 07:30:02.158089 I | op-k8sutil: CSI_RBD_GRPC_METRICS_PORT="9090" (default)
2020-12-10 07:30:02.162112 I | op-k8sutil: CSI_RBD_LIVENESS_METRICS_PORT="9080" (default)
2020-12-10 07:30:02.163769 I | op-k8sutil: CSI_PLUGIN_PRIORITY_CLASSNAME="" (default)
2020-12-10 07:30:02.166017 I | op-k8sutil: CSI_PROVISIONER_PRIORITY_CLASSNAME="" (default)
2020-12-10 07:30:02.167769 I | op-k8sutil: CSI_CEPHFS_PLUGIN_UPDATE_STRATEGY="RollingUpdate" (default)
2020-12-10 07:30:02.314579 I | op-k8sutil: CSI_RBD_PLUGIN_UPDATE_STRATEGY="RollingUpdate" (default)
2020-12-10 07:30:02.314617 I | ceph-csi: Kubernetes version is 1.19
2020-12-10 07:30:02.513633 I | op-k8sutil: ROOK_CSI_RESIZER_IMAGE="k8s.gcr.io/sig-storage/csi-resizer:v1.0.0" (default)
2020-12-10 07:30:02.727730 I | op-k8sutil: CSI_LOG_LEVEL="" (default)
2020-12-10 07:30:03.133505 I | ceph-csi: successfully started CSI Ceph RBD
2020-12-10 07:30:03.140987 I | ceph-csi: successfully started CSI CephFS driver
2020-12-10 07:30:03.315440 I | ceph-cluster-controller: detected ceph image version: "15.2.5-0 octopus"
2020-12-10 07:30:03.315481 I | ceph-cluster-controller: validating ceph version from provided image
2020-12-10 07:30:03.514113 I | op-k8sutil: CSI_PROVISIONER_TOLERATIONS="" (default)
2020-12-10 07:30:03.912858 I | op-k8sutil: CSI_PROVISIONER_NODE_AFFINITY="" (default)
2020-12-10 07:30:04.116994 I | op-mon: parsing mon endpoints: a=10.107.237.178:6789,d=10.105.199.73:6789,f=10.109.127.93:6789
2020-12-10 07:30:04.333367 I | op-k8sutil: CSI_PLUGIN_TOLERATIONS="" (default)
2020-12-10 07:30:04.515300 I | cephclient: writing config file /var/lib/rook/rook-ceph/rook-ceph.config
2020-12-10 07:30:04.515774 I | cephclient: generated admin config in /var/lib/rook/rook-ceph
2020-12-10 07:30:04.713735 I | op-k8sutil: CSI_PLUGIN_NODE_AFFINITY="" (default)
2020-12-10 07:30:04.887648 I | ceph-cluster-controller: cluster "rook-ceph": version "15.2.5-0 octopus" detected for image "ceph/ceph:v15.2.5"
2020-12-10 07:30:04.917199 I | op-k8sutil: CSI_RBD_PLUGIN_RESOURCE="" (default)
2020-12-10 07:30:05.120212 I | op-mon: start running mons
2020-12-10 07:30:05.313566 I | op-k8sutil: CSI_RBD_PROVISIONER_RESOURCE="" (default)
2020-12-10 07:30:05.913455 I | op-mon: parsing mon endpoints: a=10.107.237.178:6789,d=10.105.199.73:6789,f=10.109.127.93:6789
2020-12-10 07:30:06.722179 I | op-mon: saved mon endpoints to config map map[csi-cluster-config-json:[{"clusterID":"rook-ceph","monitors":["10.107.237.178:6789","10.105.199.73:6789","10.109.127.93:6789"]}] data:a=10.107.237.178:6789,d=10.105.199.73:6789,f=10.109.127.93:6789 mapping:{"node":{"a":{"Name":"k8s-cluster-2.novalocal","Hostname":"k8s-cluster-2.novalocal","Address":"172.16.0.4"},"b":{"Name":"k8s-cluster-4.novalocal","Hostname":"k8s-cluster-4.novalocal","Address":"172.16.0.6"},"c":{"Name":"k8s-cluster-1.novalocal","Hostname":"k8s-cluster-1.novalocal","Address":"172.16.0.3"},"d":{"Name":"k8s-cluster-4.novalocal","Hostname":"k8s-cluster-4.novalocal","Address":"172.16.0.6"},"e":{"Name":"k8s-cluster-1.novalocal","Hostname":"k8s-cluster-1.novalocal","Address":"172.16.0.3"},"f":{"Name":"k8s-cluster-1.novalocal","Hostname":"k8s-cluster-1.novalocal","Address":"172.16.0.3"}}} maxMonId:5]
2020-12-10 07:30:06.914123 I | op-k8sutil: CSI_CEPHFS_PLUGIN_RESOURCE="" (default)
2020-12-10 07:30:07.313394 I | op-k8sutil: CSI_CEPHFS_PROVISIONER_RESOURCE="" (default)
2020-12-10 07:30:07.913466 I | cephclient: writing config file /var/lib/rook/rook-ceph/rook-ceph.config
2020-12-10 07:30:07.913860 I | cephclient: generated admin config in /var/lib/rook/rook-ceph
2020-12-10 07:30:08.528228 I | ceph-csi: CSIDriver object updated for driver "rook-ceph.rbd.csi.ceph.com"
2020-12-10 07:30:08.536394 I | ceph-csi: CSIDriver object updated for driver "rook-ceph.cephfs.csi.ceph.com"
2020-12-10 07:30:09.515474 I | op-mon: targeting the mon count 3
though it is finding the three mons
alex@N-20HEPF0ZU9PR:/mnt/c/Users/acp/Documents/Coding/daas_project/dc-deployments/argocd/apps/vm-1$ kubectl -n rook-ceph get pods
NAME READY STATUS RESTARTS AGE
csi-cephfsplugin-j2fq5 3/3 Running 0 168m
csi-cephfsplugin-j95sv 3/3 Running 0 151m
csi-cephfsplugin-provisioner-7dc78747bf-4q48l 6/6 Running 12 3h26m
csi-cephfsplugin-provisioner-7dc78747bf-hznsj 6/6 Running 0 44m
csi-cephfsplugin-v5grb 3/3 Running 0 3h26m
csi-rbdplugin-dj9gb 3/3 Running 0 168m
csi-rbdplugin-ffbmf 3/3 Running 0 151m
csi-rbdplugin-pkjww 3/3 Running 0 3h26m
csi-rbdplugin-provisioner-54d48757b4-s5tfm 6/6 Running 13 3h26m
csi-rbdplugin-provisioner-54d48757b4-wh8ww 6/6 Running 0 44m
rook-ceph-crashcollector-k8s-cluster-1.novalocal-6d65fc6d4clsmn 0/1 Init:0/2 0 44m
rook-ceph-crashcollector-k8s-cluster-2.novalocal-58cbd7d5d6c8mf 0/1 Init:0/2 0 95m
rook-ceph-crashcollector-k8s-cluster-4.novalocal-749d9cfb68mrsm 0/1 Init:0/2 0 84m
rook-ceph-mon-a-99b8744cf-wglqj 1/1 Running 0 76s
rook-ceph-mon-d-85d778df5f-9xxhs 1/1 Running 0 76s
rook-ceph-mon-f-764df56958-2dn48 1/1 Running 0 75s
rook-ceph-operator-78554769bd-dgwng 1/1 Running 0 50s
rook-ceph-tools-6f44db7c58-xgpt9 1/1 Running 0 3h32m
I was testing in a cluster where the master node was in NotReady state for some time - I rebooted to bring it back, and also restarted the rook operator after that. It is hanging on the last log line.
@alexcpn If you only have three mons named a,d,f it doesn't sound like your cluster is healthy. Was the cluster in a healthy state before the master had the issue and you restarted?
From this line, it looks like the mons were already named a,d,f so at least it didn't destroy the cluster metadata as suggested could happen with this original issue. However, mons a,d,f likely means things were not previously healthy anyway.
op-mon: parsing mon endpoints: a=10.107.237.178:6789,d=10.105.199.73:6789,f=10.109.127.93:6789
@travisn why do skipped letters mean "cluster is not healthy"? I run a cluster with monitors named i, l, n because I have replaced nodes multiple times. Or does it apply particularly to the case when no osds are running as well?
@zerkms The mons can be other letters like in your example; that can be expected. But if the mons are named a,d,f, that specific case is probably an error 99% of the time. This is basically an operator bug where it tries to create all three mons instead of stopping when the first one doesn't form quorum.
I'd like to contribute my two cents. We were hit by this a month ago and unfortunately did not know about this issue at that time, and it was more important for the cluster in question to resume service than to recover the data, so we started from scratch and have no logs to contribute. What we can contribute, however, is that we are almost 100% certain this was triggered by an upgrade to the cluster through "rke up" (Rancher). We were able to upgrade from k8s 1.16 to 1.17 to 1.18 without issue; however, the issue manifested very shortly after the upgrade from 1.18 to 1.19. At the time, we only had 1 api server, 1 etcd, and 1 mon (bad idea, I know, but it was a development cluster), but 3 osds. After rebuilding the ceph cluster, we were able to upgrade from k8s 1.19 to 1.20 without issue (with 3 api servers, etcd instances, and mons this time). We don't have the hardware available to try to do a full repro, but perhaps someone else can take this information and run with it.
While we (the maintainers) are concerned about this issue, closing due to the age of the issue and lack of repro. If more details are found for this issue, we will certainly seek to address it.
First, thanks for rook!
We had an incident with our k8s control plane that cascaded in an especially bad way with rook. We thought this ticket might help identify bugs or areas to improve in rook.
Deviation from expected behavior:
Expected behavior:
Expected rook not to be orchestrating when no changes were made and expected it to be able to identify valid responses before making changes to state.
How to reproduce it (minimal and precise):
File(s) to submit:
Sequence of events assuming I didn't miss anything:
00:00:00 - a process kicks off causing high disk io on all the nodes hosting kube-etcd, causing error responses and timeouts for everything including the rook operator. This lasts for about 2 minutes.
00:00:13 - rook fails to submit its leader lease because of the instability of the kube api and crashes/exits/restarts
00:00:17 - a new rook container has started up and begins its orchestration cycle with an increased error response rate from the kube api
00:01:15 - rook seems to understand that the existing healthy mons [d e f] exist (these are upgraded mons from the msgr2 conversion months ago)
00:01:15 - rook logs an error that it failed to update mgr and starts the rook-ceph-version-detection job. The job's output has the version in it but rook doesn't seem to be getting the message
00:01:45 - kube-etcd and kube-api are completely stabilized
00:03:14 - rook logs another error: cephconfig: clusterInfo:
Existing long-running rook-ceph-operator container logs
Restarted rook-ceph-operator container logs
At this point we are in a disaster recovery situation:
https://github.com/rook/rook/blob/release-1.1/Documentation/disaster-recovery.md
Environment: