Just realized that this change in rook v1.1.6 might have avoided the part where it clobbered our existing cluster. Assuming that with that change it wouldn't have tried to orchestrate when it did, or when the rook operator restarted.
In testing v1.1.6, it does not appear that #4252 would have prevented the operator from clobbering a healthy ceph cluster.
It's expected that the operator will start a new orchestration when it restarts, and also in several other scenarios such as when the cluster CR is updated. This is to ensure the operator will maintain the desired state in the cluster.
The key lines in the log are here where we see the cluster get reset:
2019-11-04 00:03:14.024417 I | op-cluster: Detected ceph image version: 14.2.4 nautilus
2019-11-04 00:03:14.027408 E | cephconfig: clusterInfo: <nil>
2019-11-04 00:03:14.027439 I | op-cluster: CephCluster rook-ceph status: Creating.
2019-11-04 00:03:14.053487 I | op-mon: start running mons
2019-11-04 00:03:14.059503 I | exec: Running command: ceph-authtool --create-keyring /var/lib/rook/rook-ceph/mon.keyring --gen-key -n mon. --cap mon 'allow *'
2019-11-04 00:03:14.117428 I | exec: Running command: ceph-authtool --create-keyring /var/lib/rook/rook-ceph/client.admin.keyring --gen-key -n client.admin --cap mon 'allow *' --cap osd 'allow *' --cap mgr 'allow *' --cap mds 'allow'
2019-11-04 00:03:14.187273 I | op-mon: creating mon secrets for a new cluster
Walking through this code path, the k8s API must have returned NotFound
for the secret (see here) where the basic cluster info is stored. If the secret is not found, Rook will assume that it's a new cluster and create new creds here.
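For readers unfamiliar with that path, here is a minimal sketch of the decision point being described (not Rook's actual code; the secret name, return shape, and plain client-go usage are illustrative): only a genuine NotFound should mean "new cluster", and that is exactly the branch a flaky API server can push the operator into.

```go
package example

import (
	"context"
	"fmt"

	kerrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// loadClusterInfo sketches the decision point: only a genuine NotFound from a
// healthy API server should be treated as "no existing cluster".
func loadClusterInfo(ctx context.Context, cs kubernetes.Interface, ns string) (map[string][]byte, error) {
	secret, err := cs.CoreV1().Secrets(ns).Get(ctx, "rook-ceph-mon", metav1.GetOptions{})
	if kerrors.IsNotFound(err) {
		// Dangerous branch: if an unstable API server spuriously reports NotFound,
		// the operator generates brand-new mon/admin keys and clobbers the cluster.
		return nil, nil // caller treats nil as "create new credentials"
	}
	if err != nil {
		// Any other error should abort the orchestration instead.
		return nil, fmt.Errorf("failed to get mon secret: %w", err)
	}
	return secret.Data, nil
}
```

The design question this thread keeps circling back to is whether anything short of a confirmed NotFound should ever be allowed to trigger new credential generation.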
The killer here is that if the k8s API is returning invalid responses, the operator will do the wrong thing. If K8s returns invalid responses to the caller instead of failure codes, we really need a fix from K8s, or else a way to know when the K8s API is unstable.
Thanks, I was trying to figure out exactly where it got onto another path but was never fully sure based on my limited understanding.
Curious about two things:
I ask because those two behaviors combined put rook into orchestration mode during various periods of instability (control plane issues, operator host issues, network partitions, etc.). It seems to me that if one or the other behaved differently, the risk of what happened to us would drop substantially.
The leader election is an internal detail of the controllers in the operator. When they get reset, they might cause the operator to think it's time to start a new orchestration, as if it had just restarted. Are you actually seeing the operator pod restarted? If so I'd like to see the previous logs from when the operator shut down. Or if the operator is just kicking off an orchestration, then it's a normal event.
The way the operator manages state is by re-applying the state. These actions are idempotent and are intended to have no side effects if the state is already applied. But if there is an upgrade, there could be a change to a pod spec and the ceph daemon pods would be restarted. This is necessary for upgrades. This is how operators are designed... They ensure "desired" state is applied rather than performing imperative tasks once.
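As a rough illustration of that "re-apply desired state" model, here is a sketch using controller-runtime's CreateOrUpdate helper (not Rook's actual code; names and fields are illustrative):

```go
package example

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// ensureDaemonDeployment re-applies the desired state for one daemon.
// Running it again with the same inputs is a no-op; only a real difference
// (e.g. a new image during an upgrade) results in an Update and a pod restart.
func ensureDaemonDeployment(ctx context.Context, c client.Client, ns, name, image string) error {
	labels := map[string]string{"app": name}
	d := &appsv1.Deployment{ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: ns}}
	_, err := controllerutil.CreateOrUpdate(ctx, c, d, func() error {
		// Mutate the object to the desired state, whether or not it exists yet.
		d.Spec.Selector = &metav1.LabelSelector{MatchLabels: labels}
		d.Spec.Template.ObjectMeta.Labels = labels
		d.Spec.Template.Spec.Containers = []corev1.Container{{Name: name, Image: image}}
		return nil
	})
	return err
}
```

Called twice with the same image, the second call changes nothing; called after bumping the image, it updates the deployment and the pods roll, which matches the upgrade behavior described above.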
Kubernetes as a distributed application platform must be able to survive network partitions and other temporary or catastrophic events. It's based on etcd for its config store for this reason. If there is ever a loss of quorum, etcd will halt and the cluster should stop working. Similarly, Ceph is also designed to halt if the cluster is too unhealthy, rather than continuing and corrupting things.
Is there a possibility that your K8s automation is resetting K8s in some way that would be causing this? I haven't heard of others experiencing this issue. This corruption is completely unexpected from K8s. Otherwise, Rook or other applications can't rely on it as a distributed platform.
I'm having the same problem on my kubernetes cluster with rook-ceph v1.1.
csi-cephfsplugin-provisioner-75c965db4f-7g8sd 4/4 Running 15 99m
csi-cephfsplugin-provisioner-75c965db4f-k7tlb 4/4 Running 7 99m
csi-cephfsplugin-spj6n 3/3 Running 0 99m
csi-rbdplugin-provisioner-56cbc4d585-jpt98 5/5 Running 10 99m
csi-rbdplugin-provisioner-56cbc4d585-rfndh 5/5 Running 21 99m
csi-rbdplugin-vr6hm 3/3 Running 0 99m
rook-ceph-mgr-a-799cd96bdd-vc8dj 1/1 Running 9 98m
rook-ceph-mon-a-5fc6c66b55-b8g7v 1/1 Running 0 98m
rook-ceph-operator-d6d6b84dc-krt29 1/1 Running 11 103m
rook-ceph-osd-0-6496c4d87c-rhxpz 1/1 Running 0 97m
rook-ceph-osd-prepare-faust-fb4pc 0/1 Completed 0 4m27s
rook-discover-nhqm2 1/1 Running 0 103m
These pods have been constantly crashing and restarting due to:
I1117 05:48:54.274816 6 leaderelection.go:263] failed to renew lease rook-ceph/rook.io-block: failed to tryAcquireOrRenew context deadline exceeded
F1117 05:48:54.274890 6 controller.go:847] leaderelection lost
I'm running on a single node cluster (k8s v1.16). The machine specs are:
Intel® 2x E5-2660
16C/32T - 2.2/3.0 GHz
256 GB DDR3 RAM
This seems related to #4158
Update: I completely reinstalled ubuntu on my dedicated machine (+ reinstalled everything) and the problem seems to be absent now. I suspect this problem might have been due to a hardware reboot leading to corruption somewhere?
Something else I noticed was that coredns and tiller were also crashing and restarting due to liveness probes failing randomly.
Are you actually seeing the operator pod restarted?
Correct, it restarted with a new container hash. In the ticket above I gave the tail end of the crashed container logs and then the new container's logs after that. FWIW rook operator had been running without issue under that same hash for over a month since we last updated to 1.1.2.
This is what stood out in the old container's logs. F is for fatal?
F1104 00:00:13.648350 8 controller.go:847] leaderelection lost
Same behavior in the previous comment and this ticket.
Is there a possibility that your K8s automation is reseting K8s in some way that would be causing this?
There isn't any external automation that would be resetting stuff like rook. What affected us was a default fstrim timer that comes with the distro we hosted our etcd servers on. We know that increased the disk io and caused rook to originally lose its leader lease. Given the issue with the control plane it was expected that rook was unable to renew (write) its lease to the k8s API when it did. We can see why it would have had that issue in the apiserver and etcd logs.
The surprising part came after it was unable to renew that lease. The operator restarted and began its full orchestration cycle due to the same event that caused it to lose the leader election in the first place.
I'm having trouble repro-ing this reliably. fstrim isn't causing failures in my environment, and I've tried repeatedly deleting kubeadm's etcd-k8s-master-0, kube-apiserver-k8s-master-0, and kube-controller-manager-k8s-master-0 pods without luck.
fstrim isn't the interesting bit here. It happened to be what caused the kubernetes api to become unavailable. The concern is that momentary loss of kubernetes api should not result in rook thinking it needs to create a new cluster.
I am aware that it's not interesting, but I do need a way to reliably repro the issue, and one poster mentioned fstrim caused the problem in their environment. Any help in finding a reliable way to repro would be appreciated.
In this case you just need to create enough disk io until etcd starts timing out and logging things like read-only range request. You can use anything (fio, dd, hdparm, etc.) to increase disk io on the etcd host. Rook will "wake up" by crashing when trying to update its leader lease.
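If it helps, a throwaway stress program along these lines (path, block size, and duration are arbitrary; fio or dd on the etcd data filesystem does the same job) should be enough to push etcd's fsync latency up until lease renewals start failing:

```go
package main

import (
	"log"
	"os"
	"time"
)

// Crude disk-stress loop: write and fsync large blocks on the filesystem that
// backs etcd until lease renewals and etcd range requests start timing out.
func main() {
	f, err := os.Create("/var/lib/etcd/stress.tmp")
	if err != nil {
		log.Fatal(err)
	}
	defer os.Remove(f.Name())
	defer f.Close()

	buf := make([]byte, 4<<20) // 4 MiB per write
	deadline := time.Now().Add(2 * time.Minute)
	for time.Now().Before(deadline) {
		if _, err := f.Write(buf); err != nil {
			log.Fatal(err)
		}
		if err := f.Sync(); err != nil { // the fsyncs are what really hurt etcd latency
			log.Fatal(err)
		}
		if fi, err := f.Stat(); err == nil && fi.Size() > 1<<30 {
			f.Seek(0, 0) // cap the file at ~1 GiB by wrapping around
		}
	}
}
```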
@mkhpalm can you rerun and get a new log with the log level set to "DEBUG"?
I've been trying to test this with the latest master, and I can't repro this despite disk IO causing pretty regular leader election failures.
I think there are 2 problems going on here from what we've seen:
Did you notice the container restarting when it lost leader election? When it does crash it comes back in a blank state and starts trying to do everything any time the control plane has an issue. I assume that behavior is undesirable, right? Losing a leader election shouldn't result in a crash. (examples above, and see the sketch after this list)
The second is maybe more of a safety precaution or hardening issue, which would be less of a concern if the point above didn't happen. I think there is a linchpin @travisn pointed out here: https://github.com/rook/rook/issues/4274#issuecomment-553153087. Despite rook seeing the mons earlier on, it got an empty response and overwrote an existing cluster due to a bad response from the k8s api.
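On the first point, the crash is the standard client-go leader-election pattern used by those controllers: when the lease can't be renewed within the renew deadline (for example because the API server or etcd is slow), OnStoppedLeading fires and the process exits fatally, which is the "leaderelection lost" line in the logs above. A minimal sketch, with illustrative lock name, namespace, and timings:

```go
package example

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	"k8s.io/klog/v2"
)

// runWithLeaderElection blocks while holding the lease and calls run() as leader.
// If the lease cannot be renewed in time, OnStoppedLeading fires and the whole
// process exits, which is why the operator pod restarts and re-orchestrates.
func runWithLeaderElection(ctx context.Context, cs kubernetes.Interface, id string, run func(context.Context)) {
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "rook.io-block", Namespace: "rook-ceph"},
		Client:     cs.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: run,
			// A fatal exit here restarts the whole operator; rejoining the
			// election instead would avoid a full re-orchestration.
			OnStoppedLeading: func() { klog.Fatalf("leaderelection lost") },
		},
	})
}
```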
I believe if either of those two things behaved differently, similar stories to ours wouldn't be popping up out there in the k8s world:
https://medium.com/flant-com/rook-cluster-recovery-580efcd275db
I didn't see any root cause in that blog post, but I'm fairly certain their cluster disappeared due to a similar situation with their control plane. We've seen it happen twice now in 2 different clusters. The other cluster lost one of its etcd hosts and the same thing happened as in this ticket. I should also mention that the other cluster is intentionally chaotic for other purposes, and before the one time it overwrote its ceph cluster, it had survived etcd hosts going down many times. The risk here is that losing the leader election crashes the operator, causing it to run a complete orchestration loop any time there are control plane issues. If it didn't crash/restart, it would be MUCH less likely to run into the situation where it gets a bad response looking up the secret to see if a cluster already exists.
With the rook 1.1.7 release, I've seen rook-ceph-operator restart as well due to "leaderelection lost", and it mostly correlates with the k8s api server not responding.
In my case, the cluster is not disappearing. But some osds were restarted by the operator orchestrating.
2020-02-04 11:08:33.964808 I | op-osd: deployment for osd 1 already exists. updating if needed
2020-02-04 11:08:33.970133 I | op-mon: this is not an upgrade, not performing upgrade checks
2020-02-04 11:08:33.970158 I | op-k8sutil: updating deployment rook-ceph-osd-1
2020-02-04 11:08:40.499416 E | op-osd: failed to update osd deployment 1. failed to get deployment rook-ceph-osd-1. rpc error: code = Unavailable desc = etcdserver: leader changed
2020-02-04 11:08:40.499452 I | op-osd: started deployment for osd 1 (dir=false, type=bluestore)
I0204 11:08:47.152367 6 leaderelection.go:263] failed to renew lease rook-ceph/rook.io-block: failed to tryAcquireOrRenew context deadline exceeded
F0204 11:08:47.152412 6 controller.go:847] leaderelection lost
Is there a config parameter so that the operator can be more tolerant of temporary k8s API failures?
We need a method of artificially creating (reproducing) the API-server-not-responding scenario for developing and testing a solution to this. Does anyone have a way to reliably repro this that I can follow? I haven't had success with kubeadm clusters trying to stop the API server pods themselves.
Since last week we're facing the same problem. After an OSD crashed, the operator dies on the leader election and the pod is constantly restarted. Logs: https://gist.github.com/aberfeldy/98c5a1c38eb3485eea1925597c6c4bd0
@aberfeldy Are you seeing the rook resources (ie. pods) being deleted? Or the operator just keeps crashing? This original issue is around the operator unexpectedly removing resources, which may be different from what you are seeing.
I see some crashcollector pods being started on nodes where they shouldn't but no deletion so far. You're right, maybe I should move this to a separate issue.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.
@travisn @BlaineEXE
I wonder if this issue should stay open until some solution or workaround is found? Destroyed cluster is quite critical problem, in my opinion.
Yes, of course it is critical. Have you seen this issue or know how to repro? If the cluster ever gets destroyed unintentionally we need to get to the bottom of it, but it's still not clear if that is actually happening. Other comments in this issue are around the operator simply restarting, which isn't destructive.
@travisn
Have you seen this issue or know how to repro?
No, I came here after reading the Flant article mentioned in this commit.
It seems like as long as the hardware can handle the load, you won't be able to reproduce this. Maybe scheduling the api server pod to a resource-constrained VM running on a slow old host could help with reproducing this issue?
Unfortunately we also experienced this issue with an AKS cluster - after an AKS control plane outage, Rook recreated the cluster with new secrets, etc., and the OSDs were no longer able to join the cluster.
Not sure what triggered it exactly, but the AKS control plane was unavailable for some time - we didn't have the AKS Uptime SLA feature turned on at the time / or we might have hit some API limits.
We also saw this 'leader election lost' message when the Rook cluster fell apart and started recreating a new one:
@andrewsali Did you by chance capture the operator logs that show the new secrets being created? With the operator restarting frequently I assume they're gone, but the lack of operator logs has made it difficult to track this down.
@travisn I have exported the logs collected from the rook-ceph namespace for that period +/- 20 minutes - I can send them over by email if that works for you (don't want to post all of it here).
@andrewsali you can attach the log file to this issue.
I have updated the log files; I previously had a wrong join - this one now contains only the operator logs and should include all the operator logs from that period.
I have edited my previous post and uploaded the corrected log files (my initial export had the wrong output and didn't contain all log messages). If you need any logs other than the operator's, please let me know.
@andrewsali It appears that you are hitting #5869. To confirm if you hit this issue, you would see this message in the operator log:
op-mon: removing an extra mon. currently 2 are in quorum and only 0 are desired
I don't see that message, so perhaps it was in a log before the portion that you shared. Do you have earlier logs to search? The cluster already seems in a bad state from the logs you shared.
One of the side effects of that issue is that the nodes become unassigned from the mons, such as this. Existing mons should be mapped to a valid node name, rather than "null".
op-mon: saved mon endpoints to config map map[csi-cluster-config-json:[{"clusterID":"rook-ceph","monitors":["10.0.87.232:6789","10.0.231.111:6789"]}] data:a=10.0.87.232:6789,b=10.0.231.111:6789 mapping:{"node":{"a":null,"b":null}} maxMonId:1]
You also have only two mons so the cluster wasn't able to recover from this issue. If there had been three mons, at least two of them would have stayed in quorum. I've long thought we should prevent an even number of mons, but for now we just have a warning in the log. In the future, definitely configure your clusters with 3 mons.
mon count is even (given: 2), should be uneven, continuing
Thanks very much @travisn for taking a look. I searched earlier logs for 'op-mon', but only the following are available - I didn't see the one you mentioned unfortunately:
2020-08-10 03:29:18.935166 I | op-mon: mons running: [a b]
2020-08-10 03:29:19.519948 I | op-mon: Monitors in quorum: [a b]
2020-08-12 20:53:24.726015 W | op-mon: failed to get the list of monitor canary deployments. failed to list deployments with labelSelector app=rook-ceph-mon,mon_canary=true: etcdserver: request timed out
2020-08-13 17:15:04.239423 I | op-mon: parsing mon endpoints: a=10.0.212.95:6789,b=10.0.214.45:6789
2020-08-13 17:15:07.839596 I | op-mon: parsing mon endpoints: a=10.0.212.95:6789,b=10.0.214.45:6789
2020-08-13 17:15:27.494218 I | op-mon: parsing mon endpoints: a=10.0.212.95:6789,b=10.0.214.45:6789
2020-08-13 17:15:28.282746 I | op-mon: start running mons
2020-08-13 17:15:28.299843 I | op-mon: parsing mon endpoints: a=10.0.212.95:6789,b=10.0.214.45:6789
2020-08-13 17:15:28.330430 I | op-mon: saved mon endpoints to config map map[csi-cluster-config-json:[{"clusterID":"rook-ceph","monitors":["10.0.212.95:6789","10.0.214.45:6789"]}] data:a=10.0.212.95:6789,b=10.0.214.45:6789 mapping:{"node":{"a":null,"b":null}} maxMonId:1]
2020-08-13 17:15:29.171933 I | op-mon: targeting the mon count 2
For sure, we are going to use 3 monitors going forward - will report here if we see any similar problems.
Hi,
Seems I'm also affected by this - I had a disaster on Wednesday, yesterday, and just now. I'm running a (slightly changed: https://github.com/LittleFox94/rook/commit/b512059012b3f14522ea2fca6473781a68adfd86) 1.2.7 release, because I wasn't able to migrate my filestore OSDs yet.
Half an hour before it went down, the mgr was still running but responded only with 503 to the liveness check. The logs contained only the 503s from the liveness check mixed with the current cluster state (showing correct storage, so the OSDs and correct MONs were still up). Sadly no logs from that. I tried restarting the mgr by deleting the pod, and then this happened:
Logs until I scaled down the operator to manually recover my cluster: https://pastebin.com/YPNiXW4w, probably the interesting part:
I0829 13:49:59.309447 6 controller.go:818] Started provisioner controller rook.io/block_rook-ceph-operator-ff9b5947-h4fzv_7d8e6eb3-e9fe-11ea-9b88-0242c0a81005!
2020-08-29 13:50:21.755623 E | CmdReporter: continuing after failing delete job rook-ceph-detect-version; user may need to delete it manually. failed to remove previous provisioning job for node rook-ceph-detect-version. jobs.batch "rook-ceph-detect-version" not found
2020-08-29 13:50:21.764960 I | op-cluster: Detected ceph image version: "14.2.11-0 nautilus"
2020-08-29 13:50:21.775765 I | op-mon: parsing mon endpoints: ah=192.168.129.186:6789,af=192.168.130.70:6789,ae=192.168.134.165:6789
2020-08-29 13:50:21.775990 I | op-mon: loaded: maxMonID=33, mons=map[ae:0xc000c96be0 af:0xc000c96ba0 ah:0xc000c96b60], mapping=&{Node:map[ae:0xc000f51ce0 af:0xc000f51d10 ah:0xc000f51d40]}
2020-08-29 13:50:21.776556 I | cephconfig: writing config file /var/lib/rook/kube-system/kube-system.config
2020-08-29 13:50:21.776977 I | cephconfig: generated admin config in /var/lib/rook/kube-system
2020-08-29 13:50:21.777334 I | exec: Running command: ceph versions --connect-timeout=15 --cluster=kube-system --conf=/var/lib/rook/kube-system/kube-system.config --keyring=/var/lib/rook/kube-system/client.admin.keyring --format json --out-file /tmp/172316857
2020-08-29 13:50:22.293132 I | op-cluster: cluster "kube-system": version "14.2.11-0 nautilus" detected for image "ceph/ceph:v14.2.11"
2020-08-29 13:50:22.309413 I | op-cluster: CephCluster "kube-system" status: "Creating".
2020-08-29 13:50:22.336763 I | op-mon: start running mons
2020-08-29 13:50:22.356779 I | op-mon: saved mon endpoints to config map map[csi-cluster-config-json:[{"clusterID":"kube-system","monitors":[]}] data: mapping:{"node":{}} maxMonId:-1]
I can confirm the operator crashes when leader election fails; the pod restarted after that (and did so 4 times while I was doing my disaster recovery today):
2020-08-29 14:50:00.052851 I | op-k8sutil: updating deployment rook-ceph-rgw-external-s3-b
I0829 14:50:12.611893 6 leaderelection.go:263] failed to renew lease rook-system/ceph.rook.io-block: failed to tryAcquireOrRenew context deadline exceeded
F0829 14:50:12.611947 6 controller.go:847] leaderelection lost
I0829 14:50:12.617823 6 event.go:209] Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"rook-system", Name:"ceph.rook.io-block", UID:"b4d603f6-7376-11ea-b395-6ae4fb1356bb", APIVersion:"v1", ResourceVersion:"207481865", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' rook-ceph-operator-ff9b5947-7b6pm_63df7ed9-ea06-11ea-b024-0242c0a8100a stopped leading
My etcd cluster has some problems when I have high I/O, it's some VPS with hard disks and the same fs is used for system, docker, etcd and rook (with filestore OSDs). Not ideal, but best I currently have. I'm running 3 controller nodes with a fully replicated control plane (each having etcd, kube-apiserver, kube-scheduler and kube-controller-manager). My controller nodes are worker nodes, too (so kube-proxy and kubelet). Control plane custom deployed with ansible, running on debian 10. Currently k8s 1.19.0-rc.4, I think this problem started with upgrading to the rc release - but also have dualstack since then.
Please tell me what you may need. I will recover my cluster now and keep the operator scaled down for now; it does more harm than good currently ^^'
Disaster recovery
To recover I have to do the following steps:
data:
  csi-cluster-config-json: '[{"clusterID":"kube-system","monitors":["$monitor_ips:6789",...]}]'
  data: $monitor_name=$monitor_ip:6789,...
  mapping: '{"node":{"$monitor_name":{"Name":"$monitor_node","Hostname":"$monitor_node","Address":"$monitor_node_ip"},...}'
  maxMonId: "33" # (this is for "ah", use the highest ID in use in your cluster)
data:
  osd-dirs: '{"/var/lib/rook":$osd_id_on_this_node}' # might be filestore specific
kubectl delete deploy -l app=rook-ceph-osd
kubectl delete deploy -l app=rook-ceph-mon
kubectl delete job -l app=rook-ceph-osd-prepare
apiVersion: v1
kind: Service
metadata:
  labels:
    app: rook-ceph-mon
    ceph_daemon_id: $mon_id
    mon: $mon_id
    mon_cluster: kube-system
    rook_cluster: kube-system
  name: rook-ceph-mon-$mon_id
  namespace: kube-system
  ownerReferences:
  - apiVersion: ceph.rook.io/v1
    blockOwnerDeletion: true
    kind: CephCluster
    name: rook
    uid: $ceph_cluster_uuid # not sure how important that is
spec:
  ipFamily: IPv4
  clusterIP: $mon_ip
  ports:
  - name: msgr1
    port: 6789
    protocol: TCP
    targetPort: 6789
  - name: msgr2
    port: 3300
    protocol: TCP
    targetPort: 3300
  selector:
    app: rook-ceph-mon
    ceph_daemon_id: $mon_id
    mon: $mon_id
    mon_cluster: kube-system
    rook_cluster: kube-system
  sessionAffinity: None
  type: ClusterIP
Look at this closely; it was copy&pasted in a hurry and might have things in it specific to my environment. I just wanted to fix mine, maybe help someone, and continue watching Return of the King with my GF ^^'
Happened again: all deployments (osd, mgr, mon, rgw) and the services for the monitors were gone. The operator was not running; when starting it, it created a new cluster from scratch. Seems like the operator isn't the one removing the things. I had high load on my cluster due to a Ceph rebalance (migrated my first OSD to bluestore), so maybe etcd had a major hiccup from that.
The operator was not running; deployments, services and other resources were completely removed. I looked in the correct logs this time: the garbage collector removed them. Now tracking down why.
My problem seems to be the k8s garbage collector, probably matching issue: https://github.com/kubernetes/kubernetes/issues/88097
I disabled the GC for now and report back if the problem comes again
I ran into this issue a few weeks ago.
When the control plane becomes unresponsive for some time (can replicate via load testing with scheduleable masters...this was a test environment), Kubernetes will rebuild its internal object graph and objects like the Ceph mon, osd, mgr will get cleaned up by the GC due to the CephCluster uid (if I recall correctly) no longer matching what's given in the ownerRefs, which triggers the rebuild.
Since this environment was a bare metal platform with static addresses, I worked around it by removing the ownerRefs from the objects and setting the external toggle in CephCluster.
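For anyone wanting to apply the same workaround, here is a sketch of removing the ownerReferences with client-go (an equivalent kubectl JSON patch also works; the function name and namespace/name arguments are illustrative):

```go
package example

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// stripOwnerRefs removes the ownerReferences from a deployment so the k8s
// garbage collector no longer considers it a dependent of the CephCluster
// and will not delete it if the owner looks orphaned.
func stripOwnerRefs(ctx context.Context, cs kubernetes.Interface, ns, name string) error {
	// Note: a JSON patch "remove" fails if the object has no ownerReferences.
	patch := []byte(`[{"op":"remove","path":"/metadata/ownerReferences"}]`)
	_, err := cs.AppsV1().Deployments(ns).Patch(ctx, name, types.JSONPatchType, patch, metav1.PatchOptions{})
	return err
}
```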
I read somewhere that the controller, in this case Rook, would have to adopt the orphaned objects (before GC); but this is a bit racy as the operator could come up after objects have been GCed.
In my case the UID still matched, but with the apiserver and etcd being unstable, the GC didn't load the resources but also didn't bail out on error - at least it looked like that. Are you sure the UID didn't match, @d-luu?
@d-luu When the GC rebuilds its tree, the UUIDs will change every time and in that case the GC is expected to ignore the UUIDs that don't match. The GC will only work based on the other ownerReference attributes where the UUID changes, so there must be some other issue.
Aren't the UUIDs generated by the apiserver, and doesn't the GC only use them for building its internal tree?
I was testing in a cluster where the master node was in NotReady state for some time - I rebooted to bring it back, and also restarted the rook operator after that. It is hanging on the last log line:
2020-12-10 07:29:55.476626 I | rookcmd: starting Rook v1.5.1 with arguments '/usr/local/bin/rook ceph operator'
2020-12-10 07:29:55.476934 I | rookcmd: flag values: --add_dir_header=false, --alsologtostderr=false, --csi-cephfs-plugin-template-path=/etc/ceph-csi/cephfs/csi-cephfsplugin.yaml, --csi-cephfs-provisioner-dep-template-path=/etc/ceph-csi/cephfs/csi-cephfsplugin-provisioner-dep.yaml, --csi-rbd-plugin-template-path=/etc/ceph-csi/rbd/csi-rbdplugin.yaml, --csi-rbd-provisioner-dep-template-path=/etc/ceph-csi/rbd/csi-rbdplugin-provisioner-dep.yaml, --enable-discovery-daemon=false, --enable-flex-driver=false, --enable-machine-disruption-budget=false, --help=false, --kubeconfig=, --log-flush-frequency=5s, --log-level=INFO, --log_backtrace_at=:0, --log_dir=, --log_file=, --log_file_max_size=1800, --logtostderr=true, --master=, --mon-healthcheck-interval=45s, --mon-out-timeout=10m0s, --operator-image=, --service-account=, --skip_headers=false, --skip_log_headers=false, --stderrthreshold=2, --v=0, --vmodule=
2020-12-10 07:29:55.476947 I | cephcmd: starting Rook-Ceph operator
2020-12-10 07:29:55.687410 I | cephcmd: base ceph version inside the rook operator image is "ceph version 15.2.5 (2c93eff00150f0cc5f106a559557a58d3d7b6f1f) octopus (stable)"
2020-12-10 07:29:55.693283 I | operator: looking for secret "rook-ceph-admission-controller"
2020-12-10 07:29:55.696305 I | operator: secret "rook-ceph-admission-controller" not found. proceeding without the admission controller
2020-12-10 07:29:55.697893 I | operator: watching all namespaces for ceph cluster CRs
2020-12-10 07:29:55.700078 I | operator: setting up the controller-runtime manager
2020-12-10 07:29:55.702261 I | ceph-cluster-controller: ConfigMap "rook-ceph-operator-config" changes detected. Updating configurations
2020-12-10 07:29:56.260615 I | ceph-cluster-controller: successfully started
2020-12-10 07:29:56.266295 I | ceph-cluster-controller: enabling hotplug orchestration
2020-12-10 07:29:56.266409 I | ceph-crashcollector-controller: successfully started
2020-12-10 07:29:56.267027 I | ceph-block-pool-controller: successfully started
2020-12-10 07:29:56.267164 I | ceph-object-store-user-controller: successfully started
2020-12-10 07:29:56.267276 I | ceph-object-realm-controller: successfully started
2020-12-10 07:29:56.267399 I | ceph-object-zonegroup-controller: successfully started
2020-12-10 07:29:56.267495 I | ceph-object-zone-controller: successfully started
2020-12-10 07:29:56.267853 I | ceph-object-controller: successfully started
2020-12-10 07:29:56.267986 I | ceph-file-controller: successfully started
2020-12-10 07:29:56.268117 I | ceph-nfs-controller: successfully started
2020-12-10 07:29:56.268396 I | operator: starting the controller-runtime manager
2020-12-10 07:29:56.572527 I | ceph-cluster-controller: reconciling ceph cluster in namespace "rook-ceph"
2020-12-10 07:29:56.580249 I | op-k8sutil: ROOK_CSI_ENABLE_RBD="true" (configmap)
2020-12-10 07:29:56.589101 I | op-k8sutil: ROOK_CSI_ENABLE_CEPHFS="true" (configmap)
2020-12-10 07:29:56.594377 I | op-k8sutil: ROOK_CSI_ALLOW_UNSUPPORTED_VERSION="false" (configmap)
2020-12-10 07:29:56.596255 I | op-k8sutil: ROOK_CSI_ENABLE_GRPC_METRICS="true" (configmap)
2020-12-10 07:29:56.602286 I | op-k8sutil: ROOK_CSI_CEPH_IMAGE="quay.io/cephcsi/cephcsi:v3.1.2" (default)
2020-12-10 07:29:56.608561 I | op-k8sutil: ROOK_CSI_REGISTRAR_IMAGE="k8s.gcr.io/sig-storage/csi-node-driver-registrar:v2.0.1" (default)
2020-12-10 07:29:56.613899 I | op-k8sutil: ROOK_CSI_PROVISIONER_IMAGE="k8s.gcr.io/sig-storage/csi-provisioner:v2.0.0" (default)
2020-12-10 07:29:56.617426 I | op-k8sutil: ROOK_CSI_ATTACHER_IMAGE="k8s.gcr.io/sig-storage/csi-attacher:v3.0.0" (default)
2020-12-10 07:29:56.620742 I | op-k8sutil: ROOK_CSI_SNAPSHOTTER_IMAGE="k8s.gcr.io/sig-storage/csi-snapshotter:v3.0.0" (default)
2020-12-10 07:29:56.623337 I | op-k8sutil: ROOK_CSI_KUBELET_DIR_PATH="/var/lib/kubelet" (default)
2020-12-10 07:29:56.780412 I | op-k8sutil: ROOK_CSI_CEPHFS_POD_LABELS="" (default)
2020-12-10 07:29:56.979781 I | op-k8sutil: ROOK_CSI_RBD_POD_LABELS="" (default)
2020-12-10 07:29:57.386031 I | ceph-csi: successfully created csi config map "rook-ceph-csi-config"
2020-12-10 07:29:57.386324 I | ceph-csi: detecting the ceph csi image version for image "quay.io/cephcsi/cephcsi:v3.1.2"
2020-12-10 07:29:57.779476 I | op-k8sutil: CSI_PROVISIONER_TOLERATIONS="" (default)
2020-12-10 07:29:57.980047 I | op-mon: parsing mon endpoints: a=10.107.237.178:6789,d=10.105.199.73:6789,f=10.109.127.93:6789
2020-12-10 07:29:58.383299 I | ceph-cluster-controller: detecting the ceph image version for image ceph/ceph:v15.2.5...
2020-12-10 07:30:02.140520 I | ceph-csi: Detected ceph CSI image version: "v3.1.2"
2020-12-10 07:30:02.148123 I | op-k8sutil: CSI_FORCE_CEPHFS_KERNEL_CLIENT="true" (configmap)
2020-12-10 07:30:02.149853 I | op-k8sutil: CSI_CEPHFS_GRPC_METRICS_PORT="9091" (default)
2020-12-10 07:30:02.151772 I | op-k8sutil: CSI_CEPHFS_LIVENESS_METRICS_PORT="9081" (default)
2020-12-10 07:30:02.158089 I | op-k8sutil: CSI_RBD_GRPC_METRICS_PORT="9090" (default)
2020-12-10 07:30:02.162112 I | op-k8sutil: CSI_RBD_LIVENESS_METRICS_PORT="9080" (default)
2020-12-10 07:30:02.163769 I | op-k8sutil: CSI_PLUGIN_PRIORITY_CLASSNAME="" (default)
2020-12-10 07:30:02.166017 I | op-k8sutil: CSI_PROVISIONER_PRIORITY_CLASSNAME="" (default)
2020-12-10 07:30:02.167769 I | op-k8sutil: CSI_CEPHFS_PLUGIN_UPDATE_STRATEGY="RollingUpdate" (default)
2020-12-10 07:30:02.314579 I | op-k8sutil: CSI_RBD_PLUGIN_UPDATE_STRATEGY="RollingUpdate" (default)
2020-12-10 07:30:02.314617 I | ceph-csi: Kubernetes version is 1.19
2020-12-10 07:30:02.513633 I | op-k8sutil: ROOK_CSI_RESIZER_IMAGE="k8s.gcr.io/sig-storage/csi-resizer:v1.0.0" (default)
2020-12-10 07:30:02.727730 I | op-k8sutil: CSI_LOG_LEVEL="" (default)
2020-12-10 07:30:03.133505 I | ceph-csi: successfully started CSI Ceph RBD
2020-12-10 07:30:03.140987 I | ceph-csi: successfully started CSI CephFS driver
2020-12-10 07:30:03.315440 I | ceph-cluster-controller: detected ceph image version: "15.2.5-0 octopus"
2020-12-10 07:30:03.315481 I | ceph-cluster-controller: validating ceph version from provided image
2020-12-10 07:30:03.514113 I | op-k8sutil: CSI_PROVISIONER_TOLERATIONS="" (default)
2020-12-10 07:30:03.912858 I | op-k8sutil: CSI_PROVISIONER_NODE_AFFINITY="" (default)
2020-12-10 07:30:04.116994 I | op-mon: parsing mon endpoints: a=10.107.237.178:6789,d=10.105.199.73:6789,f=10.109.127.93:6789
2020-12-10 07:30:04.333367 I | op-k8sutil: CSI_PLUGIN_TOLERATIONS="" (default)
2020-12-10 07:30:04.515300 I | cephclient: writing config file /var/lib/rook/rook-ceph/rook-ceph.config
2020-12-10 07:30:04.515774 I | cephclient: generated admin config in /var/lib/rook/rook-ceph
2020-12-10 07:30:04.713735 I | op-k8sutil: CSI_PLUGIN_NODE_AFFINITY="" (default)
2020-12-10 07:30:04.887648 I | ceph-cluster-controller: cluster "rook-ceph": version "15.2.5-0 octopus" detected for image "ceph/ceph:v15.2.5"
2020-12-10 07:30:04.917199 I | op-k8sutil: CSI_RBD_PLUGIN_RESOURCE="" (default)
2020-12-10 07:30:05.120212 I | op-mon: start running mons
2020-12-10 07:30:05.313566 I | op-k8sutil: CSI_RBD_PROVISIONER_RESOURCE="" (default)
2020-12-10 07:30:05.913455 I | op-mon: parsing mon endpoints: a=10.107.237.178:6789,d=10.105.199.73:6789,f=10.109.127.93:6789
2020-12-10 07:30:06.722179 I | op-mon: saved mon endpoints to config map map[csi-cluster-config-json:[{"clusterID":"rook-ceph","monitors":["10.107.237.178:6789","10.105.199.73:6789","10.109.127.93:6789"]}] data:a=10.107.237.178:6789,d=10.105.199.73:6789,f=10.109.127.93:6789 mapping:{"node":{"a":{"Name":"k8s-cluster-2.novalocal","Hostname":"k8s-cluster-2.novalocal","Address":"172.16.0.4"},"b":{"Name":"k8s-cluster-4.novalocal","Hostname":"k8s-cluster-4.novalocal","Address":"172.16.0.6"},"c":{"Name":"k8s-cluster-1.novalocal","Hostname":"k8s-cluster-1.novalocal","Address":"172.16.0.3"},"d":{"Name":"k8s-cluster-4.novalocal","Hostname":"k8s-cluster-4.novalocal","Address":"172.16.0.6"},"e":{"Name":"k8s-cluster-1.novalocal","Hostname":"k8s-cluster-1.novalocal","Address":"172.16.0.3"},"f":{"Name":"k8s-cluster-1.novalocal","Hostname":"k8s-cluster-1.novalocal","Address":"172.16.0.3"}}} maxMonId:5]
2020-12-10 07:30:06.914123 I | op-k8sutil: CSI_CEPHFS_PLUGIN_RESOURCE="" (default)
2020-12-10 07:30:07.313394 I | op-k8sutil: CSI_CEPHFS_PROVISIONER_RESOURCE="" (default)
2020-12-10 07:30:07.913466 I | cephclient: writing config file /var/lib/rook/rook-ceph/rook-ceph.config
2020-12-10 07:30:07.913860 I | cephclient: generated admin config in /var/lib/rook/rook-ceph
2020-12-10 07:30:08.528228 I | ceph-csi: CSIDriver object updated for driver "rook-ceph.rbd.csi.ceph.com"
2020-12-10 07:30:08.536394 I | ceph-csi: CSIDriver object updated for driver "rook-ceph.cephfs.csi.ceph.com"
2020-12-10 07:30:09.515474 I | op-mon: targeting the mon count 3
though it is finding the three mons
alex@N-20HEPF0ZU9PR:/mnt/c/Users/acp/Documents/Coding/daas_project/dc-deployments/argocd/apps/vm-1$ kubectl -n rook-ceph get pods
NAME READY STATUS RESTARTS AGE
csi-cephfsplugin-j2fq5 3/3 Running 0 168m
csi-cephfsplugin-j95sv 3/3 Running 0 151m
csi-cephfsplugin-provisioner-7dc78747bf-4q48l 6/6 Running 12 3h26m
csi-cephfsplugin-provisioner-7dc78747bf-hznsj 6/6 Running 0 44m
csi-cephfsplugin-v5grb 3/3 Running 0 3h26m
csi-rbdplugin-dj9gb 3/3 Running 0 168m
csi-rbdplugin-ffbmf 3/3 Running 0 151m
csi-rbdplugin-pkjww 3/3 Running 0 3h26m
csi-rbdplugin-provisioner-54d48757b4-s5tfm 6/6 Running 13 3h26m
csi-rbdplugin-provisioner-54d48757b4-wh8ww 6/6 Running 0 44m
rook-ceph-crashcollector-k8s-cluster-1.novalocal-6d65fc6d4clsmn 0/1 Init:0/2 0 44m
rook-ceph-crashcollector-k8s-cluster-2.novalocal-58cbd7d5d6c8mf 0/1 Init:0/2 0 95m
rook-ceph-crashcollector-k8s-cluster-4.novalocal-749d9cfb68mrsm 0/1 Init:0/2 0 84m
rook-ceph-mon-a-99b8744cf-wglqj 1/1 Running 0 76s
rook-ceph-mon-d-85d778df5f-9xxhs 1/1 Running 0 76s
rook-ceph-mon-f-764df56958-2dn48 1/1 Running 0 75s
rook-ceph-operator-78554769bd-dgwng 1/1 Running 0 50s
rook-ceph-tools-6f44db7c58-xgpt9 1/1 Running 0 3h32m
I was testing in a cluster where the master node was in NotReady state for some time - I rebooted to bring it back, and also restarted the rook operator after that. It is hanging on the last log line.
@alexcpn If you only have three mons named a,d,f it doesn't sound like your cluster is healthy. Was the cluster in a healthy state before the master had the issue and you restarted?
From this line, it looks like the mons were already named a,d,f so at least it didn't destroy the cluster metadata as suggested could happen with this original issue. However, mons a,d,f likely means things were not previously healthy anyway.
op-mon: parsing mon endpoints: a=10.107.237.178:6789,d=10.105.199.73:6789,f=10.109.127.93:6789
@travisn why do skipped letters mean "cluster is not healthy"? I run a cluster with monitors named i, l, n because I have replaced nodes multiple times. Or does it apply particularly to the case when no osds are running as well?
@zerkms The mons can be other letters like in your example; that can be expected. But if the mons are named a,d,f, that specific case is probably an error 99% of the time. This is basically an operator bug where it tries to create all three mons instead of stopping when the first one doesn't form quorum.
I'd like to contribute my two cents. We were hit by this a month ago and unfortunately did not know about this issue at that time, and it was more important for the cluster in question to resume service than to recover the data, so we started from scratch and have no logs to contribute. What we can contribute, however, is that we are almost 100% certain this was triggered by an upgrade to the cluster through "rke up" (Rancher). We were able to upgrade from k8s 1.16 to 1.17 to 1.18 without issue; however, the issue manifested very shortly after the upgrade from 1.18 to 1.19. At the time, we only had 1 api server, 1 etcd, and 1 mon (bad idea, I know, but it was a development cluster), but 3 osds. After rebuilding the ceph cluster, we were able to upgrade from k8s 1.19 to 1.20 without issue (with 3 api servers, etcd instances, and mons this time). We don't have the hardware available to try to do a full repro, but perhaps someone else can take this information and run with it.
While we (the maintainers) are concerned about this issue, closing due to the age of the issue and lack of repro. If more details are found for this issue, we will certainly seek to address it.
First, thanks for rook!
We had an incident with our k8s control plane that cascaded in an especially bad way with rook. We thought this ticket might help identify bugs or areas to improve in rook.
Deviation from expected behavior:
Expected behavior:
Expected rook not to be orchestrating when no changes were made and expected it to be able to identify valid responses before making changes to state.
How to reproduce it (minimal and precise):
File(s) to submit:
Sequence of events assuming I didn't miss anything:
00:00:00 - a process kicks off causing high disk io on all the nodes hosting kube-etcd, causing error responses and timeouts for everything including the rook operator. This lasts for about 2 minutes.
00:00:13 - rook fails to submit its leader lease because of the instability of the kube api and crashes/exits/restarts
00:00:17 - a new rook container has started up and begins its orchestration cycle with an increased error response rate from the kube api
00:01:15 - rook seems to understand that the existing healthy mons [d e f] exist (these are upgraded mons from the msgr2 conversion months ago)
00:01:15 - rook logs an error that it failed to update mgr and starts the rook-ceph-version-detection job. The job's output has the version in it but rook doesn't seem to be getting the message
00:01:45 - kube-etcd and kube-api are completely stabilized
00:03:14 - rook logs another error: cephconfig: clusterInfo:
Existing long-running rook-ceph-operator container logs
Restarted rook-ceph-operator container logs
At this point we are in a disaster recovery situation:
https://github.com/rook/rook/blob/release-1.1/Documentation/disaster-recovery.md
Environment: