Updating from 4.9.0-0.okd-2022-02-12-140851 to 4.10.0-0.okd-2022-03-07-131213 stuck due to etcd ClusterOperator check FSyncControllerDegraded

RyuunoAelia commented 2 years ago

Describe the bug I started an update from 4.9.0-0.okd-2022-02-12-140851 to 4.10.0-0.okd-2022-03-07-131213 and after one hour the process is still stuck because the etcd operator is unhealthy with reason:

FSyncControllerDegraded: etcd disk metrics exceeded known tresholds:  fsync duration value: 3.537920

AFAIK etcd reports all times in miliseconds, so 3.5ms fsync is considered too slow for etcd to actually work? Is it possible to ignore check for fsync performance and finish the cluster upgrade?

Version UPI 4.10.0-0.okd-2022-03-07-131213 (at least the etcd cluster operator has this version, now)

How reproducible

Use OKD4.9 masters with:

CPU: Intel(R) Core(TM) i3-5010U CPU @ 2.10GHz
RAM: Kingston 9905428-435.A01LF
SSD: Crucial CT120BX500SSD1

Try to update to 4.10 and see it get stuck.

vrutkovs commented 2 years ago

The alerts return time in seconds - https://github.com/openshift/cluster-etcd-operator/blob/f3d8db9c071c979b064063e97206f24398e9a854/pkg/operator/metriccontroller/fsync_controller.go#L103-L104

I don't think you'd be able to complete the upgrade with such slow etcd

RyuunoAelia commented 2 years ago

I don't understand where this 3 seconds metric comes from because the grafana dashboard for etcd performance has nowhere near a second of fsync times displayed.

RyuunoAelia commented 2 years ago

Oh! I think I got it. During the update a lot of images are pulled and containers started on the master nodes, so it does a lot of I/Os and made the sync spike a lot. But that was only during the update. The query here will still put the Operator in unhealthy mode if there is a single value higher than 3s in the whole history of the metrics. This is overkill I think. Anyway I will drop the history of prometheus and the update should continue.

vrutkovs commented 2 years ago

Probably related to https://github.com/openshift/cluster-etcd-operator/pull/755 - Degraded condition may not be properly removed

RyuunoAelia commented 2 years ago

So there really is a bug in the OKD version available in the stable channel. I am stuck in the middle of the update. Is there an ETA for the release of the fix in the stable channel? Or a workaround I can apply to continue the update?

RyuunoAelia commented 2 years ago

Removing the data from the prometheus history and even restarting the etcd-operator does not make the condition go away...

RyuunoAelia commented 2 years ago

@vrutkovs since the bug's existence has been acknowledged "upstream", could you reopen this report at least until the bugfix is released in OKD ?

vrutkovs commented 2 years ago

No, we do not need to duplicate every bugzilla issue as OKD github issues. This tracker is for OKD-specific bugs only.

Also, we don't know if this is the same bug - there is no must-gather attached to this ticket, thus it stays closed

RyuunoAelia commented 2 years ago

Your point about the Red Hat Bugzilla is taken. This is breaking new for me. I re-read the Readme of this repository, the template of issue creation, the whole https://www.okd.io/ website to check if I missed this piece of information but could not find it. Thank you for pointing out this policy, as it makes the whole picture clearer in my mind. In particular how OCP and OKD relates to each other.

For the must-gather, I can give it to you if you want, but since I am a paranoid security analyst I don't want to make it public and expect you to share it with only people that have a reason to have access to it. I know that seems radical but I expect some information being present in the must-gather that could give away information I prefer not being known to the whole world. Things like:

hostnames/ips of machines
name of namespaces
name of service accounts maybe even usernames (probable since I expect the must-gather to contain some part of the audit log of k8s)
etc

Some would say this is not sensitive, but as it provides information that could be used for pin-point social engineering. (Yes I am paranoid).

RyuunoAelia commented 2 years ago

Here I shared a google drive folder with you (the email address in your github profile), your rights are editor so you can modify the content, and you can edit the sharing to add more people if needed: https://drive.google.com/drive/folders/1xQtUF0I9r81bzjRU0wj6GTCExSHvAX45?usp=sharing

vrutkovs commented 2 years ago

I re-read the Readme of this repository, the template of issue creation, the whole okd.io website to check if I missed this piece of information but could not find it

Right, this is complicated as we never know if the issue is caused by FCOS (thus OKD-specific) or its reproducible in OCP (and needs to be taken care of in bugzilla), so we have to decide this in Github issues.

I can give it to you if you want

Unfortunately we have to do view publicly. I'm not the sole reviewer and OKD team consists of volunteers, so in order to keep in line with GDPR and CCSP we can't provide a secure way to handle user must-gathers.

RyuunoAelia commented 2 years ago

Is my previous link not enough to review the must-gather ? Any more "public" than that is "pretty crazy" in my paranoid mind.

RyuunoAelia commented 2 years ago

For the record, I found out the "hoops" that are mentioned in the bugzilla. The steps to recover from this state without having the patched version of the etcd-operator are as follow:

destroy the prometheus history, so that there is no trace of the fsync time peak
save the following objects from the API with oc get -o yaml clusterrolebinding/system:openshift:operator:etcd-operator clusteroperator/etcd etcd.operator.openshift.io/cluster > save.yaml
edit the saved yaml to remove the status field
delete the clusterrolebinding: oc delete clusterrolebinding/system:openshift:operator:etcd-operator so that the operator will not be able to keep the current state
delete the operator pod oc delete pod -n openshift-etcd-operator --all
delete the ClusterOperator and the Operator config: oc delete clusteroperator/etcd etcd.operator.openshift.io/cluster
restore the objects without the status fields: oc create -f save.yaml (this will remove the status field)
the operator pod in openshift-etcd-operator will start but will fail to get the lock during a long time, you can see in the logs, and after that everything will come back and the update will be performed normally

achilles-git commented 2 years ago

@RyuunoAelia did the upgrade successfully complete for you ???

RyuunoAelia commented 2 years ago

Yes the upgrade was completed successfully. Because my hardware is performant enough, the fsync time spike was only temporary.

okd-project / okd

Updating from 4.9.0-0.okd-2022-02-12-140851 to 4.10.0-0.okd-2022-03-07-131213 stuck due to etcd ClusterOperator check FSyncControllerDegraded #1145