The alerts return time in seconds - https://github.com/openshift/cluster-etcd-operator/blob/f3d8db9c071c979b064063e97206f24398e9a854/pkg/operator/metriccontroller/fsync_controller.go#L103-L104
I don't think you'd be able to complete the upgrade with such slow etcd
I don't understand where this 3-second value comes from, because the Grafana dashboard for etcd performance shows fsync times nowhere near a second.
Oh! I think I got it. During the update a lot of images are pulled and containers are started on the master nodes, so there is a lot of I/O, which made the fsync times spike. But that happened only during the update. The query here will still put the operator in unhealthy mode if there is a single value higher than 3s anywhere in the history of the metrics, which I think is overkill. Anyway, I will drop the Prometheus history and the update should continue.
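For reference, the exact expression the operator evaluates lives in the fsync_controller.go linked above; the sketch below is only a rough approximation for spotting historical fsync spikes yourself. It assumes the standard etcd WAL fsync histogram metric, the thanos-querier route in openshift-monitoring, a token login, and an arbitrary 7-day lookback:

# Rough approximation, not the operator's exact query (see fsync_controller.go above).
# Returns any point in the last 7 days where the 99th-percentile WAL fsync duration exceeded 3s.
HOST=$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')
TOKEN=$(oc whoami -t)   # assumes you are logged in with a token
curl -skG "https://${HOST}/api/v1/query" \
  -H "Authorization: Bearer ${TOKEN}" \
  --data-urlencode 'query=max_over_time(histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))[7d:5m]) > 3'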
Probably related to https://github.com/openshift/cluster-etcd-operator/pull/755 - Degraded condition may not be properly removed
So there really is a bug in the OKD version available in the stable channel. I am stuck in the middle of the update. Is there an ETA for the release of the fix in the stable channel? Or a workaround I can apply to continue the update?
Removing the data from the Prometheus history and even restarting the etcd-operator does not make the condition go away...
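For reference, something along these lines can be used to see the stuck condition and restart the operator (plain oc commands; the exact condition types and messages may differ on your cluster):

oc get clusteroperator etcd
# dump every condition with its message to see which one stays set
oc get clusteroperator etcd -o jsonpath='{range .status.conditions[*]}{.type}={.status}: {.message}{"\n"}{end}'
# restart the operator pod
oc delete pod -n openshift-etcd-operator --all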
@vrutkovs since the bug's existence has been acknowledged "upstream", could you reopen this report at least until the bugfix is released in OKD?
No, we do not need to duplicate every Bugzilla issue as an OKD GitHub issue. This tracker is for OKD-specific bugs only.
Also, we don't know if this is the same bug - there is no must-gather attached to this ticket, thus it stays closed
Your point about the Red Hat Bugzilla is taken. This is breaking news for me. I re-read the Readme of this repository, the template of issue creation, the whole https://www.okd.io/ website to check if I missed this piece of information but could not find it. Thank you for pointing out this policy, as it makes the whole picture clearer in my mind, in particular how OCP and OKD relate to each other.
For the must-gather, I can give it to you if you want, but since I am a paranoid security analyst I don't want to make it public, and I expect you to share it only with people who have a reason to access it. I know that seems radical, but I expect some information to be present in the must-gather that could give away details I would prefer not to be known to the whole world. Things like:
Some would say this is not sensitive, but it provides information that could be used for pinpoint social engineering. (Yes, I am paranoid.)
I have shared a Google Drive folder with you (the email address in your GitHub profile); your rights are editor so you can modify the content, and you can edit the sharing to add more people if needed: https://drive.google.com/drive/folders/1xQtUF0I9r81bzjRU0wj6GTCExSHvAX45?usp=sharing
I re-read the Readme of this repository, the template of issue creation, the whole okd.io website to check if I missed this piece of information but could not find it
Right, this is complicated, as we never know if the issue is caused by FCOS (thus OKD-specific) or if it's reproducible in OCP (and needs to be taken care of in Bugzilla), so we have to decide this in GitHub issues.
I can give it to you if you want
Unfortunately we have to do the review publicly. I'm not the sole reviewer and the OKD team consists of volunteers, so in order to keep in line with GDPR and CCSP we can't provide a secure way to handle user must-gathers.
Is my previous link not enough to review the must-gather? Anything more "public" than that is "pretty crazy" in my paranoid mind.
For the record, I found out the "hoops" that are mentioned in the Bugzilla. The steps to recover from this state without having the patched version of the etcd-operator are as follows:
1. oc get -o yaml clusterrolebinding/system:openshift:operator:etcd-operator clusteroperator/etcd etcd.operator.openshift.io/cluster > save.yaml, and remove the status field from the saved manifests
2. oc delete clusterrolebinding/system:openshift:operator:etcd-operator so that the operator will not be able to keep the current state
3. oc delete pod -n openshift-etcd-operator --all
4. oc delete clusteroperator/etcd etcd.operator.openshift.io/cluster
5. oc create -f save.yaml (this will remove the status field)

@RyuunoAelia did the upgrade complete successfully for you?
Yes, the upgrade completed successfully. Because my hardware is performant enough, the fsync spike was only temporary.
Describe the bug
I started an update from 4.9.0-0.okd-2022-02-12-140851 to 4.10.0-0.okd-2022-03-07-131213 and after one hour the process is still stuck because the etcd operator is unhealthy with reason:
AFAIK etcd reports all times in milliseconds, so is a 3.5 ms fsync considered too slow for etcd to actually work? Is it possible to ignore the fsync performance check and finish the cluster upgrade?
Version
UPI 4.10.0-0.okd-2022-03-07-131213 (at least the etcd cluster operator has this version now)
How reproducible
Use OKD 4.9 masters with:
Try to update to 4.10 and see it get stuck.
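A minimal way to observe the stuck state (just a sketch with standard oc commands, assuming cluster-admin access and that the operator deployment is named etcd-operator; the grep filter is only illustrative):

oc adm upgrade                    # the update hangs waiting on the etcd cluster operator
oc get clusteroperator etcd       # shows the operator stuck in a degraded/unhealthy state
oc logs -n openshift-etcd-operator deployment/etcd-operator | grep -i fsync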