lance5890 closed this 7 months ago
I think the race condition is absolutely valid @lance5890 - I'd put it into our bug backlog. The PR wasn't entirely wrong either, I would need to check whether we mutate some state in between that we won't get back after and update accordingly.
I think the next three weeks might be busier for us because of the upcoming 4.16 feature freeze, but after that we will definitely take a look at it.
Here's the bug ticket for further reference :) https://issues.redhat.com/browse/OCPBUGS-31849
After thinking about it for a while, I don't think this PR addresses the issue completely. I have prepared another PR that I think will fix it completely, and it adds and modifies the unit tests at the same time. Maybe tomorrow.
@lance5890 I think I found a more reliable way to tackle this, directly at the root of the static pod installation controller: https://issues.redhat.com/browse/ETCD-612
So that way we don't have to continue to spread this check everywhere.
Yep, it is better to stop a new installer pod when the quorum check fails.
We also found that etcd occasionally rolls out when we remove one master from a 3-master cluster, and the logs show as follows:
The EtcdCertSignerController checks the quorum at the beginning, so a change in quorum may be missed by the time it applies the secret,
and the etcdendpointscontroller checks the quorum before updating the ConfigMap: https://github.com/openshift/cluster-etcd-operator/blob/3d5483e1871ba147a692736c71645175a85769c4/pkg/operator/etcdendpointscontroller/etcdendpointscontroller.go#L146-L153
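To illustrate why the position of the check matters, here is a toy sketch of the check-then-act race. It is not the operator's code: the `members` type, the `canLoseOne` rule, and both sync functions are assumptions made up for this example. Membership changes while the sync is doing its work, so a check done at the start is stale by the time of the write.

```go
package main

import "fmt"

// members is a toy stand-in for etcd membership (hypothetical).
type members struct{ healthy, total int }

// canLoseOne reports whether quorum survives losing one more healthy member.
// This rule is an assumption of the sketch, not the operator's actual check.
func (m *members) canLoseOne() bool {
	return m.healthy-1 >= m.total/2+1
}

// checkThenWork checks quorum first, does its work, then writes.
// The check result can be stale by the time apply runs.
func checkThenWork(m *members, work, apply func()) bool {
	if !m.canLoseOne() {
		return false
	}
	work()
	apply()
	return true
}

// workThenCheck does its work first and checks quorum right before writing.
func workThenCheck(m *members, work, apply func()) bool {
	work()
	if !m.canLoseOne() {
		return false
	}
	apply()
	return true
}

func main() {
	apply := func() {}

	// A member becomes unhealthy while the sync is doing its work.
	a := &members{healthy: 3, total: 3}
	fmt.Println("early check applied:", checkThenWork(a, func() { a.healthy = 2 }, apply))

	b := &members{healthy: 3, total: 3}
	fmt.Println("late check applied:", workThenCheck(b, func() { b.healthy = 2 }, apply))
}
```

Checking right before the write only narrows the window; it does not close it. That is one reason a single gate at the installer level is attractive.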
I found this issue in OCP 4.12, but I think all related versions have it.