openkruise / kruise

Automated management of large-scale applications on Kubernetes (incubating project under CNCF)
https://openkruise.io

[BUG] cloneset-controller MAYBE stuck forever because of Event missing #1785

Open Spground opened 6 days ago

Spground commented 6 days ago

What happened:

cloneset-controller gets stuck in reconcile, waiting for ScaleExpectations to be satisfied.

What you expected to happen:

cloneset-controller should never get stuck; it should continue to reconcile once ScaleExpectations time out.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:


t1: client-go@kruise-manager watch rv=1

t2: cloneset-controller@kruise-manager creates pod A (rv=100) and records an expectation for pod A's Create Event using scaleExpectations.ExpectScale(podA)

t3: tons of watch Events arrive; the APIServer watch cache range is now [rv=100, rv=1000]

t4: pod A deleted by others from etcd

t5: the watch connection to the APIServer breaks at some point; client-go@kruise-manager re-watches with rv=1 and receives "too old resource version" because of slow watch Event handling (Events may be produced too fast or consumed too slowly), then re-lists Pods

t6: after re-listing Pods, cloneset-controller@kruise-manager will never receive pod A's Create and Delete Events, because pod A was already deleted at t4

t7: cloneset-controller@kruise-manager stays stuck forever until it is restarted

How to fix


    if scaleSatisfied, unsatisfiedDuration, scaleDirtyPods := clonesetutils.ScaleExpectations.SatisfiedExpectations(request.String()); !scaleSatisfied {
        if unsatisfiedDuration >= expectations.ExpectationTimeout {
            klog.Warningf("Expectation unsatisfied overtime for %v, scaleDirtyPods=%v, overtime=%v", request.String(), scaleDirtyPods, unsatisfiedDuration)
            // The expectation should be deleted on timeout; otherwise a missed
            // Event leaves it unsatisfied on every later reconcile.
            clonesetutils.ScaleExpectations.DeleteExpectations(request.String())
            return reconcile.Result{RequeueAfter: 10 * time.Second}, nil
        }
        klog.V(4).Infof("Not satisfied scale for %v, scaleDirtyPods=%v", request.String(), scaleDirtyPods)
        return reconcile.Result{RequeueAfter: expectations.ExpectationTimeout - unsatisfiedDuration}, nil
    }

Environment:

furykerry commented 6 days ago

Possibly a duplicate of https://github.com/openkruise/kruise/issues/1765