k8snetworkplumbingwg / sriov-network-operator

Operator for provisioning and configuring SR-IOV CNI plugin and device plugin
Apache License 2.0

If pod eviction or deletion fails during the node drain operation, the entire node is left SchedulingDisabled #429

Closed · lizhewei91 closed this issue 6 days ago

lizhewei91 commented 1 year ago

question: If pod eviction or deletion fails during the node drain operation, the entire node is left SchedulingDisabled. If the operation keeps failing, the whole node stays affected.

query: Why does the daemon need to run RunCordonOrUncordon and set the node to SchedulingDisabled before executing RunNodeDrain?

What problem would occur if a pod were scheduled onto this node and consumed a VF while the drain is in progress?

propose: Could this be changed so that the cordon operation is not performed before the drain operation?

https://github.com/k8snetworkplumbingwg/sriov-network-operator/blob/815fd134ba8000756791051fca60179ec66ddb46/pkg/daemon/daemon.go#L877

if err = wait.ExponentialBackoff(backoff, func() (bool, error) {
    err := drain.RunCordonOrUncordon(dn.drainer, dn.node, true)
    if err != nil {
        lastErr = err
        glog.Infof("Cordon failed with: %v, retrying", err)
        return false, nil
    }
    err = drain.RunNodeDrain(dn.drainer, dn.name)
    if err == nil {
        return true, nil
    }
    lastErr = err
    glog.Infof("Draining failed with: %v, retrying", err)
    return false, nil
}); err != nil {
    if err == wait.ErrWaitTimeout {
        glog.Errorf("drainNode(): failed to drain node (%d tries): %v :%v", backoff.Steps, err, lastErr)
    }
    glog.Errorf("drainNode(): failed to drain node: %v", err)
    return err
}
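
For context, the retry cadence of this loop comes from a wait.Backoff built elsewhere in the daemon. The values below are only a sketch, not the operator's actual configuration: only Steps is confirmed by the "(5 tries)" message in the log, and each attempt additionally runs up to the 1m30s eviction timeout visible below, which is why the node can stay cordoned for many minutes.

package main

import (
    "time"

    "k8s.io/apimachinery/pkg/util/wait"
)

// Sketch of a backoff comparable to the one driving the loop above.
// Steps matches the "(5 tries)" in the error log; Duration and Factor are
// assumed illustrative values, not taken from daemon.go.
var drainBackoff = wait.Backoff{
    Steps:    5,                // number of drain attempts before giving up
    Duration: 10 * time.Second, // initial wait between attempts (assumed)
    Factor:   2,                // each retry waits twice as long (assumed)
}

func main() {
    _ = drainBackoff
}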

Detailed log information:

I0410 03:48:12.978145    8272 daemon.go:479] nodeStateSyncHandler(): plugin generic_plugin: reqDrain true, reqReboot false
I0410 03:48:12.978158    8272 daemon.go:483] nodeStateSyncHandler(): reqDrain true, reqReboot false disableDrain false
I0410 03:48:12.980314    8272 daemon.go:519] nodeStateSyncHandler(): drain node
I0410 03:48:12.980325    8272 daemon.go:896] drainNode(): Update prepared
I0410 03:48:12.980332    8272 daemon.go:906] drainNode(): Start draining
I0410 03:49:44.533318    8272 daemon.go:919] Draining failed with: error when evicting pods/"dev-sn-platform-broker-0" -n "imp-dev-pulsar": global timeout reached: 1m30s, retrying
I0410 03:51:26.597250    8272 daemon.go:919] Draining failed with: error when evicting pods/"dev-sn-platform-broker-0" -n "imp-dev-pulsar": global timeout reached: 1m30s, retrying
I0410 03:53:18.407265    8272 daemon.go:919] Draining failed with: error when evicting pods/"dev-sn-platform-broker-0" -n "imp-dev-pulsar": global timeout reached: 1m30s, retrying
I0410 03:55:30.406776    8272 daemon.go:919] Draining failed with: error when evicting pods/"dev-sn-platform-broker-0" -n "imp-dev-pulsar": global timeout reached: 1m30s, retrying
I0410 03:58:22.012313    8272 daemon.go:919] Draining failed with: error when evicting pods/"dev-sn-platform-broker-0" -n "imp-dev-pulsar": global timeout reached: 1m30s, retrying
E0410 03:58:22.012329    8272 daemon.go:923] drainNode(): failed to drain node (5 tries): timed out waiting for the condition :error when evicting pods/"dev-sn-platform-broker-0" -n "imp-dev-pulsar": global timeout reached: 1m30s
E0410 03:58:22.012340    8272 daemon.go:925] drainNode(): failed to drain node: timed out waiting for the condition

lizhewei91 commented 1 year ago

@sosiouxme @booxter @s1061123 @fedepaol I have a question; could you please help answer it? Thank you.

SchSeba commented 1 year ago

Hi @lizhewei91, the reason is to make sure no new workloads get scheduled on the node while we are doing the configuration.

You can hit many race conditions if you just remove the pods. For example, with a deployment that can only run on that node, if you don't cordon the node the workload will come back right after the operator removes it.
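
For reference, "cordon" and the SchedulingDisabled state in the logs above are the same thing: the node's spec.Unschedulable field. A minimal client-go sketch of what cordoning amounts to (the clientset construction and node name here are assumptions for illustration, not code from the operator):

package main

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

// cordonNode marks a node unschedulable, which is all that "cordon" /
// SchedulingDisabled means: the scheduler stops placing new pods on the
// node, while already-running pods are left untouched.
func cordonNode(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
    node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
    if err != nil {
        return err
    }
    node.Spec.Unschedulable = true
    _, err = cs.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
    return err
}

func main() {
    cfg, err := rest.InClusterConfig() // assumes running inside the cluster
    if err != nil {
        panic(err)
    }
    cs := kubernetes.NewForConfigOrDie(cfg)
    if err := cordonNode(context.TODO(), cs, "worker-0"); err != nil { // node name is an example
        panic(err)
    }
}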

lizhewei91 commented 1 year ago

Hi @SchSeba, I get it. Given that only this node can run the workload, if the cordon operation is not performed, the pod will be recreated on the node by its workload after the operator evicts or deletes it.

The scenario I am considering now is this: if a pod fails to be evicted or deleted, the operator keeps retrying and the node stays SchedulingDisabled, which affects other newly created services. That is a serious impact in a production environment. So I am wondering whether an optimization could be made on the scheduling side: if a newly created or rebuilt pod does not consume VF NICs, it can still be scheduled to this node; if it does need VF NICs, it should not be scheduled to this node during the scheduling phase. That way the normal use of the node is not affected.
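
As a rough sketch of the idea in the comment above (this is only an illustration, not something the operator does): a pod that consumes a VF already declares it as an extended resource request, so a scheduler-level distinction between VF consumers and ordinary pods is expressible without cordoning the whole node. The resource name, image, and pod name below are placeholders.

package main

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/resource"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// vfPod sketches a pod that requests one SR-IOV VF as an extended resource.
// "example.com/sriov_vf" is a placeholder; the real name comes from the
// SriovNetworkNodePolicy resourceName. The scheduler only places such a pod
// on nodes that currently advertise the resource, while pods without the
// request are unaffected -- the kind of distinction the comment above asks
// for instead of cordoning the whole node.
func vfPod() *corev1.Pod {
    one := resource.MustParse("1")
    return &corev1.Pod{
        ObjectMeta: metav1.ObjectMeta{Name: "vf-consumer"},
        Spec: corev1.PodSpec{
            Containers: []corev1.Container{{
                Name:  "app",
                Image: "registry.example.com/app:latest", // placeholder image
                Resources: corev1.ResourceRequirements{
                    Requests: corev1.ResourceList{"example.com/sriov_vf": one},
                    Limits:   corev1.ResourceList{"example.com/sriov_vf": one},
                },
            }},
        },
    }
}

func main() {
    fmt.Println(vfPod().Name)
}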

SchSeba commented 8 months ago

Hi, please take a look at the new draining capabilities we are working on: https://github.com/k8snetworkplumbingwg/sriov-network-operator/pull/555

Be aware that if the policy doesn't work, the node will continue to retry, but you can always remove the policy and the daemon will restore the node to Ready.

SchSeba commented 6 days ago

No update for a long time, closing this issue.