Closed lizhewei91 closed 6 days ago
@sosiouxme @booxter @s1061123 @fedepaol I have a question, could you please help answer it? Thank you.
Hi @lizhewei91, the reason is to make sure no new workloads get allocated on the node while we are doing the configuration.
You can run into many race conditions if you just remove the pods. Take, for example, a Deployment that can only run on that node: if you don't cordon the node, the workload will return right after the operator removes it.
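For context, here is a minimal sketch of that cordon-then-drain sequence using the upstream `k8s.io/kubectl/pkg/drain` helpers (`RunCordonOrUncordon`, `RunNodeDrain`) that the daemon builds on. The clientset wiring, timeout, and function names here are placeholders for illustration, not the operator's actual code or settings:

```go
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/kubectl/pkg/drain"
)

func cordonAndDrain(ctx context.Context, clientset kubernetes.Interface, nodeName string) error {
	node, err := clientset.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}

	drainer := &drain.Helper{
		Ctx:                 ctx,
		Client:              clientset,
		Force:               true,
		IgnoreAllDaemonSets: true,
		GracePeriodSeconds:  -1,
		Timeout:             90 * time.Second, // illustrative value
		Out:                 os.Stdout,
		ErrOut:              os.Stderr,
	}

	// Step 1: cordon (mark unschedulable) so the scheduler cannot place new
	// pods, including VF-consuming ones, on the node while it is reconfigured.
	if err := drain.RunCordonOrUncordon(drainer, node, true); err != nil {
		return err
	}

	// Step 2: evict/delete the existing workloads. Without the cordon above,
	// controllers (e.g. a Deployment pinned to this node) could immediately
	// reschedule the evicted pods back onto it, racing with the reconfiguration.
	return drain.RunNodeDrain(drainer, nodeName)
}
```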
Hi @SchSeba, I get it. Given that only this node is available, if the cordon operation is not performed, the pod will be restored by its workload controller after the operator evicts or deletes it.

Now I am considering another scenario: if a pod fails to be evicted or deleted by the operator, the operator keeps retrying and the node stays in the SchedulingDisabled state, which affects other newly created services. That is a serious impact in a production environment.

Therefore, I am wondering whether an optimization could be made at the scheduling level: if a newly created or rebuilt pod does not consume VF NICs, it can still be scheduled to this node; if a newly created or rebuilt pod does need VFs, it should not be scheduled to this node during the scheduling phase. In this way, normal use of the node is not affected.
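For illustration only, here is a rough sketch (not existing operator code) of the kind of check such a scheduling optimization would need: deciding whether a pod actually requests an SR-IOV VF extended resource. The function name and the resource prefix are assumptions for the example; in practice the resource name is derived from the policy's `resourceName`:

```go
package main

import (
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// podRequestsVF reports whether any container in the pod requests an extended
// resource whose name starts with the given SR-IOV resource prefix
// (e.g. "openshift.io/" as a hypothetical example).
func podRequestsVF(pod *corev1.Pod, resourcePrefix string) bool {
	for _, c := range pod.Spec.Containers {
		for name, quantity := range c.Resources.Requests {
			if strings.HasPrefix(string(name), resourcePrefix) && !quantity.IsZero() {
				return true
			}
		}
	}
	return false
}
```

Only pods for which such a check returns true would need to be kept off the node during reconfiguration; pods without VF requests could in principle still be scheduled there.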
Hi, please take a look at the new draining capabilities we are working on: https://github.com/k8snetworkplumbingwg/sriov-network-operator/pull/555
Be aware that if the policy doesn't work, the node will keep retrying, but you can always remove the policy and the daemon will restore the node to Ready.
No update for a long time, closing this issue.
question: If pod eviction or deletion fails during the node drain operation, the entire node stays schedulingDisabled. If the operation continues to fail, the entire node will be affected (see the sketch after the code link below).
query: Why do I need to run RunCordonOrUncordon and set the node to schedulingDisabled before executing RunNodeDrain?
What problem will occur if a pod is scheduled onto this node and consumes a VF during the drain?
propose: Could it be changed so that the cordon operation is not performed before the drain operation?
https://github.com/k8snetworkplumbingwg/sriov-network-operator/blob/815fd134ba8000756791051fca60179ec66ddb46/pkg/daemon/daemon.go#L877
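On the concern about the node staying schedulingDisabled when eviction keeps failing: the upstream drain helper exposes a `Timeout`, and the cordon itself is reversible. Below is a minimal, illustrative sketch (not the operator's actual error handling) of bounding the drain and uncordoning on failure, assuming a `drainer` configured as in the earlier sketch:

```go
package main

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/kubectl/pkg/drain"
)

// drainWithRollback bounds the drain with a timeout and uncordons the node if
// the drain fails, so the node does not stay schedulingDisabled indefinitely.
func drainWithRollback(drainer *drain.Helper, node *corev1.Node) error {
	drainer.Timeout = 5 * time.Minute // illustrative bound on the drain attempt

	if err := drain.RunNodeDrain(drainer, node.Name); err != nil {
		// Roll back the cordon so the node becomes schedulable again; the
		// drain can be retried later, or the policy removed as noted above.
		if uncordonErr := drain.RunCordonOrUncordon(drainer, node, false); uncordonErr != nil {
			return uncordonErr
		}
		return err
	}
	return nil
}
```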
Detailed log information: