project-codeflare / multi-cluster-app-dispatcher

Holistic job manager on Kubernetes
Apache License 2.0

MCAD doesn't take into account the node tainting #512

Open kpouget opened 1 year ago

kpouget commented 1 year ago

As part of my Codeflare/MCAD test automation, I observed the following behavior:

mnisttest-user1-head-v9hbd                                        0/1     Pending     0              8m4s    <none>         <none>                         <none>           <none>
mnisttest-user1-worker-small-group-mnisttest-user1-hmt8p          0/1     Pending     0              8m4s    <none>         <none>                         <none>           <none>
mnisttest-user1-worker-small-group-mnisttest-user1-jhqkm          0/1     Pending     0              8m4s    <none>         <none>                         <none>           <none>

My understanding is that MCAD did not take into account the taint on one of the nodes (1 node(s) had untolerated taint {only-test-pods: yes}) when it decided that the AppWrapper would fit in the cluster. As a result, the AppWrapper was dispatched, but the only node actually available for this workload did not have enough CPU to host the RayCluster (1 Insufficient cpu), so the pods stayed Pending.
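To make the suspected behavior concrete, here is a minimal Python sketch contrasting a capacity check that sums free CPU over all nodes with one that first drops nodes carrying an untolerated taint. This is not MCAD's actual code: the names (Node, Taint, fits_naive, fits_taint_aware), the CPU numbers, and the simplified exact-match toleration rule are all assumptions made for illustration. Only the taint key and value come from the scheduler message above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str = "NoSchedule"


@dataclass
class Node:
    name: str
    free_cpu_millis: int
    taints: list = field(default_factory=list)


def tolerates(tolerations, taint):
    """Simplified match: a toleration covers a taint only if key, value and effect are equal."""
    return any(
        t.key == taint.key and t.value == taint.value and t.effect == taint.effect
        for t in tolerations
    )


def schedulable_nodes(nodes, tolerations):
    """Keep only the nodes whose taints are all tolerated by the workload."""
    return [n for n in nodes if all(tolerates(tolerations, t) for t in n.taints)]


def fits_naive(nodes, pod_cpu_requests):
    """Aggregate check that ignores taints: total request vs. total free CPU."""
    return sum(pod_cpu_requests) <= sum(n.free_cpu_millis for n in nodes)


def fits_taint_aware(nodes, pod_cpu_requests, tolerations=()):
    """First-fit-decreasing placement over tolerated nodes only (a heuristic, not an exact solver)."""
    free = sorted(
        (n.free_cpu_millis for n in schedulable_nodes(nodes, tolerations)),
        reverse=True,
    )
    for request in sorted(pod_cpu_requests, reverse=True):
        for i, capacity in enumerate(free):
            if request <= capacity:
                free[i] -= request
                break
        else:
            return False  # no tolerated node can host this pod
    return True


# Hypothetical cluster: one large tainted node, one small untainted node.
test_taint = Taint("only-test-pods", "yes")
nodes = [
    Node("tainted-node", free_cpu_millis=8000, taints=[test_taint]),
    Node("open-node", free_cpu_millis=1000),
]
# Hypothetical CPU requests for one head pod and two worker pods.
requests = [2000, 1000, 1000]

naive_result = fits_naive(nodes, requests)          # counts the tainted node's CPU
aware_result = fits_taint_aware(nodes, requests)    # only "open-node" is usable
```

With these made-up numbers the naive check accepts the workload because it counts the CPU of the tainted node, while the taint-aware check rejects it, which matches the Pending pods reported above.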


kpouget commented 1 year ago

AppWrapper state when the RayCluster is pending: appwrapper.yaml.log (from another identical test run)