@schwesig we should also have kruize check their clusterPolicy for errors
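Not from the thread itself, just a minimal sketch of what that check could look like, assuming the ClusterPolicy instance is named gpu-cluster-policy (the name used in the OpenShift docs) and reports its state in .status.state:
- oc get clusterpolicy
- oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.status.state}{"\n"}'
- oc describe clusterpolicy gpu-cluster-policy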
connected to this
@schwesig you might check if the nvidia-operator-validator pods in the nvidia-gpu-operator namespace are failing to start with errors in the plugin-validation container, to confirm it's the same problem as above.
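A rough way to run that check, assuming the validator pods carry the default label app=nvidia-operator-validator and the init container is named plugin-validation:
- oc get pods -n nvidia-gpu-operator -l app=nvidia-operator-validator
- oc logs -n nvidia-gpu-operator -l app=nvidia-operator-validator -c plugin-validation --tail=50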
@schwesig If my understanding is correct, the node got restarted and the config we added to the default MIG config map got deleted (I expect the NVIDIA GPU operator rewrote the config map with its defaults).
Node wrk-5 has the mig.config label set to the custom kruize config (which is no longer present because of the rewrite), so ideally the MIG config manager should fall back to the default setting (all-disabled) when the desired config is missing from the config map. But for some reason that hasn't happened.
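A quick sketch of how to confirm this, assuming the custom profile was added to the default-mig-parted-config ConfigMap in the nvidia-gpu-operator namespace:
- oc get node wrk-5 -o yaml | grep mig.config
- oc get configmap default-mig-parted-config -n nvidia-gpu-operator -o yaml | grep kruize-mig-config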
@computate
time="2024-10-25T08:27:35Z" level=info msg="version: 0fe1e8db, commit: 0fe1e8d"
time="2024-10-25T08:27:36Z" level=info msg="pod nvidia-cuda-validator-46k29 is curently in Pending phase"
time="2024-10-25T08:27:41Z" level=info msg="pod nvidia-cuda-validator-46k29 is curently in Pending phase"
time="2024-10-25T08:27:46Z" level=info msg="pod nvidia-cuda-validator-46k29 is curently in Pending phase"
time="2024-10-25T08:27:51Z" level=info msg="pod nvidia-cuda-validator-46k29 is curently in Pending phase"
time="2024-10-25T08:27:56Z" level=info msg="pod nvidia-cuda-validator-46k29 is curently in Pending phase"
FYI:
Tried the node reboot procedure from https://docs.openshift.com/container-platform/4.15/nodes/nodes/nodes-nodes-rebooting.html; not successful, it didn't solve the problem yet:
- oc adm cordon wrk-5
- oc adm drain wrk-5 --ignore-daemonsets --delete-emptydir-data --force
- oc adm drain wrk-5 --ignore-daemonsets --delete-emptydir-data --force --disable-eviction
- oc debug node/wrk-5
- chroot /host
- systemctl reboot
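For completeness, after the drain and reboot the node also has to be uncordoned again before it accepts workloads; a minimal sketch:
- oc adm uncordon wrk-5
- oc get node wrk-5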
checks from today
wrk-5 yaml: nvidia.com/mig.config: all-disabled (line 32)
There were repeated attempts to apply the kruize-mig-config profile, which failed because that profile was missing from the config map. The system logged "selected mig-config not present", leading to the nvidia.com/mig.config.state label being set to failed.
The configuration was changed to all-2g.10gb, enabling multiple instances on GPUs. The configuration change involved shutting down GPU clients and resetting all GPUs, resulting in a successful application (nvidia.com/mig.config.state=success).
A reversion to the all-disabled configuration succeeded after shutting down relevant clients, applying the changes, and setting the mig.config.state back to success.
Unsliced GPUs can be selected again; the sliced one cannot.
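Not part of the original checks, but a rough way to see which GPU resources the node actually advertises after each change (whole GPUs vs. MIG slices), plus the current state label:
- oc describe node wrk-5 | grep -E 'nvidia.com/(gpu|mig)'
- oc get node wrk-5 -o yaml | grep mig.config.state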
Thanks for going through it this morning. That at least worked as a workaround.
Adding the link we used with Cristiano to set it back to "default": https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/mig-ocp.html
- oc get node wrk-5 -o yaml
- MIG_CONFIGURATION=all-2g.10gb && oc label node/wrk-5 nvidia.com/mig.config=$MIG_CONFIGURATION --overwrite
- oc get node wrk-5 -o yaml
- MIG_CONFIGURATION=kruize-mig-config && oc label node/wrk-5 nvidia.com/mig.config=$MIG_CONFIGURATION --overwrite
- oc get node wrk-5 -o yaml
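After relabeling, one way to watch whether the MIG manager picked up the change (the selector app=nvidia-mig-manager is an assumption about the default deployment, not from the thread):
- oc get node wrk-5 -o yaml | grep mig.config
- oc logs -n nvidia-gpu-operator -l app=nvidia-mig-manager --tail=20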
@schwesig @bharathappali NVIDIA discovered a fix for the NVIDIA Driver issues! Please review this PR and consider adding these fixes to your cluster.
Kruize successfully added the fixes to the kruize /test2 cluster yesterday.
Motivation
I see that the pods are not getting launched due to insufficient GPU resources and are going into CrashLoopBackOff.
Completion Criteria
Description
Completion dates
- Desired: 2024-10-23
- Required: 2024-10-25
Involved
@schwesig @shekhar316 @bharathappali @tssala23 @dystewart
maybe/FYI