bug: kruize: gpus not allocatable

schwesig commented 1 month ago

Motivation

message from the kruize team (https://redhat-internal.slack.com/archives/C07NJB3URP1/p1729578977380869)
about
- https://console-openshift-console.apps.nerc-ocp-test-2.nerc.mghpcc.org/
- GPU node: wrk-5 (https://console-openshift-console.apps.nerc-ocp-test-2.nerc.mghpcc.org/k8s/cluster/nodes/wrk-5)

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                Requests     Limits
  --------                --------     ------
  cpu                     804m (0%)    10m (0%)
  memory                  3213Mi (0%)  0 (0%)
  ephemeral-storage       0 (0%)       0 (0%)
  hugepages-1Gi           0 (0%)       0 (0%)
  hugepages-2Mi           0 (0%)       0 (0%)
  nvidia.com/gpu          0            0
  nvidia.com/mig-3g.20gb  0            0
  nvidia.com/mig-4g.20gb  0            0

Capacity:
  cpu:                     128
  ephemeral-storage:       468097540Ki
  hugepages-1Gi:           0
  hugepages-2Mi:           0
  memory:                  1056462096Ki
  nvidia.com/gpu:          0
  nvidia.com/mig-3g.20gb:  0
  nvidia.com/mig-4g.20gb:  0
  pods:                    250

Allocatable:
  cpu:                     127500m
  ephemeral-storage:       430324950326
  hugepages-1Gi:           0
  hugepages-2Mi:           0
  memory:                  1055311120Ki
  nvidia.com/gpu:          0
  nvidia.com/mig-3g.20gb:  0
  nvidia.com/mig-4g.20gb:  0

I see that the pods are not getting launched due to insufficient GPU resources

on (https://console-openshift-console.apps.nerc-ocp-test-2.nerc.mghpcc.org/k8s/ns/nvidia-gpu-operator/pods/nvidia-device-plugin-daemonset-lk48j/logs)
- nvidia-device-plugin-daemonset-lk48j: CrashLoopBackOff
- E1022 12:46:50.325831 1 main.go:132] error starting plugins: error getting plugins: failed to construct NVML resource managers: error building device map: error building device map from config.resources: invalid MIG configuration: at least one device with migEnabled=true was not configured correctly: error visiting device: device 0 has an invalid MIG configuration

Completion Criteria

pods can launch
GPUs can be allocated

Description

[x] Finding error
[ ] maybe something to do with
- [ ] https://github.com/nerc-project/operations/issues/773 (machine config applied on Sunday) --> NERC team
- [ ] gap between ACM initial created basic cluster and manually added configs --> kruize team
[ ] Fix configs or settings (tbd)

Completion dates

Desired - 2024-10-23 Required - 2024-10-25

Involved

@schwesig @shekhar316 @bharathappali @tssala23 @dystewart

maybe/FYI

schwesig commented 1 month ago

asked kruize team for checking MIG configuration

dystewart commented 1 month ago

@schwesig we should also have kruize check their clusterPolicy for errors

schwesig commented 1 month ago

connected to this

https://github.com/nerc-project/operations/issues/768

computate commented 1 month ago

@schwesig you might check if the nvidia-operator-validator pods in the nvidia-gpu-operator namespace are failing to start with errors in the plugin-validation container to confirm it's the same problem as above.

bharathappali commented 4 weeks ago

@schwesig If my understanding is correct, I feel the node got restarted and I see the config we added to the default mig config map got deleted (expecting that nvidia gpu operator has rewritten the config map with default).

Node wrk-5 has the label mig.config set to the custom kruize config (which is not present due to rewrite) so Ideally the mig config manager should choose the default setting (all-disabled) in case of missing the desired config in the config map. But with some reason it haven't happened.

schwesig commented 3 weeks ago

@schwesig you might check if the nvidia-operator-validator pods in the nvidia-gpu-operator namespace are failing to start with errors in the plugin-validation container to confirm it's the same problem as above.

@computate

time="2024-10-25T08:27:35Z" level=info msg="version: 0fe1e8db, commit: 0fe1e8d"
time="2024-10-25T08:27:36Z" level=info msg="pod nvidia-cuda-validator-46k29 is curently in Pending phase"
time="2024-10-25T08:27:41Z" level=info msg="pod nvidia-cuda-validator-46k29 is curently in Pending phase"
time="2024-10-25T08:27:46Z" level=info msg="pod nvidia-cuda-validator-46k29 is curently in Pending phase"
time="2024-10-25T08:27:51Z" level=info msg="pod nvidia-cuda-validator-46k29 is curently in Pending phase"
time="2024-10-25T08:27:56Z" level=info msg="pod nvidia-cuda-validator-46k29 is curently in Pending phase"

schwesig commented 3 weeks ago

fyi:

tried https://docs.openshift.com/container-platform/4.15/nodes/nodes/nodes-nodes-rebooting.html not successful, didn't solve the problem yet

- oc adm cordon wrk-5
- oc adm drain wrk-5 --ignore-daemonsets --delete-emptydir-data --force
- oc adm drain wrk-5 --ignore-daemonsets --delete-emptydir-data --force --disable-eviction
- oc debug node/wrk-5
- chroot /host
- systemctl reboot

schwesig commented 3 weeks ago

checks from today

wrk-5 yaml: nvidia.com/mig.config: all-disabled (line 32)

There were repeated attempts to apply the kruize-mig-config, which failed due to its absence in the configuration file. The system logged "selected mig-config not present", leading to the nvidia.com/mig.config.state label being set to 'failed'.

The configuration was changed to all-2g.10gb, enabling multiple instances on GPUs. The configuration change involved shutting down GPU clients and resetting all GPUs, resulting in a successful application (nvidia.com/mig.config.state=success).

A reversion to the all-disabled configuration succeeded after shutting down relevant clients, applying the changes, and setting the mig.config.state back to success.

Unsliced GPUs can be selected again. Sliced one not.

schwesig commented 3 weeks ago

@schwesig If my understanding is correct, I feel the node got restarted and I see the config we added to the default mig config map got deleted (expecting that nvidia gpu operator has rewritten the config map with default).

Node wrk-5 has the label mig.config set to the custom kruize config (which is not present due to rewrite) so Ideally the mig config manager should choose the default setting (all-disabled) in case of missing the desired config in the config map. But with some reason it haven't happened.

thanks for going through it this morning. That at least worked for a workaround.

schwesig commented 3 weeks ago

adding this link we used with Cristiano to set it to "default" https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/mig-ocp.html

schwesig commented 3 weeks ago

oc get node wrk-5 -o yaml MIG_CONFIGURATION=all-2g.10gb && \n oc label node/wrk-5 nvidia.com/mig.config=$MIG_CONFIGURATION --overwrite oc get node wrk-5 -o yaml MIG_CONFIGURATION=kruize-mig-config && \n oc label node/wrk-5 nvidia.com/mig.config=$MIG_CONFIGURATION --overwrite oc get node wrk-5 -o yaml

computate commented 3 weeks ago

@schwesig @bharathappali NVIDIA discovered a fix for the NVIDIA Driver issues! Please review this PR and consider adding these fixes to your cluster.

https://github.com/OCP-on-NERC/nerc-ocp-config/pull/586

schwesig commented 2 weeks ago

kruize succesfully added the fixes to the cruise /test2 cluster yesterday

nerc-project / operations