nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

bug: kruize: gpus not allocatable #782

Closed (schwesig closed this issue 2 weeks ago)

schwesig commented 1 month ago

Motivation

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                Requests     Limits
  --------                --------     ------
  cpu                     804m (0%)    10m (0%)
  memory                  3213Mi (0%)  0 (0%)
  ephemeral-storage       0 (0%)       0 (0%)
  hugepages-1Gi           0 (0%)       0 (0%)
  hugepages-2Mi           0 (0%)       0 (0%)
  nvidia.com/gpu          0            0
  nvidia.com/mig-3g.20gb  0            0
  nvidia.com/mig-4g.20gb  0            0 
Capacity:
  cpu:                     128
  ephemeral-storage:       468097540Ki
  hugepages-1Gi:           0
  hugepages-2Mi:           0
  memory:                  1056462096Ki
  nvidia.com/gpu:          0
  nvidia.com/mig-3g.20gb:  0
  nvidia.com/mig-4g.20gb:  0
  pods:                    250
Allocatable:
  cpu:                     127500m
  ephemeral-storage:       430324950326
  hugepages-1Gi:           0
  hugepages-2Mi:           0
  memory:                  1055311120Ki
  nvidia.com/gpu:          0
  nvidia.com/mig-3g.20gb:  0
  nvidia.com/mig-4g.20gb:  0

I see that the pods are not being launched due to insufficient GPU resources; the node reports 0 for all nvidia.com/* resources in both Capacity and Allocatable.
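For reference, a minimal sketch of how this view can be pulled again (assuming the affected node is wrk-5, which is named later in this issue):

# Full capacity/allocatable/allocated view for the node
oc describe node wrk-5

# Only the GPU-related allocatable counts
oc get node wrk-5 -o jsonpath='{.status.allocatable}' | tr ',' '\n' | grep nvidia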

Completion Criteria

Description

Completion dates

Desired - 2024-10-23
Required - 2024-10-25

Involved

@schwesig @shekhar316 @bharathappali @tssala23 @dystewart

maybe/FYI

schwesig commented 1 month ago
dystewart commented 1 month ago

@schwesig we should also have kruize check their clusterPolicy for errors
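A minimal sketch of that check, assuming the GPU Operator's ClusterPolicy resource is named gpu-cluster-policy (the usual default):

# Overall operator state; anything other than "ready" points at a failing component
oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.status.state}{"\n"}'

# Conditions and per-component details
oc describe clusterpolicy gpu-cluster-policy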

schwesig commented 1 month ago

connected to this

computate commented 1 month ago

@schwesig you might check if the nvidia-operator-validator pods in the nvidia-gpu-operator namespace are failing to start with errors in the plugin-validation container to confirm it's the same problem as above.
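A sketch of those checks (namespace and container name as in the comment above; the label selector is an assumption):

# List the validator pods
oc get pods -n nvidia-gpu-operator -l app=nvidia-operator-validator

# Logs of the plugin-validation container of a failing pod (pod name is a placeholder)
oc logs -n nvidia-gpu-operator <validator-pod-name> -c plugin-validation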

bharathappali commented 4 weeks ago

@schwesig If my understanding is correct, the node was restarted, and the config we added to the default MIG config map was deleted (I expect the NVIDIA GPU Operator rewrote the config map with its defaults).

Node wrk-5 still has the mig.config label set to the custom kruize config (which is no longer present after the rewrite), so ideally the MIG config manager should fall back to the default setting (all-disabled) when the desired config is missing from the config map. For some reason that hasn't happened.
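A sketch of how both points can be verified, assuming the operator's default MIG config map is default-mig-parted-config in the nvidia-gpu-operator namespace:

# Is the custom profile still present in the MIG config map?
oc get configmap default-mig-parted-config -n nvidia-gpu-operator -o yaml | grep -n kruize-mig-config

# Which profile is the node labeled with, and did it apply?
oc get node wrk-5 -L nvidia.com/mig.config -L nvidia.com/mig.config.state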

schwesig commented 3 weeks ago

@schwesig you might check if the nvidia-operator-validator pods in the nvidia-gpu-operator namespace are failing to start with errors in the plugin-validation container to confirm it's the same problem as above.

@computate (screenshots attached)

time="2024-10-25T08:27:35Z" level=info msg="version: 0fe1e8db, commit: 0fe1e8d"
time="2024-10-25T08:27:36Z" level=info msg="pod nvidia-cuda-validator-46k29 is curently in Pending phase"
time="2024-10-25T08:27:41Z" level=info msg="pod nvidia-cuda-validator-46k29 is curently in Pending phase"
time="2024-10-25T08:27:46Z" level=info msg="pod nvidia-cuda-validator-46k29 is curently in Pending phase"
time="2024-10-25T08:27:51Z" level=info msg="pod nvidia-cuda-validator-46k29 is curently in Pending phase"
time="2024-10-25T08:27:56Z" level=info msg="pod nvidia-cuda-validator-46k29 is curently in Pending phase"
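To see why the validator pod stays Pending, the pod's scheduling events are usually the quickest check (a sketch; the pod name is taken from the log above):

# Events at the end of the output typically show e.g. "Insufficient nvidia.com/gpu"
oc describe pod nvidia-cuda-validator-46k29 -n nvidia-gpu-operator | grep -A 10 'Events:'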

schwesig commented 3 weeks ago

fyi:

tried https://docs.openshift.com/container-platform/4.15/nodes/nodes/nodes-nodes-rebooting.html, but it was not successful and didn't solve the problem yet. Steps used (see also the note after the list):

- oc adm cordon wrk-5
- oc adm drain wrk-5 --ignore-daemonsets --delete-emptydir-data --force
- oc adm drain wrk-5 --ignore-daemonsets --delete-emptydir-data --force --disable-eviction
- oc debug node/wrk-5
- chroot /host
- systemctl reboot
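
One note on the list above (my own addition, not something stated in the thread): after the node reboots and reports Ready again, it stays cordoned until it is explicitly uncordoned:

- oc adm uncordon wrk-5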


schwesig commented 3 weeks ago

checks from today

wrk-5 yaml: nvidia.com/mig.config: all-disabled (line 32)

There were repeated attempts to apply the kruize-mig-config profile, which failed because that profile is missing from the config map. The log shows "selected mig-config not present", and the nvidia.com/mig.config.state label was set to 'failed'.

The configuration was then changed to all-2g.10gb, enabling multiple MIG instances per GPU. The change involved shutting down GPU clients and resetting all GPUs, and it was applied successfully (nvidia.com/mig.config.state=success).

A reversion to the all-disabled configuration succeeded after shutting down relevant clients, applying the changes, and setting the mig.config.state back to success.
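
A sketch of how these transitions can be followed while a profile is being applied (the mig-manager label selector is an assumption):

# Follow the mig-manager while it applies or reverts a profile
oc logs -n nvidia-gpu-operator -l app=nvidia-mig-manager -f

# Watch the state label flip between failed and success
oc get node wrk-5 -L nvidia.com/mig.config.state -w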

Unsliced GPUs can be selected again; sliced ones cannot. (screenshots attached)

schwesig commented 3 weeks ago

@schwesig If my understanding is correct, the node was restarted, and the config we added to the default MIG config map was deleted (I expect the NVIDIA GPU Operator rewrote the config map with its defaults).

Node wrk-5 still has the mig.config label set to the custom kruize config (which is no longer present after the rewrite), so ideally the MIG config manager should fall back to the default setting (all-disabled) when the desired config is missing from the config map. For some reason that hasn't happened.

Thanks for going through it this morning. That at least worked as a workaround.

schwesig commented 3 weeks ago

Adding the link we used with Cristiano to set it to "default": https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/mig-ocp.html

schwesig commented 3 weeks ago

oc get node wrk-5 -o yaml

MIG_CONFIGURATION=all-2g.10gb && \
  oc label node/wrk-5 nvidia.com/mig.config=$MIG_CONFIGURATION --overwrite

oc get node wrk-5 -o yaml

MIG_CONFIGURATION=kruize-mig-config && \
  oc label node/wrk-5 nvidia.com/mig.config=$MIG_CONFIGURATION --overwrite

oc get node wrk-5 -o yaml

computate commented 3 weeks ago

@schwesig @bharathappali NVIDIA discovered a fix for the NVIDIA Driver issues! Please review this PR and consider adding these fixes to your cluster.

https://github.com/OCP-on-NERC/nerc-ocp-config/pull/586

schwesig commented 2 weeks ago

Kruize successfully added the fixes to the kruize /test2 cluster yesterday