redhat-ai-services / ai-accelerator

The AI Accelerator is a template project for setting up Red Hat OpenShift AI using GitOps

NVIDIA GPU Operator Constantly Resynchronizing in ArgoCD #76

Open carlmes opened 3 days ago

carlmes commented 3 days ago

When building a brand new cluster using these settings:

The GPU Operator application in ArgoCD goes out of sync every couple of seconds:

image

Looking at the sync error, we see that Argo is trying to add these lines:

image

But the GPU Operator seems to be removing them.

Also, the monitoring console in OpenShift appears to be broken in this release:

image

This is probably just a change in a newer NVIDIA GPU Operator release that needs to be incorporated into this project's kustomize templates. The source for the lines that will not synchronize is at: https://github.com/redhat-ai-services/ai-accelerator/blob/39a2cf4cb1d7278bb70f35587e1b68c8755fd5ea/components/operators/gpu-operator-certified/operator/components/console-plugin/consoleplugin.yaml#L17
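Until the templates are updated, one possible stopgap is to tell ArgoCD to ignore the contested fields. The fragment below is only a sketch, not a confirmed fix: the Application name and the JSON pointer are hypothetical placeholders (the exact fields are only visible in the screenshot above), and the fragment omits the `project`, `source`, and `destination` fields a real Application needs.

```yaml
# Sketch only: fragment of an ArgoCD Application, not a complete manifest.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gpu-operator            # hypothetical name; use the real Application name
spec:
  ignoreDifferences:
    - group: console.openshift.io
      kind: ConsolePlugin
      jsonPointers:
        # Hypothetical placeholder: replace with the path of the field(s)
        # that the operator keeps removing, as shown in the ArgoCD diff.
        - /spec/REPLACE_WITH_DRIFTING_FIELD
  syncPolicy:
    syncOptions:
      # Also skip the ignored fields when applying, so a sync does not
      # re-add what the operator strips out again.
      - RespectIgnoreDifferences=true
```

This would only hide the drift. It would not address the broken monitoring console, so updating the `consoleplugin.yaml` template is likely still the real fix.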

carlmes commented 3 days ago

Noting that this failed twice in a row on two brand new 4.17 clusters, but it did not appear to be an issue on earlier versions such as 4.15 and 4.16 during enablement testing.