carlmes opened this issue 3 days ago (status: Open)
When building a brand new cluster using these settings:
Run bootstrap.sh
and choose: 6) rhoai-stable-2.13-aws-gpu
The GPU Operator application in Argo CD goes out of sync every couple of seconds:
Looking at the sync error, we see that Argo is trying to add these lines:
But the GPU Operator seems to be removing them.
Also, the monitoring console in OpenShift appears to be broken in this release:
It's probably just a new update within the NVIDIA Operator that must be incorporated into the kustomize templates in this project. The source for the lines that will not synchronize is at: https://github.com/redhat-ai-services/ai-accelerator/blob/39a2cf4cb1d7278bb70f35587e1b68c8755fd5ea/components/operators/gpu-operator-certified/operator/components/console-plugin/consoleplugin.yaml#L17
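Until the templates are updated, one possible stopgap is to tell Argo CD to ignore the fields that keep getting removed, so the application stops flapping. This is only a sketch, not a fix: the Application name, its namespace, and the JSON pointer below are placeholders (assumptions, not taken from this repo), and the pointer has to be replaced with the actual path shown in the sync diff.

```yaml
# Fragment of an Argo CD Application; only the fields relevant to the workaround are shown.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gpu-operator            # hypothetical name, use the real Application name
  namespace: openshift-gitops   # assumption: default OpenShift GitOps namespace
spec:
  ignoreDifferences:
    - group: console.openshift.io
      kind: ConsolePlugin
      jsonPointers:
        - /spec/REPLACE_ME      # placeholder: path of the field reported in the sync diff
  syncPolicy:
    syncOptions:
      # Also skip the ignored fields when syncing, not only when diffing
      - RespectIgnoreDifferences=true
```

This only hides the drift. The underlying mismatch between consoleplugin.yaml and what the operator accepts on 4.17 would still need to be resolved in the kustomize templates.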
Note: this failed twice in a row on two brand new OpenShift 4.17 clusters, but did not appear to be an issue on earlier versions such as 4.15 and 4.16 during enablement testing.