intel / ai-containers

This repository contains Dockerfiles, scripts, YAML files, Helm charts, etc. used to scale out AI containers with versions of TensorFlow and PyTorch that have been optimized for Intel platforms. Scaling is done with Python, Docker, Kubernetes, Kubeflow, cnvrg.io, Helm, and other container orchestration frameworks for use in the cloud and on-premises.
https://intel.github.io/ai-containers/
Apache License 2.0

Intel® Gaudi AI SW Tools Operator version 0.0.1 fails to deploy on OCP 4.16.0 #518

Open braultatgithub opened 4 days ago

braultatgithub commented 4 days ago

Describe the bug

Intel® Gaudi AI SW Tools Operator version 0.0.1 fails to deploy on OCP 4.16.0

Message displayed: Operator failed install failed: deployment gaudi-ai-sw-tools-operator-controller-manager not ready before timeout: deployment "gaudi-ai-sw-tools-operator-controller-manager" exceeded its progress deadline

Error Logs

Phase | Time | Reason | Message
Pending | 16 Nov 2024, 11:05 | RequirementsUnknown | requirements not yet checked
Pending | 16 Nov 2024, 11:05 | RequirementsNotMet | one or more requirements couldn't be found
InstallReady | 16 Nov 2024, 11:05 | AllRequirementsMet | all requirements found, attempting install
Installing | 16 Nov 2024, 11:05 | InstallSucceeded | waiting for install components to report healthy
Installing | 16 Nov 2024, 11:05 | InstallWaiting | installing: waiting for deployment gaudi-ai-sw-tools-operator-controller-manager to become ready: deployment "gaudi-ai-sw-tools-operator-controller-manager" not available: Deployment does not have minimum availability.
Failed | 16 Nov 2024, 11:10 | InstallCheckFailed | install timeout
Pending | 16 Nov 2024, 11:10 | NeedsReinstall | installing: waiting for deployment gaudi-ai-sw-tools-operator-controller-manager to become ready: deployment "gaudi-ai-sw-tools-operator-controller-manager" not available: Deployment does not have minimum availability.
InstallReady | 16 Nov 2024, 11:10 | AllRequirementsMet | all requirements found, attempting install
Installing | 16 Nov 2024, 11:10 | InstallSucceeded | waiting for install components to report healthy
Installing | 16 Nov 2024, 11:10 | InstallWaiting | installing: waiting for deployment gaudi-ai-sw-tools-operator-controller-manager to become ready: deployment "gaudi-ai-sw-tools-operator-controller-manager" not available: Deployment does not have minimum availability.
Failed | 16 Nov 2024, 11:15 | InstallCheckFailed | install timeout
Pending | 16 Nov 2024, 11:15 | NeedsReinstall | installing: waiting for deployment gaudi-ai-sw-tools-operator-controller-manager to become ready: deployment "gaudi-ai-sw-tools-operator-controller-manager" not available: Deployment does not have minimum availability.
InstallReady | 16 Nov 2024, 11:15 | AllRequirementsMet | all requirements found, attempting install
Installing | 16 Nov 2024, 11:15 | InstallSucceeded | waiting for install components to report healthy
Failed | 16 Nov 2024, 11:15 | InstallCheckFailed | install failed: deployment gaudi-ai-sw-tools-operator-controller-manager not ready before timeout: deployment "gaudi-ai-sw-tools-operator-controller-manager" exceeded its progress deadline
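The status history above shows a retry loop rather than a single failure. As a rough illustration, the sketch below measures how long each install attempt ran before reaching the Failed phase. The event list is copied from the log; the function name `attempt_durations` is made up for this example, and the timestamps only have minute resolution, so the results are approximate.

```python
from datetime import datetime

# (phase, timestamp, reason) tuples copied from the status history in this issue.
EVENTS = [
    ("Pending", "16 Nov 2024, 11:05", "RequirementsUnknown"),
    ("Pending", "16 Nov 2024, 11:05", "RequirementsNotMet"),
    ("InstallReady", "16 Nov 2024, 11:05", "AllRequirementsMet"),
    ("Installing", "16 Nov 2024, 11:05", "InstallSucceeded"),
    ("Installing", "16 Nov 2024, 11:05", "InstallWaiting"),
    ("Failed", "16 Nov 2024, 11:10", "InstallCheckFailed"),
    ("Pending", "16 Nov 2024, 11:10", "NeedsReinstall"),
    ("InstallReady", "16 Nov 2024, 11:10", "AllRequirementsMet"),
    ("Installing", "16 Nov 2024, 11:10", "InstallSucceeded"),
    ("Installing", "16 Nov 2024, 11:10", "InstallWaiting"),
    ("Failed", "16 Nov 2024, 11:15", "InstallCheckFailed"),
    ("Pending", "16 Nov 2024, 11:15", "NeedsReinstall"),
    ("InstallReady", "16 Nov 2024, 11:15", "AllRequirementsMet"),
    ("Installing", "16 Nov 2024, 11:15", "InstallSucceeded"),
    ("Failed", "16 Nov 2024, 11:15", "InstallCheckFailed"),
]

# Timestamp layout used in the log; %b assumes an English locale ("Nov").
TIME_FORMAT = "%d %b %Y, %H:%M"


def attempt_durations(events):
    """Return whole minutes from the first Installing event of each attempt to its Failed event."""
    durations = []
    started = None
    for phase, stamp, _reason in events:
        when = datetime.strptime(stamp, TIME_FORMAT)
        if phase == "Installing" and started is None:
            started = when  # an install attempt begins
        elif phase == "Failed" and started is not None:
            durations.append(int((when - started).total_seconds() // 60))
            started = None  # wait for the next attempt
    return durations


print(attempt_durations(EVENTS))
```

The first two attempts each end with "install timeout" about five minutes after they start. The third fails within the same minute it begins, roughly ten minutes after the first attempt, with the "exceeded its progress deadline" message. In all three cases the controller-manager deployment never became available.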

Reproduction Instructions

Install Red Hat OpenShift Container Platform (OCP) 4.16.0.
Attempt to deploy the Intel® Gaudi AI SW Tools Operator version 0.0.1.
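The status history only reports that the controller-manager deployment lacks minimum availability; the underlying cause would normally show up in the deployment, pod, and event details. The following is a minimal sketch for gathering that information while reproducing, assuming the `oc` CLI is installed and logged in to the cluster. The namespace is not stated in this report, so it is passed in as a parameter, and the helper names `diagnostic_commands` and `collect` are hypothetical.

```python
import shutil
import subprocess

# Deployment name taken from the error message in this issue.
DEPLOYMENT = "gaudi-ai-sw-tools-operator-controller-manager"


def diagnostic_commands(namespace):
    """Build read-only `oc` commands for inspecting a stalled operator install."""
    return [
        ["oc", "get", "csv", "-n", namespace],
        ["oc", "describe", "deployment", DEPLOYMENT, "-n", namespace],
        ["oc", "get", "pods", "-n", namespace, "-o", "wide"],
        ["oc", "get", "events", "-n", namespace, "--sort-by=.lastTimestamp"],
    ]


def collect(namespace):
    """Run each command and return a dict mapping the command line to its output."""
    if shutil.which("oc") is None:
        raise RuntimeError("oc CLI not found on PATH")
    results = {}
    for cmd in diagnostic_commands(namespace):
        # check=False so that one failing command does not stop the collection
        proc = subprocess.run(cmd, capture_output=True, text=True, check=False)
        results[" ".join(cmd)] = proc.stdout + proc.stderr
    return results
```

Calling `collect("<operator-namespace>")` with the namespace the operator was installed into returns text that can be pasted into the Error Logs section. The commands only read cluster state and do not modify it.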

Affected Subfolder

Versions

Red Hat OpenShift Container Platform 4.16

braultatgithub commented 6 hours ago

Retested this with the latest OCP 4.16.23, and this time the operator was successfully installed.