rh-aiservices-bu / fraud-detection

https://rh-aiservices-bu.github.io/fraud-detection/
Apache License 2.0

The `7_get_data_train_upload.py` generated pipeline gets stuck on ROSA-hosted OCP cluster #22

Open · adelton opened this issue 3 months ago

adelton commented 3 months ago

I am following the tutorial at https://access.redhat.com/documentation/en-us/red_hat_openshift_ai_self-managed/2.7/html-single/openshift_ai_tutorial_-_fraud_detection_example/index, which uses this repo: https://github.com/rh-aiservices-bu/fraud-detection.

The section https://access.redhat.com/documentation/en-us/red_hat_openshift_ai_self-managed/2.7/html-single/openshift_ai_tutorial_-_fraud_detection_example/index#running-a-pipeline-generated-from-python-code shows how to use pipeline/7_get_data_train_upload.py to build pipeline/7_get_data_train_upload.yaml. (A small issue with that section is reported in https://issues.redhat.com/browse/RHOAIENG-4448.)

However, when I import the generated pipeline YAML file, the triggered run stays in the Running state in the OpenShift AI dashboard. Specifically, the get-data task is shown as Pending.

There sadly seems to be no way to debug this from the OpenShift AI dashboard. However, the TaskRuns view in the OpenShift Console shows a stream of events:

    0/2 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling..
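Since the dashboard offers no debugging path, the same scheduling events can be pulled straight from the Kubernetes API. A minimal sketch using the kubernetes Python client; the namespace name fraud-detection is my assumption, adjust it to your data science project:

    # Sketch: print events for PersistentVolumeClaims in the pipeline namespace.
    # Assumes a kubeconfig with access to the cluster; the namespace name
    # "fraud-detection" is an assumption, not taken from the repo.
    from kubernetes import client, config

    config.load_kube_config()
    core = client.CoreV1Api()

    events = core.list_namespaced_event(
        "fraud-detection",
        field_selector="involvedObject.kind=PersistentVolumeClaim",
    )
    for ev in events.items:
        print(ev.involved_object.name, ev.reason, ev.message)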

Checking the YAML of the imported pipeline back in the OpenShift AI dashboard shows:

  workspaces:
    - name: train-upload-stock-kfp
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 2Gi

Logging in to the OpenShift Console as admin, I see that the fraud-detection PVC (as well as the one created for MinIO) uses the storage class gp3-csi, not gp3.
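To double-check which storage classes actually exist on the cluster and which one is the default, oc get storageclass answers it; the same check, sketched with the kubernetes Python client:

    # Sketch: list storage classes and flag the cluster default.
    from kubernetes import client, config

    config.load_kube_config()
    storage = client.StorageV1Api()

    for sc in storage.list_storage_class().items:
        annotations = sc.metadata.annotations or {}
        default = annotations.get("storageclass.kubernetes.io/is-default-class") == "true"
        print(sc.metadata.name, "(default)" if default else "")

On the cluster above, this should show gp3-csi as the default and no gp3 at all, matching the unbound-PVC events.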

Should pipeline/7_get_data_train_upload.py avoid forcing the storage class?

Should the tutorial text be updated to document the DEFAULT_STORAGE_CLASS environment variable that pipeline/7_get_data_train_upload.py consumes? The workshop page at https://rh-aiservices-bu.github.io/fraud-detection/fraud-detection-workshop/running-a-pipeline-generated-from-python-code.html does not mention storage classes either.
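For context, here is a hedged sketch of how the generated code could honor that variable while omitting storageClassName entirely when it is unset, so the cluster default applies. The dict mirrors the workspace YAML above; none of the names here are the repo's actual code:

    # Sketch: only pin a storage class when DEFAULT_STORAGE_CLASS is set;
    # otherwise omit storageClassName so the cluster default is used.
    import os

    def volume_claim_template(size: str = "2Gi") -> dict:
        spec = {
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": size}},
        }
        storage_class = os.environ.get("DEFAULT_STORAGE_CLASS")
        if storage_class:
            spec["storageClassName"] = storage_class
        return {"spec": spec}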

adelton commented 3 months ago

Data point: removing the storageClassName: gp3 line from the YAML and reimporting the pipeline makes the get-data task pass.

erwangranger commented 3 months ago

@rcarrata, I know you were planning to go through this content. If you happen to run through this one, can you test and apply this change?