kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0

Not able to use Kubeflow PyTorchJob launcher as a Kubeflow Pipeline component - Always waiting #8051

Closed kanwaljitkhurmi closed 5 months ago

kanwaljitkhurmi commented 2 years ago

Environment

I am trying to use the Kubeflow PyTorchJob launcher component in a Kubeflow pipeline. However, the pipeline component waits endlessly in the main thread with the following logs and never proceeds to create the master and worker pods.

Generating job template.
Creating launcher client.
Submitting CR.
Creating kubeflow.org/pytorchjobs pytorch-cnn-dist-file-c3 in namespace kubeflow-user-example-com.
Created kubeflow.org/pytorchjobs pytorch-cnn-dist-file-c3 in namespace kubeflow-user-example-com.
Monitoring job until status is any of ['Succeeded', 'Failed'].
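
The last log line means the launcher is polling the PyTorchJob resource until it reports `Succeeded` or `Failed`, so the resource's own status conditions are the first place to look. Below is a minimal sketch, assuming `kubectl` is on the PATH and can read the namespace; the helper names `fetch_pytorchjob` and `summarize_conditions` are hypothetical and not part of the launcher.

```python
import json
import subprocess


def fetch_pytorchjob(name: str, namespace: str) -> dict:
    """Return the PyTorchJob custom resource as a dict (needs kubectl access)."""
    out = subprocess.run(
        ["kubectl", "get", "pytorchjobs.kubeflow.org", name,
         "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)


def summarize_conditions(job: dict) -> list:
    """List (type, status, reason) per condition; empty if no status was ever set."""
    conditions = job.get("status", {}).get("conditions") or []
    return [(c.get("type"), c.get("status"), c.get("reason")) for c in conditions]
```

Calling `summarize_conditions(fetch_pytorchjob('pytorch-cnn-dist-file-c3', 'kubeflow-user-example-com'))` would show how far the job got. An empty list would suggest that no controller has reconciled the resource yet, which points at the training operator rather than at the pipeline.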

Code:

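!pip install kfp==1.8.4

from kfp import components, dsl

# Load the PyTorchJob launcher component definition from the pipelines repository.
pytorch_job_op = components.load_component_from_url('https://raw.githubusercontent.com/kubeflow/pipelines/master/components/kubeflow/pytorch-launcher/component.yaml')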

@dsl.pipeline(name="PyTorch Training pipeline", description="Sample training job test")
def pytorch_cnn_n_b_yaml(
    namespace='kubeflow-user-example-com'
):

    train_task = pytorch_job_op(
        name='pytorch-cnn-dist-file-c3',
        namespace=namespace,
        master_spec='{ \
          "replicas": 1, \
          "restartPolicy": "OnFailure", \
          "template": { \
            "metadata": { \
              "annotations": { \
                "sidecar.istio.io/inject": "false" \
              } \
            }, \
            "spec": { \
              "containers": [ \
                { \
                  "name": "pytorch1", \
                  "image": "763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:1.12.0-gpu-py38-cu116-ubuntu20.04-e3", \
                  "args": [ \
                    "python", \
                    "./sc-claim-dlc/mnist.py", \
                    "--epochs", "5", \
                    "--seed", "7", \
                    "--log-interval", "60" \
                  ], \
                  "resources": { \
                    "limits": { \
                      "nvidia.com/gpu": 2 \
                    } \
                  }, \
                  "volumeMounts": [ \
                    { \
                      "mountPath": "/sc-claim-dlc", \
                      "name": "sc-claim-dlc" \
                    } \
                  ] \
                } \
              ], \
              "volumes": [ \
                { \
                  "name": "sc-claim-dlc", \
                  "persistentVolumeClaim": { \
                    "claimName": "sc-claim-dlc" \
                  } \
                } \
              ] \
            } \
          } \
        }', 
        worker_spec='{ \
          "replicas": 1, \
          "restartPolicy": "OnFailure", \
          "template": { \
            "metadata": { \
              "annotations": { \
                "sidecar.istio.io/inject": "false" \
              } \
            }, \
            "spec": { \
              "containers": [ \
                { \
                  "name": "pytorch2", \
                  "image": "763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:1.12.0-gpu-py38-cu116-ubuntu20.04-e3", \
                  "args": [ \
                    "python", \
                    "./sc-claim-dlc/mnist.py", \
                    "--epochs", "5", \
                    "--seed", "7", \
                    "--log-interval", "60" \
                  ], \
                  "resources": { \
                    "limits": { \
                      "nvidia.com/gpu": 1 \
                    } \
                  }, \
                  "volumeMounts": [ \
                    { \
                      "mountPath": "/sc-claim-dlc", \
                      "name": "sc-claim-dlc" \
                    } \
                  ] \
                } \
              ], \
              "volumes": [ \
                { \
                  "name": "sc-claim-dlc", \
                  "persistentVolumeClaim": { \
                    "claimName": "sc-claim-dlc" \
                  } \
                } \
              ] \
            } \
          } \
        }',
        delete_after_done=False
    )
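
The two backslash-continued JSON strings above differ only in the container name and the GPU limit, so one way to make them harder to mistype is to build them from a Python dict and serialize with `json.dumps`. This is a sketch under that assumption; `replica_spec`, `IMAGE`, and `CLAIM` are hypothetical names introduced here, and every value is copied from the pipeline code above.

```python
import json

IMAGE = ("763104351884.dkr.ecr.us-west-2.amazonaws.com/"
         "pytorch-training:1.12.0-gpu-py38-cu116-ubuntu20.04-e3")
CLAIM = "sc-claim-dlc"


def replica_spec(container_name: str, gpus: int) -> dict:
    """Build one replica spec (master or worker) as a plain dict."""
    return {
        "replicas": 1,
        "restartPolicy": "OnFailure",
        "template": {
            "metadata": {"annotations": {"sidecar.istio.io/inject": "false"}},
            "spec": {
                "containers": [{
                    "name": container_name,
                    "image": IMAGE,
                    "args": ["python", "./sc-claim-dlc/mnist.py",
                             "--epochs", "5", "--seed", "7",
                             "--log-interval", "60"],
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                    "volumeMounts": [{"mountPath": "/sc-claim-dlc", "name": CLAIM}],
                }],
                "volumes": [{"name": CLAIM,
                             "persistentVolumeClaim": {"claimName": CLAIM}}],
            },
        },
    }


# JSON strings equivalent to the hand-written master_spec and worker_spec.
master_spec = json.dumps(replica_spec("pytorch1", gpus=2))
worker_spec = json.dumps(replica_spec("pytorch2", gpus=1))
```

The resulting strings can be passed as `master_spec=master_spec` and `worker_spec=worker_spec`. One thing that may be worth checking, stated here as an assumption rather than a confirmed diagnosis: the training operator's PyTorchJob validation, as far as I know, expects a container named `pytorch` in each replica, so the names `pytorch1` and `pytorch2` could be what keeps the pods from being created.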

Can you help?


kanwaljitkhurmi commented 2 years ago

Has anyone else faced this issue?

zijianjoy commented 2 years ago

/cc @jagadeeshi2i Would you like to help with this issue? Thank you!

jagadeeshi2i commented 2 years ago

@kanwaljitkhurmi Did the launcher start the worker and master pods? Can you share the logs, or the output of describing the pod?

I was able to launch a PyTorchJob with this example: https://github.com/kubeflow/pipelines/blob/master/samples/contrib/pytorch-samples/Pipeline-Bert-Dist.ipynb
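
To answer the question about whether any pods were started, the pods belonging to the job can be listed by label and their phases summarized. This is a rough sketch, assuming `kubectl` access and assuming the operator labels pods with `training.kubeflow.org/job-name` (older releases may use a different label key); the helper names `fetch_job_pods` and `describe_pods` are hypothetical.

```python
import json
import subprocess

# Assumption: label key set by recent training-operator releases; older ones may differ.
JOB_NAME_LABEL = "training.kubeflow.org/job-name"


def fetch_job_pods(job_name: str, namespace: str) -> list:
    """Return the pods labelled with the job name as a list of dicts (needs kubectl)."""
    out = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace,
         "-l", f"{JOB_NAME_LABEL}={job_name}", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out).get("items", [])


def describe_pods(pods: list) -> list:
    """Return one 'name: phase (reason: message; ...)' line per pod."""
    lines = []
    for pod in pods:
        name = pod.get("metadata", {}).get("name", "<unknown>")
        status = pod.get("status", {})
        phase = status.get("phase", "Unknown")
        # Conditions with status "False" usually explain why a pod is stuck.
        blockers = [
            f"{c.get('reason')}: {c.get('message')}"
            for c in status.get("conditions", [])
            if c.get("status") == "False"
        ]
        suffix = f" ({'; '.join(blockers)})" if blockers else ""
        lines.append(f"{name}: {phase}{suffix}")
    return lines
```

An empty result from `describe_pods(fetch_job_pods('pytorch-cnn-dist-file-c3', 'kubeflow-user-example-com'))` would mean no pods were created at all, while a `Pending` pod with an `Unschedulable` reason would point at resource requests such as the GPU limits.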


github-actions[bot] commented 6 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

rimolive commented 5 months ago

Closing this issue. No activity for more than a year.

close

rimolive commented 5 months ago

/close

google-oss-prow[bot] commented 5 months ago

@rimolive: Closing this issue.

In response to [this](https://github.com/kubeflow/pipelines/issues/8051#issuecomment-2016973308):

>/close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.