kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0
1.57k stars, 686 forks

Add workflows to verify if examples are valid #2014

Open tenzen-y opened 6 months ago

tenzen-y commented 6 months ago

We have many examples that help users understand how to run TrainingJobs. However, nothing currently verifies that these examples are valid, so I propose adding CI workflows that check the examples actually work.

The Katib workflows would be a good model for implementing this in the training-operator: https://github.com/kubeflow/katib/blob/master/.github/workflows/e2e-test-pytorch-mnist.yaml

/good-first-issue

google-oss-prow[bot] commented 6 months ago

@tenzen-y: This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-good-first-issue command.

shivas1516 commented 6 months ago

I'd like to work on this issue and add a GitHub Action that verifies the training operator examples. It matches my skill level. Any guidance you can provide would be greatly appreciated and would help me make progress faster.

/assign

shivas1516 commented 6 months ago

@tenzen-y Is adding e2e tests to the workflow necessary for verifying the Training Operator examples, as in Katib? Could you provide some additional information on this? It would help me solve this issue.

tenzen-y commented 4 months ago

> @tenzen-y Is adding e2e tests to the workflow necessary for verifying the Training Operator examples, as in Katib? Could you provide some additional information on this? It would help me solve this issue.

We need to implement the following steps in the script:

  1. Build example and operator images
  2. Start KinD cluster
  3. Load built images into the cluster
  4. Set up the TrainingOperator
  5. Create a Job with built images
  6. Verify that the created Job succeeded
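
As a rough illustration, the six steps above could be driven by a small script like the following sketch. Everything in `ExampleConfig` (cluster name, image tags, Dockerfile and manifest paths, the Job resource name) is a hypothetical placeholder rather than the actual layout of this repository. The sketch also assumes that the operator manifests and the Job manifest already reference the locally built image tags, and that the Job reports a `Succeeded` condition that `kubectl wait` can observe.

```python
import subprocess
from dataclasses import dataclass


@dataclass(frozen=True)
class ExampleConfig:
    # All values are hypothetical placeholders; adjust them to the real repository layout.
    cluster_name: str = "training-operator-e2e"
    operator_image: str = "example.local/training-operator:e2e"
    operator_dockerfile: str = "path/to/operator/Dockerfile"
    example_image: str = "example.local/example-job:e2e"
    example_context: str = "path/to/example"
    operator_manifests: str = "path/to/operator/manifests"  # a kustomize directory (assumption)
    job_manifest: str = "path/to/example/job.yaml"  # assumed to reference example_image
    job_resource: str = "pytorchjob/example-job"  # assumed kind/name of the created Job
    timeout: str = "600s"


def build_steps(cfg):
    """Return (description, command) pairs mirroring the six steps in the list."""
    return [
        ("1. build operator image",
         ["docker", "build", "-t", cfg.operator_image, "-f", cfg.operator_dockerfile, "."]),
        ("1. build example image",
         ["docker", "build", "-t", cfg.example_image, cfg.example_context]),
        ("2. start KinD cluster",
         ["kind", "create", "cluster", "--name", cfg.cluster_name]),
        ("3. load images into the cluster",
         ["kind", "load", "docker-image", cfg.operator_image, cfg.example_image,
          "--name", cfg.cluster_name]),
        ("4. set up the training operator",
         ["kubectl", "apply", "-k", cfg.operator_manifests]),
        ("5. create the Job",
         ["kubectl", "apply", "-f", cfg.job_manifest]),
        ("6. verify the Job succeeded",
         ["kubectl", "wait", "--for=condition=Succeeded", cfg.job_resource,
          "--timeout", cfg.timeout]),
    ]


def run(steps, dry_run=True):
    """Print each step; execute it only when dry_run is False."""
    for description, command in steps:
        print(f"{description}: {' '.join(command)}")
        if not dry_run:
            # check=True aborts on the first failing step, which fails the CI job.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run(build_steps(ExampleConfig()), dry_run=True)
```

Run as-is, the script only prints the commands. A CI workflow would call `run(..., dry_run=False)` and would probably also wait for the operator Deployment to become available between steps 4 and 5, which this sketch leaves out.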
github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

andreyvelich commented 1 month ago

/remove-lifecycle stale