kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0
1.62k stars 701 forks source link

[Test] E2e Tests for Notebook Examples #2246

Open Electronic-Waste opened 2 months ago

Electronic-Waste commented 2 months ago

What you would like to be added?

We plan to add e2e tests for notebooks in CI/CD, run with papermill.

REF: https://github.com/nteract/papermill

Why is this needed?

This will help us ensure the correctness of our notebook examples for data scientists.

Love this feature?

Give it a 👍 We prioritize the features with most 👍

andreyvelich commented 2 months ago

/remove-label lifeycycle/needs-triage /area testing

google-oss-prow[bot] commented 2 months ago

@andreyvelich: The label(s) /remove-label lifeycycle/needs-triage cannot be applied. These labels are supported: tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, lifecycle/needs-triage. Is this label configured under labels -> additional_labels or labels -> restricted_labels in plugin.yaml?

In response to [this](https://github.com/kubeflow/training-operator/issues/2246#issuecomment-2324676310): >/remove-label lifeycycle/needs-triage >/area testing Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
andreyvelich commented 2 months ago

/remove-label lifecycle/needs-triage

andreyvelich commented 2 months ago

/good-first-issue

google-oss-prow[bot] commented 2 months ago

@andreyvelich: This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-good-first-issue command.

In response to [this](https://github.com/kubeflow/training-operator/issues/2246): >/good-first-issue Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
saileshd1402 commented 2 months ago

/assign

andreyvelich commented 2 months ago

cc @shravan-achar @akshaychitneni @bigsur0 for the Papermill usage in Kubeflow Training Operator.

saileshd1402 commented 2 months ago

Hi All! Regarding this issue I am facing a few blockers for some of the notebooks to run e2e tests on. I would like the community's opinions on the same:

  1. Fine-Tune-BERT-LLM.ipynb: This notebook requires GPUs and an AWS S3 Bucket. I am not aware if we can use GPUs in CI and what S3 Buckets we can use.
  2. train_api_hf_dataset.ipynb: I have faced the same problem as pointed out by this issue-https://github.com/kubeflow/training-operator/issues/2176. Should do the change suggested in the issue?
  3. train_api_s3_dataset.ipynb: I am not sure what to use as S3 bucket for this case in CI.

Due to these blockers on some of the files, I wanted to ask if the community thinks it's a good idea to raise separate PRs for each notebook? Please let me know your thoughts.

cc: @andreyvelich @Electronic-Waste @tenzen-y

andreyvelich commented 2 months ago

This notebook requires GPUs and an AWS S3 Bucket. I am not aware if we can use GPUs in CI and what S3 Buckets we can use.

Let's post-pone the E2E for this Notebook. Maybe we can create a simple version of this example with just CPU and without S3 bucket requirement. Like how we did in this PR: https://github.com/kubeflow/training-operator/pull/2199. cc @helenxie-bit

Should do the change suggested in the issue?

Yes, just merged this PR.

train_api_s3_dataset.ipynb: I am not sure what to use as S3 bucket for this case in CI.

@kubeflow/wg-training-leads Any thoughts on how we can test the S3 in our CI ?

Due to these blockers on some of the files, I wanted to ask if the community thinks it's a good idea to raise separate PRs for each notebook? Please let me know your thoughts.

I think, just initially stepping the infrastructure and running just a single simple Notebook would be nice.