Add more AI/ML Training Examples

andreyvelich commented 8 months ago

As we discussed previously (https://github.com/kubeflow/training-operator/pull/2021#issuecomment-1987733922), we want to add more AI/ML examples to the Kubeflow Training Operator. Right now, most of our examples are very basic CNN training on MNIST. Since the Training Operator is capable of training large-scale ML models, we would like to contribute more AI/ML use cases.

We can make these examples data-scientist-friendly and reuse our Python SDK within Jupyter Notebooks to simplify job submission for users. I like the example structure of HF Transformers, so I propose the following path: examples/<framework>/<ml-use-case>
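To make the proposed layout concrete, here is a minimal sketch of how a framework and use-case name could map to a directory. The helper name `example_dir` and the lowercase, hyphen-separated naming are assumptions for illustration only; the issue does not fix a naming convention.

```python
from pathlib import PurePosixPath


def example_dir(framework: str, use_case: str) -> PurePosixPath:
    """Build examples/<framework>/<ml-use-case> from human-readable names.

    Assumption: directory names are lowercase with words joined by hyphens.
    """

    def slug(name: str) -> str:
        # "Language Modeling" -> "language-modeling"
        return "-".join(name.lower().split())

    return PurePosixPath("examples") / slug(framework) / slug(use_case)


print(example_dir("PyTorch", "Language Modeling"))
# → examples/pytorch/language-modeling
```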

We can start with these examples (feel free to add more ML use-cases in this issue):

[x] Language Modeling
[x] Image Classification
[x] Text Classification
[ ] Audio Classification
[ ] Question Answering
[ ] Speech Recognition
[ ] Text Generation
[x] FSDP Example with PyTorch

We should investigate how to configure our CI/CD to make sure that these examples are functional.
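One possible starting point for such a check is a cheap smoke test that runs without a cluster: find every notebook under examples/<framework>/<ml-use-case> and verify that it parses and that each code cell is syntactically valid Python. This is only a sketch. The function names are hypothetical, the glob pattern assumes notebooks sit directly in the use-case directory, and it does not prove that a training job actually runs, which would still need an end-to-end test against a real cluster.

```python
import json
from pathlib import Path


def check_notebook(path: Path) -> list[str]:
    """Return a list of problems found in one notebook (empty if none)."""
    try:
        nb = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError) as exc:
        return [f"{path}: cannot be read as JSON ({exc})"]

    cells = nb.get("cells", []) if isinstance(nb, dict) else []
    code_cells = [c for c in cells if c.get("cell_type") == "code"]
    if not code_cells:
        return [f"{path}: no code cells found"]

    problems = []
    for index, cell in enumerate(code_cells):
        source = cell.get("source", "")
        if isinstance(source, list):  # nbformat stores source as a list of lines
            source = "".join(source)
        lines = []
        for line in source.splitlines():
            stripped = line.lstrip()
            if stripped.startswith(("%", "!")):
                # IPython magics and shell escapes are not plain Python;
                # swap them for `pass` so indentation stays valid.
                line = line[: len(line) - len(stripped)] + "pass"
            lines.append(line)
        try:
            compile("\n".join(lines), f"{path}[code cell {index}]", "exec")
        except SyntaxError as exc:
            problems.append(f"{path}: code cell {index}: {exc.msg}")
    return problems


def check_examples(root: Path) -> list[str]:
    """Check every notebook at <root>/<framework>/<ml-use-case>/*.ipynb."""
    problems = []
    for notebook in sorted(root.glob("*/*/*.ipynb")):
        problems.extend(check_notebook(notebook))
    return problems
```

A CI job could call `check_examples(Path("examples"))` and fail the build when the returned list is non-empty.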

cc @kuizhiqing @johnugeorge @tenzen-y @kubeflow/wg-training-leads

/help /good-first-issue /area example

google-oss-prow[bot] commented 8 months ago

@andreyvelich: This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-good-first-issue command.

In response to [this](https://github.com/kubeflow/training-operator/issues/2040):

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

xr-dev-saurabh commented 8 months ago

/assign

StefanoFioravanzo commented 7 months ago

@andreyvelich I love this. A few thoughts:

Whenever we publish a new example, please reach out to me or Amber so that we can help turn it into a short blog post or at least share it on social media.
How do you define the actual use case for each of these topics?
Are these examples supposed to be specific to the Training Operator, or were you thinking of wider applicability (serving, tuning, metadata, etc.)?

github-actions[bot] commented 4 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

andreyvelich commented 4 months ago

/remove-lifecycle stale

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

andreyvelich commented 1 month ago

/remove-lifecycle stale

snax-07 commented 2 weeks ago

/assign
