kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0

KEP-2170: Design Trainer for the LLM Runtimes #2321

Open andreyvelich opened 2 weeks ago

andreyvelich commented 2 weeks ago

As part of the Kubeflow Training V2 work, we should design and implement a custom Trainer to fine-tune the LLMs that we plan to support via TrainingRuntimes in Kubeflow upstream.

We should discuss whether we should use native PyTorch APIs or HuggingFace Transformers in the LLM Trainer implementation.

The Trainer should allow users to configure LoRA, QLoRA, FSDP, and other important settings.
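To make the discussion concrete, here is a minimal sketch of what such a configuration surface could look like. It is purely illustrative and not a proposed API: `LLMTrainerConfig`, `LoraOptions`, `FsdpOptions`, and all of their fields are hypothetical names, and the sharding strategy strings are assumed to mirror the member names of PyTorch's `ShardingStrategy` enum. It uses only the Python standard library and does not call PyTorch or Transformers.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional

# Assumption: these strings mirror PyTorch FSDP ShardingStrategy member names.
ALLOWED_SHARDING = {"FULL_SHARD", "SHARD_GRAD_OP", "NO_SHARD", "HYBRID_SHARD"}


@dataclass
class LoraOptions:
    """Hypothetical LoRA settings; field names are illustrative only."""
    rank: int = 8
    alpha: int = 16
    dropout: float = 0.05
    target_modules: List[str] = field(default_factory=lambda: ["q_proj", "v_proj"])
    # Hypothetical switch: True would stand for a QLoRA-style 4-bit base model.
    quantize_4bit: bool = False


@dataclass
class FsdpOptions:
    """Hypothetical FSDP settings."""
    sharding_strategy: str = "FULL_SHARD"
    cpu_offload: bool = False


@dataclass
class LLMTrainerConfig:
    """Hypothetical top-level config a user might pass to the LLM Trainer."""
    model_name: str
    lora: Optional[LoraOptions] = None
    fsdp: Optional[FsdpOptions] = None

    def validate(self) -> None:
        """Reject obviously invalid values before any training starts."""
        if self.lora is not None:
            if self.lora.rank <= 0:
                raise ValueError("LoRA rank must be positive")
            if not 0.0 <= self.lora.dropout < 1.0:
                raise ValueError("LoRA dropout must be in [0, 1)")
        if self.fsdp is not None:
            if self.fsdp.sharding_strategy not in ALLOWED_SHARDING:
                raise ValueError(
                    f"unknown sharding strategy: {self.fsdp.sharding_strategy}"
                )


if __name__ == "__main__":
    # Example: QLoRA-style fine-tuning with fully sharded parameters.
    config = LLMTrainerConfig(
        model_name="example-model",  # placeholder, not a real model ID
        lora=LoraOptions(quantize_4bit=True),
        fsdp=FsdpOptions(),
    )
    config.validate()
    print(asdict(config))
```

Whether these options end up mapped onto native PyTorch APIs or HuggingFace Transformers is exactly the open question in this issue; the sketch takes no position on that.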

Useful resources:

Part of: https://github.com/kubeflow/training-operator/issues/2170

cc @saileshd1402 @deepanker13 @kubeflow/wg-training-leads

Love this feature?

Give it a 👍. We prioritize the features with the most 👍.

andreyvelich commented 1 week ago

/assign @saileshd1402

We are experimenting with some PyTorch-native and Transformers APIs to design this Trainer.

google-oss-prow[bot] commented 1 week ago

@andreyvelich: GitHub didn't allow me to assign the following users: saileshd1402.

Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide.

In response to [this](https://github.com/kubeflow/training-operator/issues/2321#issuecomment-2465896247):

>/assign @saileshd1402
>
>We are experimenting with some PyTorch-native and Transformers APIs to design this Trainer.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

saileshd1402 commented 1 week ago

/assign