kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0

KEP-2170: Create LLM training runtime for Llama 3.1 8B #2212

Open andreyvelich opened 3 months ago

andreyvelich commented 3 months ago

Related: https://github.com/kubeflow/training-operator/issues/2170

Once we implement storage initializers, trainers, and controllers, we should add the LLM training runtimes. We can start with a runtime for Llama 3.1 8B.

https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
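To make the proposal concrete, here is a minimal sketch of what such a runtime manifest could look like, assembled as a plain Python dict. This assumes the KEP-2170 design (a `ClusterTrainingRuntime` with an `mlPolicy` and JobSet-style replicated jobs for initializers and the trainer); the API group, version, and field names are illustrative assumptions, not the final API.

```python
import json


def build_llama_runtime(num_nodes: int = 1, procs_per_node: str = "auto") -> dict:
    """Assemble a hypothetical ClusterTrainingRuntime spec for Llama 3.1 8B.

    Field names follow the KEP-2170 sketch but are assumptions, not the
    actual CRD schema.
    """
    return {
        "apiVersion": "kubeflow.org/v2alpha1",  # assumed API group/version
        "kind": "ClusterTrainingRuntime",
        "metadata": {"name": "torch-tune-llama-3.1-8b"},
        "spec": {
            "mlPolicy": {
                "numNodes": num_nodes,
                "torch": {"numProcPerNode": procs_per_node},
            },
            "template": {
                "spec": {
                    "replicatedJobs": [
                        # Pulls meta-llama/Llama-3.1-8B-Instruct into shared storage.
                        {"name": "model-initializer"},
                        # Pulls the fine-tuning dataset into shared storage.
                        {"name": "dataset-initializer"},
                        # Runs the LLM trainer itself on each node.
                        {"name": "node"},
                    ]
                }
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_llama_runtime(num_nodes=2), indent=2))
```

The point of packaging this as a runtime is that users would only reference it by name and override hyperparameters, rather than writing the whole JobSet themselves.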

/area runtime

Electronic-Waste commented 1 day ago

/assign

I can help with this. Please let me know if you have different plans @kubeflow/wg-training-leads .

andreyvelich commented 1 day ago

Thank you, Shao! However, we need to work on the LLM Trainer before we add the post-training runtimes: https://github.com/kubeflow/training-operator/issues/2321

Electronic-Waste commented 1 day ago

Thanks for pointing this out, Andrey!

Shall I unassign myself, since this issue depends on #2321?

andreyvelich commented 1 day ago

If you could also help us with #2321, that would be great! We have a few ideas with @saileshd1402, but we are still investigating how we can build that Trainer to support different LLMs and datasets.
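One way such a Trainer could stay model-agnostic is to read the model and dataset references from its environment, so a single trainer image serves any LLM/dataset pair supplied by a runtime. The sketch below is a hypothetical illustration of that idea only; the variable names and config fields are assumptions, not the interface proposed in #2321.

```python
from dataclasses import dataclass


@dataclass
class TrainerConfig:
    """Everything the trainer needs, independent of a specific LLM.

    All field names here are illustrative assumptions.
    """
    model_uri: str    # e.g. "hf://meta-llama/Llama-3.1-8B-Instruct"
    dataset_uri: str  # location the dataset initializer staged the data
    lora_rank: int = 8  # hypothetical PEFT knob with a conservative default


def config_from_env(env: dict) -> TrainerConfig:
    """Build the config from environment variables injected by the runtime.

    Because the trainer never hard-codes a model, the same image can be
    reused by a Llama runtime today and other LLM runtimes later.
    """
    return TrainerConfig(
        model_uri=env["TRAINER_MODEL_URI"],
        dataset_uri=env["TRAINER_DATASET_URI"],
        lora_rank=int(env.get("TRAINER_LORA_RANK", "8")),
    )
```

For example, a Llama 3.1 8B runtime would inject `TRAINER_MODEL_URI` pointing at the staged model, while a future runtime for a different model would reuse the same entrypoint with different values.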

Electronic-Waste commented 23 hours ago

Sure, I'm glad to hear that I can help with #2321!