kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0

Training Operator ROADMAP 2024 #2259

Open andreyvelich opened 2 months ago

andreyvelich commented 2 months ago

We should update the Training Operator ROADMAP with 2024 work items.

Let's discuss it during the upcoming Training WG calls. Some initial ideas:

cc @kubeflow/wg-training-leads @franciscojavierarceo @alculquicondor @kannon92 @mimowo @ahg-g @kuizhiqing @Syulin7 @shravan-achar @akshaychitneni @StefanoFioravanzo @vsoch @helenxie-bit @Electronic-Waste

franciscojavierarceo commented 2 months ago

This is awesome @andreyvelich!! Can't wait! 🚀

StefanoFioravanzo commented 2 months ago

@andreyvelich this is an awesome list!

Would it be possible to draft a user journey map and a value proposition for each of these initiatives? I'm thinking of an umbrella issue for each project that presents it to users, similar to what we wrote for the LLM APIs here: https://www.kubeflow.org/docs/components/training/explanation/fine-tuning/

Doing this before design and implementation helps us ground the value prop and provides a guideline for the expected result.

vsoch commented 2 months ago

This is fantastic work, @andreyvelich! I'll be here along the way to provide the HPC perspective, if needed.

franciscojavierarceo commented 2 months ago

Linking this issue for reference: https://github.com/kubeflow/training-operator/issues/2231

tenzen-y commented 1 month ago

Kubeflow Training JobPipeline Framework Design Brief: https://docs.google.com/document/d/1MChKfzrKAeFRtYqypFbMXL6ZIc_OgijjkvbqmwRV-64/edit#heading=h.n3xbuhg2e3vt