kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0
1.62k stars, 700 forks

KEP-2170: Create model exporter for checkpointing and training output #2245

Open andreyvelich opened 2 months ago

andreyvelich commented 2 months ago

As discussed previously, as part of the Training V2 APIs we want to design and implement a model exporter sidecar that helps users checkpoint during distributed training and export the trained or fine-tuned model: https://github.com/kubeflow/training-operator/pull/2240#issuecomment-2321081416.

/area storage

saileshd1402 commented 1 month ago

/assign