KEP-2170: Create model exporter for checkpointing and training output

kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes

Apache License 2.0


Open andreyvelich opened 2 months ago

andreyvelich commented 2 months ago

As discussed previously, as part of the Training V2 APIs we want to design and implement a model exporter sidecar that helps users checkpoint during distributed training and export the trained or fine-tuned model: https://github.com/kubeflow/training-operator/pull/2240#issuecomment-2321081416.

/area storage

saileshd1402 commented 1 month ago

/assign