kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0
1.58k stars 688 forks

Export Fine-Tuned LLM after Trainer is Complete #2101

Open andreyvelich opened 5 months ago

andreyvelich commented 5 months ago

We discussed here: https://github.com/kubeflow/website/pull/3718#issuecomment-2096619898 that our LLM Trainer doesn't export the fine-tuned model. So users can't reuse that model for inference or other purposes.

We should discuss how users can get the fine-tuned artifact after the LLM Trainer completes. /cc @kubeflow/wg-training-leads @deepanker13

Would be nice to see integration with Kubeflow Model Registry as well. cc @kubeflow/wg-data-leads

tarilabs commented 5 months ago

> Would be nice to see integration with Kubeflow Model Registry as well. cc @kubeflow/wg-data-leads

If there is a tutorial for the part specific to this project that shows the metadata we want to capture in Model Registry, I would be very happy to complement that example by indexing that metadata in MR! 🚀👍

StefanoFioravanzo commented 5 months ago

@andreyvelich I may have misunderstood the initial context of this API because I was under the impression that you could serve the model once fine-tuned. Can you elaborate on this?

> So users can't reuse that model for inference or other purposes.

andreyvelich commented 5 months ago

> @andreyvelich I may have misunderstood the initial context of this API because I was under the impression that you could serve the model once fine-tuned. Can you elaborate on this?
>
> So users can't reuse that model for inference or other purposes.

I think right now the only way is to set output_dir so that model checkpoints are saved. In that case, users can get the model from the PVC that we attach to the PyTorchJob, like in this example: https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/language-modeling/train_api_hf_dataset.ipynb Right, @johnugeorge @deepanker13?
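
To make the output_dir route a bit more concrete, here is a minimal sketch of how a user might locate the newest checkpoint once the PVC is mounted somewhere readable. It assumes the Hugging Face Trainer's usual `checkpoint-<step>` folder naming under output_dir. The helper name `latest_checkpoint` is hypothetical, and a temporary directory stands in for the real PVC mount path, which depends on the job setup.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional

# Assumption: checkpoints are saved as "checkpoint-<global_step>" folders,
# which is the Hugging Face Trainer's default naming.
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the checkpoint folder with the highest step, or None if absent.

    Hypothetical helper; steps are compared numerically so that
    checkpoint-1000 ranks above checkpoint-90.
    """
    root = Path(output_dir)
    if not root.is_dir():
        return None
    best_step, best_path = -1, None
    for child in root.iterdir():
        match = _CHECKPOINT_RE.match(child.name)
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path


# Demo: a temporary directory stands in for the mounted PVC path.
with tempfile.TemporaryDirectory() as tmp:
    for step in (500, 1000, 90):
        (Path(tmp) / f"checkpoint-{step}").mkdir()
    demo_result = latest_checkpoint(tmp).name

print(demo_result)
```

The returned folder could then be passed to a `from_pretrained` call for inference, assuming the checkpoint holds full model weights rather than only adapter weights. If transformers is installed, `transformers.trainer_utils.get_last_checkpoint` covers the same lookup.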

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

andreyvelich commented 2 months ago

/remove-lifecycle stale

tarilabs commented 2 months ago

Per https://github.com/kubeflow/training-operator/issues/2101#issuecomment-2097204327, is there a tutorial or demo about this, please?

I would be very happy to integrate a demo/blueprint for the documentation; I just need a "seed" to get started on the training operator :) Thanks!