kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0

Fine-tuning in Kubeflow Training Operator #1923

Open andreyvelich opened 11 months ago

andreyvelich commented 11 months ago

Today, in the world of large models, Data Scientists usually don't train their models from scratch but instead take existing Foundation models and fine-tune them. In the last Training WG Community call, we discussed how the Training Operator can be used to efficiently fine-tune large models on Kubernetes.

There are several challenges we can address in the Training Operator to improve this workflow.

Streamline access to training data

Usually, training a large model requires many GPUs and Workers, which means every Worker needs access to the training data before training starts. If the data is large, downloading it and converting it to, for example, a PyTorch DataLoader requires a significant amount of CPU resources. We can discuss improvements to the data transfer from the data pre-processing step (e.g. using Spark) to the training step (e.g. using PyTorch, TensorFlow, etc.).
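
As a rough illustration of one direction (only a sketch; the /mnt/data path, the shard layout, and the assumption that all columns are numeric are hypothetical), each Worker could read just its own slice of the pre-processed Parquet shards instead of downloading the whole dataset:

import glob
import os

import pandas as pd
import torch
from torch.utils.data import DataLoader, IterableDataset


class ShardedParquetDataset(IterableDataset):
    """Stream only the Parquet shards assigned to this Worker."""

    def __init__(self, data_dir, rank, world_size):
        shards = sorted(glob.glob(os.path.join(data_dir, "*.parquet")))
        # Each Worker keeps every world_size-th shard, so no Worker reads the full dataset.
        self.shards = shards[rank::world_size]

    def __iter__(self):
        for shard in self.shards:
            df = pd.read_parquet(shard)  # requires pyarrow or fastparquet
            for row in df.itertuples(index=False):
                yield torch.tensor(list(row), dtype=torch.float32)


# RANK and WORLD_SIZE are injected by the Training Operator for PyTorchJob Workers.
dataset = ShardedParquetDataset(
    data_dir="/mnt/data",  # assumed pre-staged or shared location
    rank=int(os.environ.get("RANK", "0")),
    world_size=int(os.environ.get("WORLD_SIZE", "1")),
)
loader = DataLoader(dataset, batch_size=32)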

Optimize model download

Before training starts, we need to download the model on every Worker. We could think about how to reduce the cost and resources of this operation.
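
One possible direction, purely as a sketch (it assumes a shared ReadWriteMany volume mounted at /mnt/models on every Worker, which is not an existing Training Operator feature), is to have rank 0 download the model once and let the other Workers reuse the shared copy:

import os
import time

from huggingface_hub import snapshot_download

SHARED_DIR = "/mnt/models/llama2"  # assumed shared volume mounted on all Workers
DONE_MARKER = os.path.join(SHARED_DIR, ".download_complete")


def ensure_model(repo_id):
    """Download the model once (rank 0) and reuse it on all other Workers."""
    rank = int(os.environ.get("RANK", "0"))
    if rank == 0:
        # Only the first Worker talks to the model hub.
        snapshot_download(repo_id=repo_id, local_dir=SHARED_DIR)
        open(DONE_MARKER, "w").close()
    else:
        # Other Workers wait for the shared copy instead of re-downloading it.
        while not os.path.exists(DONE_MARKER):
            time.sleep(10)
    return SHARED_DIR


model_path = ensure_model("meta-llama/Llama-2-7b-hf")  # example model id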

Quick access to Foundation models

We can build abstractions on top of the HuggingFace Transformers APIs to give users a quick way to fine-tune Foundation models on Kubernetes using the Training Operator SDK. For example:

# Proposed SDK API: fine_tune() does not exist yet, and the values below are placeholders.
from kubeflow.training import TrainingClient

TrainingClient().fine_tune(
  model="LLama2",
  dataset="s3://...",
)

The SDK would then generate the appropriate training script and pass it to the Job's container arguments using the HuggingFace APIs.

Avoid Overfitting

Sometimes a model can be overtrained, which means its accuracy decreases and it may forget some features. That is especially important when you want to deploy the model that produces the best results. We can address this issue by using EarlyStopping techniques, similar to what we do in Katib.
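
For reference, a generic early-stopping loop could look like the sketch below. This is only an illustration of the idea, not Katib's implementation; train_one_epoch and evaluate are hypothetical helpers supplied by the user:

def fine_tune_with_early_stopping(model, train_one_epoch, evaluate,
                                  max_epochs=50, patience=3):
    """Stop fine-tuning when validation loss has not improved for `patience` epochs."""
    best_loss = float("inf")
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(model)      # hypothetical training step
        val_loss = evaluate(model)  # hypothetical validation step

        if val_loss < best_loss:
            best_loss = val_loss
            epochs_without_improvement = 0
            # A real implementation would also checkpoint the best model here.
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                print(f"Early stopping at epoch {epoch}, best val loss {best_loss:.4f}")
                break
    return model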

Using MPI/All-reduce style of distributed training

We need to benchmark whether the all-reduce style of distributed training produces better results for training large models. In that case, the MPI Operator could be a good candidate to investigate. In addition, we can explore other distributed techniques that improve training performance.
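
For context, the all-reduce communication pattern in question can be sketched with plain torch.distributed as follows (this only illustrates the pattern; it is not a statement about how the MPI Operator is used):

import torch
import torch.distributed as dist


def average_gradients(model):
    """All-reduce style gradient averaging across Workers."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Every Worker contributes its gradient; each ends up with the sum.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size


# Typical per-Worker setup (env vars are injected by the launcher/operator):
# dist.init_process_group(backend="nccl")  # or "mpi" when launched with MPI
# loss.backward()
# average_gradients(model)
# optimizer.step()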

Feedback from Training Operator users

We want to hear feedback from Training Operator users about what features they would like to see for training their large models. Please share your ideas, suggestions, and feature requests on this topic.

cc @kubeflow/wg-training-leads @tenzen-y @kuizhiqing

johnugeorge commented 11 months ago

I will create a proposal in a couple of weeks regarding the new API to be supported.

tenzen-y commented 11 months ago

@johnugeorge @andreyvelich I'm not sure why we need to support this feature. I think we can realize it with the existing features by using Kubeflow Pipelines. Maybe we can construct the following pipeline:

  1. Download Model to PVC.
  2. Do any pre-processing on the downloaded model.
  3. Start fine-tuning using the training-operator with PVC.

So, I think the role of the training-operator would conflict with pipelines. What is the difference between using pipelines and this new training-operator feature?
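
For concreteness, such a pipeline could be sketched with the Kubeflow Pipelines v2 SDK roughly like this (component bodies, PVC mounting, and all names are illustrative only):

from kfp import dsl


@dsl.component
def download_model(model_uri: str, pvc_path: str):
    # Fetch the Foundation model and store it on the shared PVC.
    ...


@dsl.component
def preprocess_model(pvc_path: str):
    # Any pre-processing of the downloaded model (conversion, sharding, etc.).
    ...


@dsl.component
def launch_fine_tuning(pvc_path: str):
    # Create a PyTorchJob via the training-operator SDK, mounting the same PVC.
    ...


@dsl.pipeline(name="fine-tune-with-pvc")
def fine_tune_pipeline(model_uri: str):
    step1 = download_model(model_uri=model_uri, pvc_path="/mnt/model")
    step2 = preprocess_model(pvc_path="/mnt/model").after(step1)
    launch_fine_tuning(pvc_path="/mnt/model").after(step2)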

johnugeorge commented 11 months ago

@tenzen-y The point discussed is about better data processing for the training framework rather than the infra provisioning done by pipelines. Both are complementary, in my view.

tenzen-y commented 11 months ago

> @tenzen-y The point discussed is about better data processing for the training framework rather than the infra provisioning done by pipelines. Both are complementary, in my view.

I synced my thoughts with @johnugeorge offline. So, I agreed to support this feature on the training-operator side by expanding our SDK.

andreyvelich commented 11 months ago

Just to add to my point about "Streamline access to training data": I think we need to discuss various capabilities for accessing data on Training Workers from the Data Preparation step (e.g. using Spark). Sometimes a PVC might not be enough, since it would need to support the ReadWriteMany access mode to be read by multiple microservices (e.g. Workers). For example, we can investigate how PyArrow can help us on Kubernetes to get data from Spark DataFrames. Some additional resources can be found in this talk: https://pt.slideshare.net/databricks/simplify-data-conversion-from-spark-to-tensorflow-and-pytorch
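
As a small illustration of the PyArrow idea (the path and column names are made up, and it assumes the Spark step wrote its DataFrame out as Parquet to storage the Workers can reach):

import pyarrow.dataset as ds
import torch

# The Spark pre-processing step is assumed to have written its DataFrame as Parquet,
# e.g. spark_df.write.parquet("s3://bucket/preprocessed/").
dataset = ds.dataset("/mnt/data/preprocessed", format="parquet")

# Read only the columns the training step actually needs.
table = dataset.to_table(columns=["features", "label"])

features = torch.tensor(table.column("features").to_pylist(), dtype=torch.float32)
labels = torch.tensor(table.column("label").to_pylist(), dtype=torch.long)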

github-actions[bot] commented 8 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

tenzen-y commented 8 months ago

/remove-lifecycle stale

tenzen-y commented 8 months ago

/remove-help

tenzen-y commented 8 months ago

/assign @johnugeorge @deepanker13

google-oss-prow[bot] commented 8 months ago

@tenzen-y: GitHub didn't allow me to assign the following users: deepanker13.

Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide

In response to [this](https://github.com/kubeflow/training-operator/issues/1923#issuecomment-1883769982):

> /assign @johnugeorge @deepanker13

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

github-actions[bot] commented 5 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

tenzen-y commented 5 months ago

/remove-lifecycle stale

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

andreyvelich commented 2 months ago

/lifecycle frozen