andreyvelich opened 11 months ago
I will create a proposal in a couple of weeks regarding the new API to be supported.
@johnugeorge @andreyvelich I'm not sure why we need to support this feature. I think we can realize the existing features using Kubeflow Pipelines. Maybe we can construct the following pipeline:
So, I think the role of the training-operator would conflict with pipelines. What is the difference between using pipelines and this new training-operator feature?
@tenzen-y the point discussed is about better data processing for the training framework rather than the infra provisioning done by Pipelines. Both are complementary, in my view.
I synced my thoughts with @johnugeorge offline, and I agreed to support this feature on the training-operator side by expanding our SDK.
Just to add to my point about "Streamline access to training data": I think we need to discuss various ways for Training Workers to access data produced by the data-preparation step (e.g. using Spark). Sometimes a PVC might not be enough, since it has to support the ReadWriteMany access mode to be readable from multiple microservices (e.g. Workers).
For example, we can investigate how PyArrow can help us in Kubernetes to get data from Spark DataFrames.
Also, some additional resources can be found in this talk: https://pt.slideshare.net/databricks/simplify-data-conversion-from-spark-to-tensorflow-and-pytorch
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/remove-lifecycle stale
/remove-help
/assign @johnugeorge @deepanker13
@tenzen-y: GitHub didn't allow me to assign the following users: deepanker13.
Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide
/remove-lifecycle stale
/lifecycle frozen
Today, in the world of large models, Data Scientists usually don't train their models from scratch but fine-tune existing Foundation Models. In the last Training WG Community call, we discussed how the Training Operator can be used to efficiently fine-tune large models on Kubernetes.
There are several challenges that we can address in Training Operator to improve it.
Streamline access to training data
Usually, training a large model requires many GPUs and Workers, which means every Worker needs access to the training data before training starts. If the data is large, downloading it and converting it to, for example, a PyTorch DataLoader takes a significant amount of CPU resources. We can discuss improvements to the data transfer from the data pre-processing step (e.g. using Spark) to the training step (e.g. using PyTorch, TensorFlow, etc.).
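One simple pattern worth discussing (a sketch, not anything the Training Operator implements today) is to have each Worker read only its shard of the prepared dataset, selected by rank, so no single Worker has to download everything. The env var names and file list below are illustrative assumptions:

```python
# Hypothetical sketch: round-robin sharding of prepared data files across
# Workers, so each Worker downloads only its portion of the dataset.
# RANK/WORLD_SIZE are the usual distributed-training env vars.
import os

def shard_for_worker(files, rank, world_size):
    """Assign every world_size-th file to this worker, starting at its rank."""
    return [f for i, f in enumerate(files) if i % world_size == rank]

files = [f"part-{i:05d}.parquet" for i in range(8)]  # illustrative file list
rank = int(os.environ.get("RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "4"))

print(shard_for_worker(files, rank, world_size))
```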
Optimize model download
Before training starts, we need to download the model on every Worker. We could think about how to reduce the cost and resources of this operation.
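One possible direction (a sketch under the assumption that Workers share a ReadWriteMany volume; `download_model` is a stand-in for a real downloader such as a HuggingFace Hub snapshot download) is to let a single Worker download the model once while the others wait for a marker file:

```python
# Hypothetical sketch: rank 0 downloads the model to a shared volume once;
# other Workers wait for a ".ready" marker instead of downloading again.
# `download_model` is a stand-in for a real download function.
import os
import time

def ensure_model(cache_dir, rank, download_model, timeout=600):
    marker = os.path.join(cache_dir, ".ready")
    if rank == 0:
        if not os.path.exists(marker):
            download_model(cache_dir)      # expensive network download, once
            open(marker, "w").close()      # signal completion to other ranks
    else:
        deadline = time.time() + timeout
        while not os.path.exists(marker):  # wait for rank 0 to finish
            if time.time() > deadline:
                raise TimeoutError("model download did not finish in time")
            time.sleep(1)
    return cache_dir
```

This trades N downloads for one download plus shared-storage reads; whether that is actually cheaper depends on the storage backend and would need benchmarking.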
Quick access to Foundation models
We can build abstractions on top of the HuggingFace Transformers APIs to give users quick access to fine-tune foundation models on Kubernetes using the Training Operator SDK. The SDK would then generate the appropriate script for the Job's container arguments using the HuggingFace APIs.
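A hypothetical shape for such an SDK call might look like the sketch below. None of this is an existing Training Operator API: the `train` function, the `FineTuneJob` class, and the generated arguments are all illustrative assumptions:

```python
# Hypothetical sketch of a high-level fine-tuning SDK call. The function
# name, parameters, and generated container arguments are assumptions,
# not an existing Training Operator API.
from dataclasses import dataclass, field

@dataclass
class FineTuneJob:
    model_uri: str          # e.g. a HuggingFace model id
    dataset_uri: str        # e.g. a PVC/S3 path from data preparation
    num_workers: int
    container_args: list = field(default_factory=list)

def train(model_uri, dataset_uri, num_workers=1):
    """Build the container arguments a training Worker would run."""
    args = [
        "--model_name_or_path", model_uri,
        "--dataset_path", dataset_uri,
    ]
    return FineTuneJob(model_uri, dataset_uri, num_workers, args)

job = train("bert-base-cased", "pvc://datasets/imdb", num_workers=4)
print(job.container_args)
```

The point of such an abstraction is that users state *what* to fine-tune, and the SDK translates that into the PyTorchJob spec and HuggingFace training script details.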
Avoid Overfitting
Sometimes a model can be overtrained, which means its accuracy decreases and it can forget some features. This is especially important when you want to deploy the model that produces the best results. We can address this with EarlyStopping techniques, similar to what we do in Katib.
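A minimal sketch of the idea, patience-based early stopping similar in spirit to Katib's early-stopping algorithms (the validation losses here are simulated, not from a real run):

```python
# Minimal sketch of patience-based early stopping: stop once the validation
# loss has not improved for `patience` consecutive epochs.
def early_stop_epoch(val_losses, patience=3):
    """Return the epoch (index) at which training should stop."""
    best = float("inf")
    since_best = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            since_best = 0
        else:
            since_best += 1
            if since_best >= patience:
                return epoch
    return len(val_losses) - 1

# Loss improves, then plateaus: stop 3 epochs after the last improvement.
losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.64]
print(early_stop_epoch(losses))  # 5
```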
Using MPI/All-reduce style of distributed training
We need to benchmark whether the all-reduce style of distributed training produces better results for training large models. The MPI Operator could be a good candidate to investigate here. In addition, we can explore other distributed techniques that improve training performance.
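To illustrate what all-reduce computes (a conceptual sketch only; real training would use MPI or NCCL collectives, not a Python loop): every worker contributes its local gradient, and all workers receive the identical aggregated result, e.g. the element-wise mean:

```python
# Conceptual sketch of an all-reduce step: combine every worker's local
# gradient vector (here: element-wise mean) and give the identical result
# back to all workers. Real systems use MPI/NCCL for this, not a loop.
def all_reduce_mean(worker_grads):
    n = len(worker_grads)
    summed = [sum(vals) for vals in zip(*worker_grads)]
    mean = [s / n for s in summed]
    return [mean[:] for _ in range(n)]  # every worker gets the same vector

grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 workers, 2 parameters
print(all_reduce_mean(grads))  # [[3.0, 4.0], [3.0, 4.0], [3.0, 4.0]]
```

Ring implementations of this operation keep per-worker bandwidth roughly constant as the worker count grows, which is why it is the default for data-parallel training.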
Feedback from Training Operator users
We want to hear feedback from Training Operator users about what features they would like to see for training their large models. Please share your ideas, suggestions, and feature requests on this topic.
cc @kubeflow/wg-training-leads @tenzen-y @kuizhiqing