kubeflow / katib

Automated Machine Learning on Kubernetes
https://www.kubeflow.org/docs/components/katib
Apache License 2.0

Tuning API in Katib for LLMs #2291

Open andreyvelich opened 3 months ago

andreyvelich commented 3 months ago

Recently, we implemented a new `train` Python SDK API in the Kubeflow Training Operator to easily fine-tune LLMs on multiple GPUs with a predefined dataset provider, model provider, and HuggingFace trainer.

To continue our roadmap around LLMOps in Kubeflow, we want to give users the ability to tune the hyperparameters of LLMs using a simple Python SDK API: `tune`. This requires appropriate changes to the Katib Python SDK so that users can set the model, dataset, and hyperparameters they want to optimize for an LLM. We need to re-use the existing Training Operator components that we used for the `train` API: storage-initializer and trainer.

tariq-hasan commented 3 months ago

I presume that the initiative here is motivated by the recent trend in the ML space to fine-tune pre-trained models (LLMs or otherwise) on custom datasets instead of training models from scratch.

This requires enriching the interface provided to users for training and hyperparameter tuning.

Training (training-operator):

The train function takes the following arguments and is essentially an abstraction over the create_job function that enables model fine-tuning.

trainingClient.train(
    num_workers=1,
    num_procs_per_worker=1,
    resources_per_worker={"gpu": "2", "cpu": 8, "memory": "16Gi"},
    model_provider_parameters=HuggingFaceModelParams(model="hf://openchat/openchat_3.5", access_token="hf_..."),
    dataset_provider_parameters=S3DatasetParams(dataset="s3://doc-example-bucket1/train_dataset", eval_dataset="s3://doc-example-bucket1/eval_dataset", access_token="s3 access token", region="us-west-2"),
    train_parameters=HuggingFaceTrainParams(learning_rate=0.1, transformerClass="Trainer", peft_config={}),
)

Hyperparameter tuning (Katib):

Taking inspiration from the design in the training-operator, I would think that the higher-level interface in Katib would be an abstraction over the `tune` function. It would still allow users to specify parameters such as hyperparameters, algorithm name, evaluation metric, etc., but the objective function would be replaced by a model provider and a dataset provider.

katib_client.tune(
    name=exp_name,
    objective=train_mnist_model, # Objective function.
    parameters=parameters, # HyperParameters to tune.
    algorithm_name="cmaes", # Algorithm to use.
    objective_metric_name="accuracy", # Katib is going to optimize "accuracy".
    additional_metric_names=["loss"], # Katib is going to collect these metrics in addition to the objective metric.
    max_trial_count=12, # Trial Threshold.
    parallel_trial_count=2,
)

I presume then that the difference with this example implementation would just be that `train_mnist_model` is replaced with a model provider and a dataset provider that form the basis for the hyperparameter tuning.
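
For illustration, here is a hedged sketch of what such a call might look like, mirroring the `train` example above. The argument names (`model_provider_parameters`, `dataset_provider_parameters`, `train_parameters`), the import paths, and the idea of marking tunable hyperparameters with Katib's existing search API are assumptions for discussion, not a settled design:

```python
import kubeflow.katib as katib
# Provider/trainer parameter classes as used in the `train` example above;
# exact import paths and fields are assumed here, not finalized.
from kubeflow.storage_initializer.hugging_face import (
    HuggingFaceModelParams,
    HuggingFaceTrainParams,
)
from kubeflow.storage_initializer.s3 import S3DatasetParams

katib_client = katib.KatibClient()

katib_client.tune(
    name="tune-llm-example",  # illustrative Experiment name
    # The objective function is replaced by model and dataset providers.
    model_provider_parameters=HuggingFaceModelParams(
        model="hf://openchat/openchat_3.5",
        access_token="hf_...",
    ),
    dataset_provider_parameters=S3DatasetParams(
        dataset="s3://doc-example-bucket1/train_dataset",
        eval_dataset="s3://doc-example-bucket1/eval_dataset",
        access_token="s3 access token",
        region="us-west-2",
    ),
    train_parameters=HuggingFaceTrainParams(
        # Hypothetical: Katib's search API marks the hyperparameter to tune.
        learning_rate=katib.search.double(min=1e-5, max=1e-3),
        transformerClass="Trainer",
        peft_config={},
    ),
    objective_metric_name="eval_loss",  # metric Katib would optimize (assumed)
    algorithm_name="random",
    max_trial_count=12,
    parallel_trial_count=2,
)
```

Everything from `objective_metric_name` down is already supported by the existing `tune` signature, so the change would mostly be about accepting the provider parameters and generating the right trial template.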

tariq-hasan commented 3 months ago

Having worked through the Python SDK and examples for the Training Operator and Katib, I have some further ideas on an appropriate implementation of the tuning API in Katib for LLMs.

It appears that the current implementation of the `tune` API in the Katib Python SDK relies on a mandatory objective function to define the trial specification as a batch Job. The higher-level interface proposed here, however, is meant to fine-tune pre-trained models on custom datasets.
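
For context, the current `tune` implementation roughly works by capturing the objective function's source code and turning it into the command of a batch Job container, with hyperparameters injected as trial parameters. A simplified sketch of that idea (not the exact SDK code):

```python
import inspect
import textwrap


def train_mnist_model(parameters):
    # User-defined objective; Katib's metrics collector parses printed metrics.
    accuracy = 0.9  # placeholder for real training/evaluation
    print(f"accuracy={accuracy}")


# Capture the objective's source and append a call with the trial's
# hyperparameters substituted via ${trialParameters.*}; the resulting script
# becomes the command of the batch/v1 Job in the trial template.
objective_source = textwrap.dedent(inspect.getsource(train_mnist_model))
exec_script = objective_source + "\ntrain_mnist_model({'lr': '${trialParameters.lr}'})"
print(exec_script)
```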

The following are some important points to note:

Following the example implementation of a Katib experiment using PyTorchJob, we would therefore need to modify the `tune` API to take in either a combination of `objective` and `parameters`, or a combination of `model_provider_parameters`, `dataset_provider_parameters`, and `train_parameters`.

In the former case the code would default to defining a Katib experiment with a batch Job in the trial specification. In the latter case it would define a Katib experiment with a PyTorchJob in the trial specification; this PyTorchJob would define an init container and an app container for the master, and reuse the same app container for the workers.
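
A minimal sketch of that dispatch logic, assuming a hypothetical internal helper; the container images are placeholders and this is not the actual Katib SDK implementation:

```python
from typing import Any, Callable, Dict, Optional


def build_trial_spec(
    objective: Optional[Callable] = None,
    parameters: Optional[Dict[str, Any]] = None,
    model_provider_parameters: Optional[Any] = None,
    dataset_provider_parameters: Optional[Any] = None,
    train_parameters: Optional[Any] = None,
) -> Dict[str, Any]:
    """Return the trial template that tune() would embed in the Experiment."""
    if objective is not None and parameters is not None:
        # Existing path: the objective function is packaged as a batch/v1 Job
        # (details elided; see the current tune implementation).
        return {"apiVersion": "batch/v1", "kind": "Job"}

    if all(
        p is not None
        for p in (model_provider_parameters, dataset_provider_parameters, train_parameters)
    ):
        # Proposed path: a PyTorchJob whose master pod runs a storage-initializer
        # init container (downloads the model and dataset) plus the trainer app
        # container; workers reuse the same trainer container.
        trainer_container = {"name": "pytorch", "image": "<trainer-image>"}
        return {
            "apiVersion": "kubeflow.org/v1",
            "kind": "PyTorchJob",
            "spec": {
                "pytorchReplicaSpecs": {
                    "Master": {
                        "replicas": 1,
                        "template": {
                            "spec": {
                                "initContainers": [
                                    {
                                        "name": "storage-initializer",
                                        "image": "<storage-initializer-image>",
                                    }
                                ],
                                "containers": [trainer_container],
                            }
                        },
                    },
                    "Worker": {
                        "replicas": 1,
                        "template": {"spec": {"containers": [trainer_container]}},
                    },
                }
            },
        }

    raise ValueError(
        "Set either `objective` and `parameters`, or `model_provider_parameters`, "
        "`dataset_provider_parameters`, and `train_parameters`."
    )
```

Keeping the existing objective-function path as the default means current users of `tune` would be unaffected by the new provider-based path.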

andreyvelich commented 1 month ago

/assign @helenxie-bit

google-oss-prow[bot] commented 1 month ago

@andreyvelich: GitHub didn't allow me to assign the following users: helenxie-bit.

Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide

In response to [this](https://github.com/kubeflow/katib/issues/2291#issuecomment-2130241181):

> /assign @helenxie-bit

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.