tdrussell / qlora-pipe

A pipeline parallel training script for LLMs.
MIT License
79 stars 5 forks source link

qlora-pipe

A pipeline parallel training script for LLMs.

Refer to the changelog at the bottom for details on updates.

About

This is a training script I made so that I can fine-tune LLMs using my workstation with four 4090s. It is developed first and foremost for myself, with my own use cases in mind. It is scrappy and hacked together. It will likely never be a stable, well-supported training script like Axolotl. I am open sourcing the code in case it is useful to others, and also as a proof-of-concept that this kind of thing is possible.

That being said, if something doesn't work right, or you would like it to support some feature, feel free to raise an issue and I'll try to look at it.

Features

Installing

Clone the repository:

git clone --recurse-submodules https://github.com/tdrussell/qlora-pipe

If you alread cloned it and forgot to do --recurse-submodules:

git submodule init
git submodule update

Install Miniconda: https://docs.conda.io/en/latest/miniconda.html

Create the environment

conda create -n training python=3.12
conda activate training

Install Pytorch NIGHTLY version (if before 2.4): https://pytorch.org/get-started/locally/

Until Pytorch is at 2.4, HQQ quantization requires Pytorch nightly if using Python 3.12, because of torch.compile().

Install cuda toolkit (make sure it matches the cuda version you used for Pytorch), e.g.:

conda install nvidia/label/cuda-12.1.1::cuda-toolkit

You could also try installing nvcc on its own, but installing the whole cuda toolkit was always the easiest for me.

Install packaging and ninja first, for flash-attn:

pip install packaging ninja

(Optional) Install flash-attn manually if you want to specify how many jobs are used for compiling:

MAX_JOBS=4 pip install flash-attn --no-build-isolation

Install the dependencies:

pip install -r requirements.txt

Training

Edit the config files in the examples directory to your liking. At minimum, change the paths at the top to point to your model and desired output directory. Per-device batch size and gradient accumulation steps are specified in the Deepspeed JSON config file. Everything else is in the TOML config file. Launch the training script:

NCCL_P2P_DISABLE="1" NCCL_IB_DISABLE="1" deepspeed --num_gpus=1 train.py --deepspeed --deepspeed_config examples/ds_config_7b.json --config examples/config_7b.toml

RTX 4000 series needs those 2 enviroment variables set. Other GPUs may not need them.

Parallelism

Deepspeed handles pipeline- and data-parallelism. Set the --num_gpus flag to however many GPUs to want to use. The config option pipeline_stages determines the level of model parallelism. Then, the data parallelism is automatically set so that all GPUs are used.

For example with 8 GPUs, and pipeline_stages=4, a single instance of the model is divided across 4 GPUs. Because there are 8 GPUs total, there are then 2 data-parallel instances.

The option gradient_accumulation_steps in the Deepspeed JSON config file determines the amount of pipelining when using pipeline parallelism (pipeline_stages>1). The higher the value, the more the GPUs can overlap computation. For example, with gradient_accumulation_steps=1, there is a single batch that gets passed between the GPUs forward, then in reverse for the backward pass. Only 1 GPU is active at a time, the others are idle. As gradient_accumulation_steps increases, you start pipelining multiple forward/backward batches. At the beginning and end of the step, some GPUs will always be idle. So as gradient_accumulation_steps approaches infinity, you approach 100% theoretical utilization. In practice, a value of 8 or so already gives good average utilization with 2 GPUs. With more GPUs, you may want to go higher.

Dataset configuration

There are 3 options for specifying each dataset. Set the dataset_type field to one of:

You can read dataset_utils.py for details on what each of these options is doing.

You can have multiple datasets. Just add additional [[datasets]] entries. When using multiple datasets, there are different ways to combine them.

On sample packing (or the lack thereof)

Sample packing is not currently implemented. Instead, there is the option batch_size_tokens. If this field is set, the batch size in the Deepspeed config file is ignored, and instead the batch size is adjusted dynamically to target a fixed number of tokens per batch, per device. This was easier to implement than sample packing, and does basically the same thing. It is also efficient: if I set batch_size_tokens to a modest 10000 and train a 7B model with the Alpaca dataset, all my 4090s hit their 350W power limit cap. Unless I'm missing something (definitely possible), it seems there is no need to support sample packing.

Floating point precision

There are different places you can specify the floating point dtype. model_weight_dtype controls the precision of the underlying model weights (for any weights not quantized), and lora_weight_dtype is for the lora weights. If you are using quantization, both bnb and hqq have options for the compute dtype as well.

If you are using 16 bit dtypes, floating point roundoff error is a potential problem. For a good overview of the problem and solutions, see Revisiting Bfloat16 Training. TLDR: the main source of precision error when training with 16 bit weights is the weight update step: $(p = p + \Delta p * lr)$. When the update is very small compared to the parameter (which is often the case), there can be significant roundoff error, including the update being entirely dropped. Mixed precision training solves this by keeping a master copy of the weights in fp32, and running all optimizer steps in fp32. Kahan summation is another solution when training in full bf16, that keeps an extra bf16 buffer for each parameter to accumulate roundoff errors so that updates are never dropped.

Okay but how should I configure things?

Changelog

2024-07-02