shm007g / LLaMA-Cult-and-More

Large Language Models for All, 🦙 Cult and More, Stay in touch !
https://shm007g.github.io/LLaMA-Cult-and-More/
MIT License

parallel training and param efficient #4

Open shm007g opened 1 year ago

shm007g commented 1 year ago

Typology of Efficient Training

shm007g commented 1 year ago

Data and Model Parallel

[Figure: techniques for training large neural networks, from https://openai.com/research/techniques-for-training-large-neural-networks]

Start from here

torch-model-parallel-tutorial: speeds up training by ~50% using Pipeline Parallelism, since it fixes the GPU idling problem of Naive Model Parallel (Vertical).

[Figure: ~50% speedup with Pipeline Parallel]

First, split the model into 2 parts across 2 GPUs. Each GPU does only its own part of the work, and one GPU waits while the other finishes its batch. This leaves a lot of idle GPU time. This is called Naive Model Parallel (Vertical).

[Code figure: naive model parallel]
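A minimal sketch of what the naive split looks like in PyTorch (the toy two-part model and device names are assumptions, not the tutorial's exact code):

```python
import torch
import torch.nn as nn

# Naive Model Parallel (Vertical): place half of the model on each GPU.
# While cuda:0 computes its half of the forward pass, cuda:1 sits idle, and vice versa.
class NaiveMPModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(4096, 1024)).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # hand the activation to the second GPU; cuda:0 now idles until the next batch
        return self.part2(x.to("cuda:1"))
```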

Then, we split the whole batch into smaller batch-splits. As soon as GPU0 finishes one small batch-split, GPU1 can start working on it while GPU0 moves on to the next. Splitting the big batch into small ones reduces the overall idle GPU time and raises the training speed.

[Code figure: pipeline parallel]

We call this upgrade Pipeline Parallel. [Figure: pipeline parallel with a smaller bubble time]
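A rough sketch of the pipelined forward pass, assuming the NaiveMPModel above and two GPUs (split_size is the tunable micro-batch knob, not a value from the tutorial):

```python
import torch

def pipelined_forward(model, x, split_size=8):
    # Chunk the batch into micro-batches so that cuda:1 can work on micro-batch i
    # while cuda:0 already starts on micro-batch i+1.
    splits = iter(x.split(split_size, dim=0))
    s_next = next(splits)
    s_prev = model.part1(s_next.to("cuda:0")).to("cuda:1")
    outputs = []
    for s_next in splits:
        outputs.append(model.part2(s_prev))                      # stage 2 on cuda:1
        s_prev = model.part1(s_next.to("cuda:0")).to("cuda:1")   # stage 1 overlaps on cuda:0
    outputs.append(model.part2(s_prev))
    return torch.cat(outputs, dim=0)
```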

Model Parallel Review

huggingface-model-parallelism (new): in modern machine learning, the various approaches to parallelism are used to:

  1. fit very large models onto limited hardware - e.g. t5-11b is 45GB in just model params
  2. significantly speed up training - finish training that would take a year in hours

Concepts:

  1. DataParallel (DP) - the same setup is replicated multiple times, and each is fed a slice of the data. The processing is done in parallel and all setups are synchronized at the end of each training step.
  2. TensorParallel (TP) - each tensor is split up into multiple chunks, so instead of having the whole tensor reside on a single GPU, each shard of the tensor resides on its designated GPU. During processing each shard gets processed separately and in parallel on different GPUs, and the results are synced at the end of the step. This is what one may call horizontal parallelism, as the splitting happens on the horizontal level.
  3. PipelineParallel (PP) - the model is split up vertically (layer-level) across multiple GPUs, so that only one or several layers of the model are placed on a single GPU. Each GPU processes a different stage of the pipeline in parallel, working on a small chunk of the batch.
  4. Zero Redundancy Optimizer (ZeRO) - also performs sharding of the tensors somewhat similar to TP, except the whole tensor gets reconstructed in time for a forward or backward computation, therefore the model doesn't need to be modified. It also supports various offloading techniques to compensate for limited GPU memory.
  5. Sharded DDP - another name for the foundational ZeRO concept as used by various other implementations of ZeRO.

DataParallel (DP)

Built-in feature of PyTorch (rank, torchrun): torch-parallel-training, torch-ddp-youtube

DP vs DDP: https://huggingface.co/docs/transformers/v4.28.1/en/perf_train_gpu_many#dp-vs-ddp

Sync required.

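A minimal DDP sketch (toy model and data; assumed to be launched with `torchrun --nproc_per_node=N train.py`): each process drives one GPU, keeps a full model replica, and gradients are all-reduced automatically during backward.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for every process
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).to(local_rank)
    model = DDP(model, device_ids=[local_rank])    # full replica per GPU
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    x = torch.randn(32, 1024, device=local_rank)   # each rank would get its own data slice
    loss = model(x).pow(2).mean()
    loss.backward()                                # gradient all-reduce happens here
    opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```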

Naive Model Parallel (Vertical) and Pipeline Parallel

As in the tutorial before, when a model is too big to fit on one GPU, it is easy to slice it into several parts vertically. Problem: at any given moment all but one GPU sit idle.

Pipeline Parallel (PP) is almost identical to a naive MP, but it solves the GPU idling problem, by chunking the incoming batch into micro-batches and artificially creating a pipeline, which allows different GPUs to concurrently participate in the computation process.

[Figure: pipeline parallel bubble]

PP introduces a new hyper-parameter to tune, chunks, which defines how many chunks of data are sent in a sequence through the same pipe stage. (PyTorch uses chunks, whereas DeepSpeed refers to the same hyper-parameter as GAS.)

Because of the chunks, PP introduces the concept of micro-batches (MBS). DP splits the global data batch size into mini-batches, so if you have a DP degree of 4, a global batch size of 1024 gets split up into 4 mini-batches of 256 each (1024/4). And if the number of chunks (or GAS) is 32 we end up with a micro-batch size of 8 (256/32). Each Pipeline stage works with a single micro-batch at a time.
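The batch-size arithmetic from the paragraph above, spelled out (numbers are the example's):

```python
global_batch_size = 1024
dp_degree = 4        # number of data-parallel replicas
chunks = 32          # PP chunks (DeepSpeed calls this GAS)

mini_batch_size = global_batch_size // dp_degree   # 256 per DP replica
micro_batch_size = mini_batch_size // chunks       # 8 per pipeline stage at a time
print(mini_batch_size, micro_batch_size)           # 256 8
```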

Problems with traditional Pipeline API solutions:

Supported in PyTorch, FairScale (FSDP), DeepSpeed, etc.

DeepSpeed, Varuna and SageMaker use the concept of an Interleaved Pipeline


Tensor Parallel (TP)

In Tensor Parallelism each GPU processes only a slice of a tensor and only aggregates the full tensor for operations that require the whole thing.

The main building block of any transformer is a fully connected layer followed by a nonlinear activation. If we look at the computation in matrix form, it's easy to see how the matrix multiplication can be split between multiple GPUs:


Using this principle, we can update an MLP of arbitrary depth, without the need for any synchronization between GPUs until the very end, where we need to reconstruct the output vector from shards.

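A toy, single-process sketch of that split (shapes and shard count are arbitrary; real TP would place each shard on its own GPU and replace the final sum with an all-reduce):

```python
import torch
import torch.nn.functional as F

def tp_mlp(x, A, B, n_shards=2):
    # Column-parallel: shard A along its output dim; each shard computes GeLU(x @ A_i).
    A_shards = A.chunk(n_shards, dim=1)
    # Row-parallel: shard B along its input dim; partial results are summed at the end
    # (the only synchronization point, an all-reduce in the multi-GPU case).
    B_shards = B.chunk(n_shards, dim=0)
    partials = [F.gelu(x @ A_i) @ B_i for A_i, B_i in zip(A_shards, B_shards)]
    return sum(partials)

x = torch.randn(4, 16)
A = torch.randn(16, 64)
B = torch.randn(64, 16)
reference = F.gelu(x @ A) @ B
assert torch.allclose(tp_mlp(x, A, B), reference, atol=1e-5)
```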

Parallelizing the multi-headed attention layers is even simpler, since they are already inherently parallel, due to having multiple independent heads!


Special considerations:

DeepSpeed calls it tensor slicing

Zero Redundancy Optimizer (ZeRO) Data Parallel

ZeRO-powered data parallelism (ZeRO-DP) is described on the following diagram from this blog post.


Put simply, ZeRO is just the usual DataParallel (DP), except that instead of replicating the full model params, gradients, and optimizer states, each GPU stores only a slice of them.

Then, at run-time, when the full layer params are needed for a given layer, all GPUs synchronize to give each other the parts they are missing - that's it.

More details on zhihu.

DeepSpeed and FairScale (FSDP) support ZeRO-DP stages 1+2+3.

ZeRO-Offload

Offload is a strategy that supplements GPU VRAM with CPU RAM, since CPU RAM is cheaper.

We want minimal GPU VRAM cost together with an efficient communication strategy. With offload, the forward/backward passes stay in GPU VRAM because they are compute-heavy, while the parameter update, float2half conversion, and optimizer states go to CPU RAM because they are cheaper to compute but memory-hungry. Other memory such as activations, buffers, and fragmentation can be reduced with checkpointing.
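A hedged sketch of what this might look like with DeepSpeed's ZeRO-Offload (config keys follow the DeepSpeed documentation; the model and batch size are placeholders):

```python
import torch
import deepspeed

# Optimizer state and the fp32 parameter update are kept in CPU RAM,
# while the fp16 forward/backward passes stay on the GPU.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}

model = torch.nn.Linear(1024, 1024)   # placeholder model
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```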


Problem:


PyTorch Fully Sharded Data Parallel (FSDP)
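FSDP shards parameters, gradients and optimizer state across ranks and gathers full parameters layer by layer only when needed, much like ZeRO stage 3. A minimal sketch (toy model; assumed to be launched with torchrun):

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 1024)
).to(local_rank)

# Each rank stores only its shard of params/grads/optimizer state;
# full params are all-gathered just in time for forward/backward.
model = FSDP(model)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
```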

shm007g commented 1 year ago

2D and 3D parallelism

DP + TP


DP + PP + TP


DP+PP+TP+ZeRO

shm007g commented 1 year ago

Parameter efficient


[LoRA: Low-Rank Adaptation of Large Language Models, 2021/06, MSFT]

PEFT

https://github.com/huggingface/peft
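A minimal LoRA fine-tuning setup with the peft library (base model name and hyper-parameters are illustrative only):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")  # placeholder base model

# LoRA: freeze the base weights and train small low-rank adapters
# injected into the attention projections.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                   # rank of the update matrices
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # module names depend on the base architecture
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # only a small fraction of params is trainable
```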

Accelerate: https://huggingface.co/docs/accelerate/main/en/package_reference/cli#accelerate-launch

also refer to issue 3

Gradient Checkpointing

https://qywu.github.io/2019/05/22/explore-gradient-checkpointing.html

https://openai.com/research/sparse-transformer
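A minimal sketch of gradient (activation) checkpointing with torch.utils.checkpoint (toy block): activations inside the checkpointed block are dropped after the forward pass and recomputed during backward, trading compute for memory.

```python
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
)

x = torch.randn(8, 1024, requires_grad=True)
y = checkpoint(block, x)   # inner activations are recomputed during backward
y.sum().backward()
```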

fp16 - mixed precision

https://pytorch.org/docs/stable/notes/amp_examples.html
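A minimal mixed-precision loop following the pattern in the PyTorch AMP examples (toy model, optimizer and data):

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):
    x = torch.randn(32, 1024, device="cuda")
    opt.zero_grad()
    with torch.cuda.amp.autocast():   # run the forward pass in reduced precision where safe
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()     # scale the loss to avoid fp16 gradient underflow
    scaler.step(opt)                  # unscales grads and skips the step on inf/NaN
    scaler.update()
```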

Quantization

Int8 - bitsandbytes / triton

Int4 - gptq / ggml
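A hedged sketch of Int8 loading with bitsandbytes through transformers (model name is a placeholder; the flag reflects the transformers API at the time of writing):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-1.3b"   # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)

# LLM.int8(): weights are stored in int8, outlier features are kept in higher precision.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,    # requires bitsandbytes
    device_map="auto",    # let accelerate spread layers over available GPUs/CPU
)
```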