tloen / llama-int8

Quantized inference code for LLaMA models
GNU General Public License v3.0

Assign the parameters of each layer to multiple CUDA devices automatically. #13

Open lipan6461188 opened 1 year ago

lipan6461188 commented 1 year ago

I implemented a `model.Transformer.cuda` method that automatically assigns the parameters of each layer to the detected CUDA devices. This makes it possible to load the 65B model onto two or more 40GB A100 GPUs with the following command:

```
CUDA_VISIBLE_DEVICES=0,1 python example.py --ckpt_dir /path/to/model/65B --tokenizer_path /path/to/model/tokenizer.model --max_batch_size=1
```
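
For reference, here is a minimal sketch of what such a `cuda` method could look like, assuming the model follows the reference LLaMA layout (`self.layers`, `self.tok_embeddings`, `self.norm`, `self.output`). The even layer split and the `_assigned_device` attribute are illustrative choices, not necessarily the exact code in this patch:

```python
import torch

def cuda(self):
    """Spread the transformer blocks evenly across all visible CUDA devices.

    Illustrative sketch only: attribute names follow the reference LLaMA
    model, but the partitioning scheme is an assumption.
    """
    n_devices = torch.cuda.device_count()
    per_device = -(-len(self.layers) // n_devices)  # ceiling division
    for i, layer in enumerate(self.layers):
        device = torch.device(f"cuda:{min(i // per_device, n_devices - 1)}")
        layer.to(device)
        layer._assigned_device = device  # remembered so forward() can move activations
    # Keep the embedding on the first device and the final norm / output
    # head on the last, matching the direction activations flow.
    self.tok_embeddings.to("cuda:0")
    self.norm.to(f"cuda:{n_devices - 1}")
    self.output.to(f"cuda:{n_devices - 1}")
    return self
```

For this to work at inference time, the forward pass also has to move the hidden state onto each layer's device before calling it (e.g. `h = h.to(layer._assigned_device)`); otherwise the blocks placed on the second GPU would receive tensors living on the first.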