meta-llama / llama

Inference code for Llama models

how to run the largest possible model on a single A100 80Gb #101

Closed: huey2531 closed this 10 months ago

huey2531 commented 1 year ago

I was able to get the 7B model to work. It looks like I might be able to run the 33B version? Will I need to merge the checkpoint files (.pth) to run on a single GPU, and set MP = 1?

It would be great if FAIR could provide some guidance on VRAM requirements.

pauldog commented 1 year ago

I had an idea that you might use GPU virtualisation to make it appear like you had two GPUs. Well, it wasn't my idea, it was ChatGPT's.

cedrickchee commented 1 year ago

It would be great if FAIR could provide some guidance on VRAM requirements.

See: memory requirements for each model size.

It's a small page from my own experiments where I also document things I've learned. While the official documentation is lacking for now, you can also learn from the good discussions around this project in various GitHub issues.

If I'm not mistaken, 65B needs a GPU cluster with a total of 250GB of VRAM in fp16 precision, or half that in int8.

(I'm not affiliated with FAIR.)

benob commented 1 year ago

I had an idea that you might use GPU virtualisation to make it appear like you had two GPUs. Well, it wasn't my idea, it was ChatGPT's.

Seems that torch.distributed does not support GPU virtualization.

benob commented 1 year ago

I was able to run the 13B and 30B (batch size 1) models on a single A100-80GB. I used a script [1] to reshard the models and ran torchrun with --nproc_per_node 1.

[1] https://gist.github.com/benob/4850a0210b01672175942203aa36d300
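
For anyone wondering what the reshard step actually does: the sketch below is a rough illustration of the idea (not the linked gist itself), assuming the reference code's Megatron-style split, where column-parallel weights are rejoined along dim 0, row-parallel weights and the token embeddings along dim 1, and everything else is replicated. The paths are hypothetical.

```python
import torch

# Hypothetical paths: the 30B checkpoint ships as 4 model-parallel shards.
shard_paths = [f"30B/consolidated.{i:02d}.pth" for i in range(4)]
shards = [torch.load(p, map_location="cpu") for p in shard_paths]

# In the reference code, column-parallel weights are split along dim 0,
# row-parallel weights and the token embeddings along dim 1, and the rest
# (norms, rope freqs) is replicated across shards.
DIM0 = ("wq.weight", "wk.weight", "wv.weight", "w1.weight", "w3.weight", "output.weight")
DIM1 = ("wo.weight", "w2.weight", "tok_embeddings.weight")

merged = {}
for name in shards[0]:
    tensors = [shard[name] for shard in shards]
    if name.endswith(DIM0):
        merged[name] = torch.cat(tensors, dim=0)
    elif name.endswith(DIM1):
        merged[name] = torch.cat(tensors, dim=1)
    else:
        merged[name] = tensors[0]  # replicated parameters

torch.save(merged, "30B-merged/consolidated.00.pth")
```

Note that this naive version holds all the shards plus the merged copy in CPU RAM at once, which is roughly why a 128GB machine can fall over on the 30B merge.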

baleksey commented 1 year ago

@benob Thanks for the script! I successfully ran 13B with it, but I failed while resharding the 30B model because I ran out of memory (128GB of RAM is apparently not enough for this). Is there any way you could share your combined 30B model so I can try to run it on my A6000-48GB? Thank you so much in advance!

USBhost commented 1 year ago

@benob Thanks for the script! I successfully ran 13B with it, but I failed while resharding the 30B model because I ran out of memory (128GB of RAM is apparently not enough for this). Is there any way you could share your combined 30B model so I can try to run it on my A6000-48GB? Thank you so much in advance!

For one, you can't even run the 33B model in 16-bit mode; you would need another 16GB+ of VRAM.

But using the 8-bit fork you can load it without needing to reshard it.
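
For context, the 8-bit path goes through the Hugging Face stack rather than the reference repo. A minimal sketch, assuming the checkpoints have already been converted to the Transformers format and bitsandbytes is installed (the local path is hypothetical):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./llama-30b-hf"  # hypothetical path to weights converted to HF format

tokenizer = AutoTokenizer.from_pretrained(model_path)

# load_in_8bit quantizes the linear layers with bitsandbytes (LLM.int8()),
# roughly halving VRAM versus fp16; device_map="auto" places layers on the
# available GPU(s) automatically.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    load_in_8bit=True,
)

inputs = tokenizer("The largest LLaMA that fits in 48GB is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```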

baleksey commented 1 year ago

@USBhost Got it, thanks for the clarification! Can you direct me to the 8-bit model/code for 33B so I can continue my experiments? :)

USBhost commented 1 year ago

@USBhost Got it, thanks for the clarification! Can you direct me to the 8-bit model/code for 33B so I can continue my experiments? :)

https://github.com/oobabooga/text-generation-webui/issues/147#issuecomment-1454850191

I'm also a fellow A6000 user.

baleksey commented 1 year ago

@USBhost Thank you! Have you managed to run the 33B model with it? I still get OOMs after model quantization...

USBhost commented 1 year ago

@USBhost Thank you! Have you managed to run the 33B model with it? I still get OOMs after model quantization...

I'm using ooba with: python server.py --listen --model LLaMA-30B --load-in-8bit --cai-chat

If you just want to use LLaMA-8bit then only run with --nproc_per_node 1.

MarkSchmidty commented 1 year ago

It would be great if FAIR could provide some guidance on VRAM requirements.

See: memory requirements for each model size.

It's a small page from my own experiments where I also document things I've learned. While the official documentation is lacking for now, you can also learn from the good discussions around this project in various GitHub issues.

If I'm not mistaken, 65B needs a GPU cluster with a total of 250GB of VRAM in fp16 precision, or half that in int8.

(I'm not affiliated with FAIR.)

The linked memory requirement calculation table is adding the wrong rows together, I think. The corrected table should look like:

Memory requirements in 8-bit precision:

| Model (on disk)*** | 13 | 24 | 60 | 120 |
| --- | --- | --- | --- | --- |
| Memory Requirements (GB) | 6.7 | 13 | 32.5 | 65.2 |
| Cache | 1 | 2 | 3 | 5 |
| Total | 7.7 | 16 | 35.5 | 70.2 |

This seems to match up more closely with the actual VRAM usage people are reporting in https://github.com/oobabooga/text-generation-webui/issues/147

For example: "LLaMA-7B: 9225MiB" "LLaMA-13B: 16249MiB" "The 30B uses around 35GB of vram at 8bit."

If this is true then 65B should fit on a single A100 80GB after all.

cedrickchee commented 1 year ago

@MarkSchmidty

The linked memory requirement calculation table is adding the wrong rows together

Not exactly.

The corrected table should look like: ...

Yours looks almost correct. I have updated the table based on your findings (a copy below).

Memory requirements in 8-bit precision:

To prevent all sorts of confusion, let's keep the precision in fp16 (before 8-bit quantization).

I need to point out that when people report their actual VRAM usage, they never state the model arguments. The most important ones are max_batch_size and max_seq_length, which affect the VRAM required (set them too large and you run into OOM). FAIR should really set max_batch_size to 1 by default; it's 32 now.

Based on the Transformer KV cache formula, with a max_batch_size of 1 and a max_seq_length of 1024, the table looks like this now (a rough version of the cache calculation is sketched below the table):

Memory requirements in fp16 precision:

| Model Params (Billions) | 6.7 | 13 | 32.5 | 65.2 |
| --- | --- | --- | --- | --- |
| Model on disk (GB)*** | 13 | 26 | 65 | 130 |
| Cache (GB) | 1 | 1 | 2 | 3 |
| Total (GB) | 14 | 27 | 67 | 133 |
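
As a rough sanity check of the cache row, the usual KV cache estimate is 2 (K and V) x n_layers x batch x seq_len x hidden_dim x 2 bytes in fp16; plugging in the published LLaMA layer counts and hidden sizes:

```python
# KV cache per the formula above: 2 (K and V) * n_layers * batch * seq_len
# * hidden_dim * 2 bytes (fp16). Layer counts / hidden sizes are the
# published LLaMA configs.
configs = {"7B": (32, 4096), "13B": (40, 5120), "33B": (60, 6656), "65B": (80, 8192)}

batch, seq_len, bytes_per_elem = 1, 1024, 2
for name, (n_layers, dim) in configs.items():
    cache_bytes = 2 * n_layers * batch * seq_len * dim * bytes_per_elem
    print(f"{name}: {cache_bytes / 1e9:.1f} GB of KV cache")
# ~0.5, 0.8, 1.6, 2.7 GB -- rounded up, these match the 1/1/2/3 GB cache row above.
```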

This seems to match up more closely with the actual VRAM usage people are reporting

Sounds right to me. I have also checked the above model memory requirements against the GPU requirements (from my repo). The two closely match up now.

If this is true then 65B should fit on a single A100 80GB after all.

133GB in fp16 means 66.5GB in int8. Your estimate is 70.2GB; I guess the small difference is rounding error. 66.5GB of VRAM is within the range of 1x A100 80GB, so it should fit ~but I can't confirm it though~. It's true now: more people have got it working. One example is this:

I'm running LLaMA-65B on a single A100 80GB with 8bit quantization. ... The output is at least as good as davinci.

MarkSchmidty commented 1 year ago

4-bit for LLaMA is underway: https://github.com/oobabooga/text-generation-webui/issues/177

65B in int4 fits on a single A100 40GB, even further reducing the cost of accessing this powerful model.

Int4 LLaMA VRAM usage is approx. as follows:

| Model Params (Billions) | 6.7 | 13 | 32.5 | 65.2 |
| --- | --- | --- | --- | --- |
| Model on disk (GB) | 13 | 26 | 65 | 130 |
| Int4 model size (GB) | 3.2 | 6.5 | 16.2 | 32.5 |
| Cache (GB) | 1 | 2 | 2 | 3 |
| Total VRAM Used (GB) | 4.2 | 8.5 | 18.2 | 35.5 |
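
As a rough cross-check (not the method used to build the table), int4 weights come out to about half a byte per parameter; the table's figures differ slightly because real int4 checkpoints carry per-group scales and usually keep some layers in higher precision:

```python
# Back-of-the-envelope int4 weight sizes: roughly half a byte per parameter.
# Real 4-bit checkpoints come out a bit different because of per-group
# scales/zeros and layers (embeddings, norms) kept in higher precision.
params_billions = {"7B": 6.7, "13B": 13.0, "33B": 32.5, "65B": 65.2}

for name, p in params_billions.items():
    weight_gib = p * 1e9 * 0.5 / 2**30  # 4 bits per parameter
    print(f"{name}: ~{weight_gib:.1f} GiB of int4 weights")
# ~3.1, 6.1, 15.1, 30.4 GiB -- the same ballpark as the int4 row in the table.
```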

(30B should fit on 20GB+ cards, and 7B will fit on cards with under 8GB or even 6GB if they support int8.)

cedrickchee commented 1 year ago

4-bit for LLaMA

I'm well aware. It's a matter of time. This 4-bit post-training quantization technique is different from the one that Tim Dettmers is working on.

65B in int4 fits on a single v100 40GB, even further reducing the cost to access this powerful model.

Exciting, but the big question is: does this technique reduce model performance (both inference throughput and accuracy)? To my understanding, I couldn't find credible results proving that the model performance closely matches LLM.int8(). Furthermore, the custom CUDA kernel implementation makes deployment harder by a large margin. I can wait a little longer for bitsandbytes 4-bit. There is a lot more out there beyond GPTQ and bitsandbytes, like SmoothQuant, etc. I actually maintain a list of possible model compression and acceleration papers and methods here if you're interested.

Disclaimer: I'm not an expert in this area.

Int4 LLaMA VRAM usage is approx. as follows:

Thanks for the heads-up. I will update and add this to my repo accordingly.

MarkSchmidty commented 1 year ago

That particular 4-bit implementation is mostly a proof of concept at this point. Bitsandbytes may be getting some 4-bit functionality towards the end of this month. Best to wait for that.

amitsangani commented 10 months ago

Old issue. With Llama 2 you should be able to run inference on the Llama 70B model on a single A100 GPU with enough memory.
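
For anyone landing here now, a minimal sketch of that with today's tooling, assuming the gated Hugging Face Llama 2 weights and a transformers/bitsandbytes version with 4-bit support (4-bit rather than 8-bit leaves comfortable headroom for the KV cache on one 80GB A100):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-70b-chat-hf"  # gated repo: needs an accepted license and an HF token

tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit (bitsandbytes NF4) brings the 70B weights down to roughly 35-40GB,
# leaving room for the KV cache on a single 80GB A100.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    load_in_4bit=True,
)

inputs = tokenizer("Explain model parallelism in one sentence.", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```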