tloen / llama-int8

Quantized inference code for LLaMA models
GNU General Public License v3.0

LLaMA 13B works on a single RTX 4080 16GB #17

Open kcchu opened 1 year ago

kcchu commented 1 year ago

https://github.com/facebookresearch/llama/issues/79#issuecomment-1465779961
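The back-of-envelope arithmetic behind the title claim: at int8 precision the weights alone take roughly one byte per parameter, which is what lets 13B squeeze into a 16GB card when fp16 cannot. A minimal sketch of that estimate (the parameter counts are nominal model sizes; real usage also includes activations, the KV cache, and CUDA overhead, so treat these as lower bounds):

```python
# Back-of-envelope VRAM estimate for LLaMA weights at different precisions.
# Ignores activations, KV cache, and framework overhead, so these are
# lower bounds on actual GPU memory use.

def weight_vram_gib(n_params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights, in GiB."""
    return n_params * bytes_per_param / 1024**3

for name, n in [("7B", 7e9), ("13B", 13e9)]:
    fp16 = weight_vram_gib(n, 2)   # float16: 2 bytes per parameter
    int8 = weight_vram_gib(n, 1)   # int8 quantized: 1 byte per parameter
    print(f"LLaMA {name}: fp16 ~ {fp16:.1f} GiB, int8 ~ {int8:.1f} GiB")
```

By this estimate, 13B needs about 24.2 GiB of weights in fp16 but only about 12.1 GiB in int8, which is consistent with it fitting on an RTX 4080 16GB with room left for activations.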

System:

LLaMA 13B

(screenshot omitted)

LLaMA 7B

chrisbward commented 1 year ago

Using the above methods on a 3090 Ti 24GB:

LLaMA 13B - about 30 seconds to load (with 50GB of swap), about 30 seconds for inference