tloen / llama-int8

Quantized inference code for LLaMA models
GNU General Public License v3.0

LLaMA 13B works on a single RTX 4080 16GB #17

Open kcchu opened 1 year ago

kcchu commented 1 year ago

https://github.com/facebookresearch/llama/issues/79#issuecomment-1465779961
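The back-of-envelope arithmetic behind the title claim: at int8 precision the weights alone take roughly one byte per parameter, which is what lets 13B squeeze into a 16GB card when fp16 cannot. A minimal sketch of that estimate (the parameter counts are nominal model sizes; real usage also includes activations, the KV cache, and CUDA overhead, so treat these as lower bounds):

```python
# Back-of-envelope VRAM estimate for LLaMA weights at different precisions.
# Ignores activations, KV cache, and framework overhead, so these are
# lower bounds on actual GPU memory use.

def weight_vram_gib(n_params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights, in GiB."""
    return n_params * bytes_per_param / 1024**3

for name, n in [("7B", 7e9), ("13B", 13e9)]:
    fp16 = weight_vram_gib(n, 2)   # float16: 2 bytes per parameter
    int8 = weight_vram_gib(n, 1)   # int8 quantized: 1 byte per parameter
    print(f"LLaMA {name}: fp16 ~ {fp16:.1f} GiB, int8 ~ {int8:.1f} GiB")
```

By this estimate, 13B needs about 24.2 GiB of weights in fp16 but only about 12.1 GiB in int8, which is consistent with it fitting on an RTX 4080 16GB with room left for activations.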

System:

LLaMA 13B

(screenshot omitted)

LLaMA 7B

chrisbward commented 1 year ago

Using the above methods on a 3090 Ti 24GB:

LLaMA 13B - about 30 seconds to load (with 50GB of swap), about 30 seconds for inference