tloen / llama-int8

Quantized inference code for LLaMA models
GNU General Public License v3.0

Can 8 GB of VRAM run the smallest LLaMA model? #5

Open lucasjinreal opened 1 year ago

lucasjinreal commented 1 year ago

Can 8 GB of VRAM run the smallest LLaMA model?

dylancvdean commented 1 year ago

No, the 7B model uses about 8600 MB of VRAM.
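For anyone trying to reproduce numbers like this, a quick way to check peak VRAM is PyTorch's built-in memory counters. A minimal sketch using standard `torch.cuda` APIs (not code from this repo):

```python
import torch

# Reset the peak counter before loading the model and generating.
torch.cuda.reset_peak_memory_stats()

# ... load the 7B checkpoint and run generation here ...

# Peak bytes ever allocated by the caching allocator on the current device.
peak_gib = torch.cuda.max_memory_allocated() / 2**30
print(f"peak allocated: {peak_gib:.2f} GiB")
```

Note that this counts the caching allocator's usage, which can differ slightly from what `nvidia-smi` reports for the whole process.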

ubik2 commented 1 year ago

I was able to run 7B with max_batch_size=1 in 7.12 GiB. With max_batch_size=4, it used 7.87 GiB. Decreasing max_seq_len may allow for a higher batch size.
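The reason both knobs matter: upstream llama pre-allocates a key/value cache of shape (max_batch_size, max_seq_len, n_heads, head_dim) per layer, so cache memory grows linearly in both parameters. A back-of-the-envelope sketch, assuming this fork keeps that layout and using the 7B shape constants (32 layers, 32 heads, head_dim 128 from dim 4096 / 32 heads):

```python
# Estimate KV-cache size, assuming fp16 cache entries and upstream llama's
# per-layer cache of shape (max_batch_size, max_seq_len, n_heads, head_dim).
def kv_cache_gib(max_batch_size: int, max_seq_len: int,
                 n_layers: int = 32, n_heads: int = 32,
                 head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    # Leading factor of 2: one cache for keys, one for values.
    elems = 2 * n_layers * max_batch_size * max_seq_len * n_heads * head_dim
    return elems * bytes_per_elem / 2**30

print(kv_cache_gib(1, 512))  # 0.25 GiB
print(kv_cache_gib(4, 512))  # 1.00 GiB
```

With max_seq_len=512 the cache alone grows from 0.25 GiB at batch 1 to 1 GiB at batch 4, a 0.75 GiB difference, which is consistent with the gap between the two runs reported above.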

lucasjinreal commented 1 year ago

@ubik2 are there any inference results with int8?

ubik2 commented 1 year ago

> @ubik2 are there any inference results with int8?

I was able to generate text responses based on the example prompt. The quality may not be that great, though.
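For reference, a generation call in this style of codebase looks roughly like the sketch below; the `generate` signature and the sampling defaults mirror upstream facebookresearch/llama's example script and may differ in this fork. `generator` is assumed to be an already-loaded `llama.LLaMA` instance.

```python
# Hedged sketch of a generation call, mirroring the upstream llama example.
prompts = ["I believe the meaning of life is"]  # an upstream example prompt
results = generator.generate(
    prompts,
    max_gen_len=256,   # number of new tokens to sample
    temperature=0.8,   # upstream example default
    top_p=0.95,        # nucleus sampling cutoff, upstream example default
)
for r in results:
    print(r)
```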