juncongmoo / pyllama

LLaMA: Open and Efficient Foundation Language Models
GNU General Public License v3.0
2.8k stars 312 forks

A question about single-GPU inference #74

Open TitleZ99 opened 1 year ago

TitleZ99 commented 1 year ago

Thanks for this great work. I'm wondering how to run inference on a single 8GB GPU, like the example shown in the README. I tried it on my RTX 2080 Ti with 11GB and got a CUDA out of memory error.

tanglaoya321 commented 1 year ago

Same problem. On a single GPU without quantization, the 7B model in fp32 should need about 4 bytes × 7B parameters = 28GB of memory just for the weights.
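
For reference, a minimal back-of-the-envelope sketch of that estimate (assumptions on my part: weights only, 7B = 7e9 parameters, ignoring activations and the KV cache):

```python
# Rough weight-memory estimate for LLaMA-7B at different precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}
NUM_PARAMS = 7e9  # approximate parameter count of the 7B model

for dtype, nbytes in BYTES_PER_PARAM.items():
    gib = NUM_PARAMS * nbytes / 1024**3
    print(f"{dtype}: ~{gib:.1f} GiB for the weights alone")

# fp32: ~26.1 GiB  -> roughly the 4 * 7B ≈ 28 GB figure above
# fp16: ~13.0 GiB  -> still does not fit an 11 GB RTX 2080 Ti
# int8: ~6.5 GiB   -> could fit an 8 GB GPU, which is presumably why the
#                     README's single-GPU example relies on quantization
# int4: ~3.3 GiB
```

So without quantization, an 11GB card running the unmodified fp32 or even fp16 checkpoint is expected to run out of memory.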