qwopqwop200 / GPTQ-for-LLaMa

4-bit quantization of LLaMA using GPTQ
Apache License 2.0

CUDA out of memory on flan-ul2 #265

Closed: sigmareaver closed this issue 1 year ago

sigmareaver commented 1 year ago

Tested on a 4090, using this command:

python t5.py ../full-models/flan-ul2 c4 --wbits 4 --act-order --groupsize 128 --save ../gptq-models/flan-ul2-gptq/flan-ul2-4bit-128g-gptq.pt

What is the memory requirement for quantizing a 20B model? I thought it should only need one layer at a time on the GPU?
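
For reference, the layer-by-layer GPTQ pass is expected to keep only one transformer block on the GPU at a time, roughly as in the minimal sketch below. The function and the toy blocks are illustrative, not the actual t5.py code; a real pass would also quantize each block's Linear layers against the calibration activations.

    import torch
    import torch.nn as nn

    @torch.no_grad()
    def sequential_quantize(blocks, calib_acts, dev="cuda"):
        # Move one block to the GPU, process the calibration activations,
        # then move the block back, so peak GPU usage stays around one
        # block plus the activations. In real GPTQ the block's Linear
        # layers would be quantized here; this sketch only propagates
        # the activations.
        for i, block in enumerate(blocks):
            blocks[i] = block.to(dev)
            calib_acts = blocks[i](calib_acts.to(dev)).cpu()
            blocks[i] = blocks[i].cpu()
            torch.cuda.empty_cache()
        return blocks, calib_acts

    # Toy usage: 24 stand-in "blocks" and a small batch of activations.
    dev = "cuda" if torch.cuda.is_available() else "cpu"
    blocks = nn.ModuleList(nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()) for _ in range(24))
    acts = torch.randn(8, 1024)
    blocks, acts = sequential_quantize(blocks, acts, dev=dev)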

sigmareaver commented 1 year ago

I was able to quantize it by using --nsamples 256 and hacking part of the code in t5_sequential, the part that applies the final layer norm and dropout, so that it runs on the CPU.
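
For anyone hitting the same OOM, the hack amounts to something like the sketch below. The module names (final_layer_norm, dropout) follow the Hugging Face T5/UL2 encoder layout, and the helper is a hypothetical stand-in for the relevant part of t5_sequential, not the actual code.

    import torch

    @torch.no_grad()
    def finalize_on_cpu(encoder, hidden_states):
        # Hypothetical stand-in for the hacked part of t5_sequential:
        # apply the encoder's final layer norm and dropout on the CPU so
        # this last step of the pass does not allocate on the GPU.
        final_norm = encoder.final_layer_norm.cpu()
        dropout = encoder.dropout.cpu()
        return dropout(final_norm(hidden_states.cpu()))

Running those two small modules on the CPU presumably keeps the full set of calibration activations off the GPU at that step, at a negligible speed cost.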