qwopqwop200 / GPTQ-for-LLaMa

4-bit quantization of LLaMA using GPTQ
Apache License 2.0

Fix multigpu #251

Closed · LQ1234 closed this 1 year ago

LQ1234 commented 1 year ago

Fixes multi-GPU inference (splitting the model's layers across devices via the --layers-dist flag) and adds a generation test.
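
For reference, here is a minimal sketch of how a --layers-dist spec such as 24:24:10:12 could assign contiguous blocks of decoder layers to GPUs. The helper names (parse_layers_dist, distribute_layers) are illustrative, not the functions this PR actually adds:

```python
# Illustrative only: parse_layers_dist / distribute_layers are hypothetical
# helpers sketching the technique, not code taken from this PR.
import torch

def parse_layers_dist(spec: str) -> list[int]:
    # "24:24:10:12" -> [24, 24, 10, 12]: layer count per visible GPU
    return [int(n) for n in spec.split(":")]

def distribute_layers(layers, dist):
    # Place each decoder layer on its assigned device, one contiguous
    # block of layers per GPU, and record the resulting device map.
    assert sum(dist) == len(layers), "counts must sum to the model depth"
    device_of = {}
    idx = 0
    for gpu, count in enumerate(dist):
        for _ in range(count):
            layers[idx] = layers[idx].to(torch.device("cuda", gpu))
            device_of[idx] = gpu
            idx += 1
    return device_of
```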

Expected:

```
CUDA_VISIBLE_DEVICES=0,1,2,3 python llama.py /media/ssd/LLaMA/65B-hf/llama-65b --wbits 4 --groupsize 128 --load /media/ssd/LLaMA/65B-quantized/llama65b-4bit-128g-new.safetensors --layers-dist 24:24:10:12 --test-generation wikitext2
...
The capital of New Mexico is Santa Fe.
```
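
The --test-generation path presumably runs a short greedy decode as a smoke test after the quantized model is loaded and split. A hedged sketch of what such a test could look like, assuming a Hugging Face-style model and tokenizer (the prompt, max_new_tokens, and the generate() call are assumptions, not the PR's exact test):

```python
# Assumed shape of a generation smoke test; the prompt and the use of
# HF generate() are illustrative, not taken from this PR.
import torch
from transformers import AutoTokenizer

def test_generation(model, tokenizer_path: str,
                    prompt: str = "The capital of New Mexico is"):
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
    # Inputs go to the first device; the model then forwards activations
    # across GPUs according to the layer split.
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda:0")
    with torch.no_grad():
        out = model.generate(input_ids, max_new_tokens=8, do_sample=False)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
```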