bash99 opened this issue 1 year ago
- GGML: not yet (https://github.com/ggerganov/llama.cpp/issues/247)
- GPTQ: not really; you can quantize, but the result is not very good (https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/157)
You can try the default quantization method in FastChat (https://github.com/lm-sys/FastChat#no-enough-memory), but I haven't tested it, so you may need to fix some bugs.
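For reference, the README section linked above enables 8-bit compression through a CLI flag; a minimal invocation (using the standard fastchat-t5 model path) would look like:

```
python3 -m fastchat.serve.cli --model-path lmsys/fastchat-t5-3b-v1.0 --load-8bit
```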
@bradfox2 Regarding GPTQ, is the performance degradation specific to T5, or does it affect all LLMs?
@zhisbug AFAIK just T5
@merrymercy I tried the one in FastChat. It produced inf/NaN elements in the final output; I'll need to dig into it more.
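In case it helps anyone reproducing this, here is a small hypothetical helper (not part of FastChat) to locate the first layer where non-finite values appear:

```python
import torch

def report_non_finite(tensor, name="tensor"):
    # Count inf/NaN entries so the offending layer can be bisected.
    if isinstance(tensor, torch.Tensor):
        mask = ~torch.isfinite(tensor)
        if mask.any():
            print(f"{name}: {mask.sum().item()} non-finite of {tensor.numel()} elements")

def add_nan_hooks(model):
    # Attach a forward hook to every submodule to flag the first blow-up.
    for name, module in model.named_modules():
        module.register_forward_hook(
            lambda mod, inp, out, n=name: report_non_finite(out, n)
        )
```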
The following converted and quantized model, which runs on CPU only, may be helpful: https://huggingface.co/limcheekin/fastchat-t5-3b-ct2
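For anyone who wants to try it, here is a minimal sketch of running that CT2 conversion on CPU, assuming the converted model directory and tokenizer files can both be pulled from that repo (CTranslate2 treats T5 as a sequence-to-sequence Translator):

```python
import ctranslate2
import transformers

# Point this at the downloaded model directory from the HF repo above.
translator = ctranslate2.Translator("fastchat-t5-3b-ct2", device="cpu")
tokenizer = transformers.AutoTokenizer.from_pretrained("limcheekin/fastchat-t5-3b-ct2")

prompt = "What is model quantization?"
input_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))

results = translator.translate_batch([input_tokens], max_decoding_length=256)
output_tokens = results[0].hypotheses[0]
print(tokenizer.decode(tokenizer.convert_tokens_to_ids(output_tokens)))
```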
CT2's quantization method is not GPTQ or another 'degradation-free' method, and it carries more severe performance penalties.
I'd appreciate it if you could publish evaluation metrics comparing CT2 and GPTQ. Please also share which other 'degradation-free' methods are available; it would benefit everyone following the thread.
Thanks.
Is there any way to run it in 4 GB or less of VRAM?
GGML? Or GPTQ?
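Whichever of those you try, one untested option for the 4 GB question above is bitsandbytes 4-bit loading through transformers; given the T5 degradation reports earlier in this thread, output quality may suffer:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "lmsys/fastchat-t5-3b-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# load_in_4bit needs the bitsandbytes package and a CUDA GPU;
# a 3B model with 4-bit weights should fit well under 4 GB of VRAM.
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_id,
    device_map="auto",
    load_in_4bit=True,
)

inputs = tokenizer("What is model quantization?", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```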