lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Apache License 2.0

fastchat-t5 quantization support? #925

bash99 opened this issue 1 year ago · Status: Open

bash99 commented 1 year ago

Is there any way to run it in 4 GB of VRAM or less?

GGML? Or GPTQ?

bradfox2 commented 1 year ago

GGML: not yet - https://github.com/ggerganov/llama.cpp/issues/247
GPTQ: not really - you can quantize, but the quality is not very good - https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/157

merrymercy commented 1 year ago

You can try the default quantization method in FastChat (https://github.com/lm-sys/FastChat#no-enough-memory), but I haven't tested it with this model, so you may need to fix some bugs.
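For reference, a minimal sketch of an 8-bit loading path, assuming the Hugging Face transformers + bitsandbytes route rather than FastChat's own `--load-8bit` compression (they are different code paths, and whether the bitsandbytes route behaves well for T5-family models is an assumption):

```python
# Hedged sketch: load fastchat-t5 with 8-bit weights via transformers +
# bitsandbytes. This is an alternative to FastChat's built-in --load-8bit
# compression, not the same code path.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "lmsys/fastchat-t5-3b-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_id,
    load_in_8bit=True,   # requires bitsandbytes; weights stored as int8
    device_map="auto",   # requires accelerate; places layers automatically
)

inputs = tokenizer("What is the capital of France?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```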

zhisbug commented 1 year ago

@bradfox2 Regarding GPTQ, is the performance degradation specific to T5 or common to all LLMs?

bradfox2 commented 1 year ago

@zhisbug AFAIK just T5

DachengLi1 commented 1 year ago

@merrymercy I tried the one in FastChat. It produced inf/NaN elements in the final output; I'll need to dig into it more.
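For anyone trying to reproduce this, here is a hypothetical debugging helper (not FastChat code) that registers PyTorch forward hooks to report which modules emit inf/NaN values:

```python
# Hypothetical helper: hook every submodule and print the names of modules
# whose tensor output contains inf/NaN elements.
import torch

def watch_for_bad_values(model: torch.nn.Module) -> None:
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and (
                torch.isinf(output).any() or torch.isnan(output).any()
            ):
                print(f"inf/NaN in output of module: {name}")
        return hook

    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))
```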

limcheekin commented 1 year ago

The following converted and quantized model, which runs on CPU only, should be helpful: https://huggingface.co/limcheekin/fastchat-t5-3b-ct2
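A minimal usage sketch, assuming the converted model has been downloaded to a local directory and that the original fastchat-t5 tokenizer applies (both assumptions):

```python
# Hedged sketch: run a CTranslate2-converted T5 model on CPU with int8 compute.
import ctranslate2
from transformers import AutoTokenizer

model_dir = "fastchat-t5-3b-ct2"  # local path to the converted model (assumption)
translator = ctranslate2.Translator(model_dir, device="cpu", compute_type="int8")
tokenizer = AutoTokenizer.from_pretrained("lmsys/fastchat-t5-3b-v1.0", use_fast=False)

prompt = "What is the capital of France?"
source = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))
result = translator.translate_batch([source])
output_ids = tokenizer.convert_tokens_to_ids(result[0].hypotheses[0])
print(tokenizer.decode(output_ids, skip_special_tokens=True))
```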

bradfox2 commented 1 year ago

CT2's quantization is not GPTQ or another 'degradation-free' method, and it carries more severe quality penalties.

limcheekin commented 1 year ago

> CT2's quantization is not GPTQ or another 'degradation-free' method, and it carries more severe quality penalties.

I'd appreciate it if you could publish evaluation metrics comparing CT2 and GPTQ. Kindly also share what other 'degradation-free' methods are available; it would benefit everyone following the thread.

Thanks.