lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Apache License 2.0

Alternative Implementation of 8bit Quantization #1043

Open Sissel-Wu opened 1 year ago

Sissel-Wu commented 1 year ago

Hi all, thanks a lot for the nice work introducing Vicuna and FastChat.

I am a beginner in NLP (so correct me if I am wrong), and I use GPUs with limited memory, so I would like to train and run inference with 8-bit quantization.

I learned that HuggingFace Transformers ships built-in 8-bit quantization, which can be enabled simply by setting load_in_8bit=True (https://huggingface.co/docs/transformers/v4.28.1/main_classes/quantization). It is claimed to cause nearly zero performance degradation (https://huggingface.co/blog/hf-bitsandbytes-integration), which sounds great for users with limited resources.
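For context, here is a minimal sketch of the built-in path I mean (the checkpoint name is just for illustration; this assumes bitsandbytes is installed alongside transformers):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "lmsys/vicuna-7b-v1.3"  # hypothetical checkpoint, for illustration only

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,   # enable the bitsandbytes LLM.int8() quantization
    device_map="auto",   # required with load_in_8bit; places layers on available GPUs
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```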

However, I noticed that FastChat implements its own quantization rather than using the built-in one. Is there a reason for this choice? I have tried comparing your implementation with HuggingFace's, and the computation results are different.
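For reference, here is roughly how I compared the two paths, by checking the logits each produces for the same prompt. This is only a sketch: the fastchat.model.compression import and the compress_module signature are my guesses at FastChat's internal API and may not match the current code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "lmsys/vicuna-7b-v1.3"  # hypothetical checkpoint, for illustration only
tok = AutoTokenizer.from_pretrained(name)
ids = tok("The capital of France is", return_tensors="pt").input_ids

# Path 1: HuggingFace / bitsandbytes built-in 8-bit quantization
hf_8bit = AutoModelForCausalLM.from_pretrained(
    name, load_in_8bit=True, device_map="auto"
)
with torch.no_grad():
    hf_logits = hf_8bit(ids.to(hf_8bit.device)).logits[0, -1]

# Path 2: FastChat's own 8-bit path (assumed API, used by --load-8bit)
from fastchat.model.compression import compress_module  # assumption; verify against your version

fc_model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.float16
).cuda()
compress_module(fc_model, "cuda")  # assumed signature: (module, target_device)
with torch.no_grad():
    fc_logits = fc_model(ids.cuda()).logits[0, -1]

# The last-token logits differ between the two implementations
print("max abs logit difference:", (hf_logits.cpu() - fc_logits.cpu()).abs().max().item())
```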

Hope to hear from you soon.

abcbdf commented 1 year ago

I also want to know.