lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Apache License 2.0

Alternative Implementation of 8bit Quantization #1043

Open Sissel-Wu opened 1 year ago

Sissel-Wu commented 1 year ago

Hi all, thanks a lot for the nice work introducing Vicuna and FastChat.

I am a beginner in NLP (so correct me if I am wrong), and I use GPUs with limited memory, so I would like to train and run inference with 8-bit quantization.

I learned that HuggingFace Transformers ships built-in 8-bit quantization, which can be enabled simply by setting load_in_8bit=True (https://huggingface.co/docs/transformers/v4.28.1/main_classes/quantization). It is claimed to cause nearly zero performance degradation (https://huggingface.co/blog/hf-bitsandbytes-integration), which sounds great for users with limited resources.
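For context, here is a minimal sketch of the built-in path I mean (the checkpoint name is just for illustration; this assumes bitsandbytes is installed alongside transformers):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "lmsys/vicuna-7b-v1.3"  # hypothetical checkpoint, for illustration only

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,   # enable the bitsandbytes LLM.int8() quantization
    device_map="auto",   # required with load_in_8bit; places layers on available GPUs
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```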

However, I noticed that FastChat implements its own quantization rather than using the built-in one. Is there a reason for this choice? I have tried comparing your implementation with HuggingFace's, and the computation results are different.
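For reference, here is roughly how I compared the two paths, by checking the logits each produces for the same prompt. This is only a sketch: the fastchat.model.compression import and the compress_module signature are my guesses at FastChat's internal API and may not match the current code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "lmsys/vicuna-7b-v1.3"  # hypothetical checkpoint, for illustration only
tok = AutoTokenizer.from_pretrained(name)
ids = tok("The capital of France is", return_tensors="pt").input_ids

# Path 1: HuggingFace / bitsandbytes built-in 8-bit quantization
hf_8bit = AutoModelForCausalLM.from_pretrained(
    name, load_in_8bit=True, device_map="auto"
)
with torch.no_grad():
    hf_logits = hf_8bit(ids.to(hf_8bit.device)).logits[0, -1]

# Path 2: FastChat's own 8-bit path (assumed API, used by --load-8bit)
from fastchat.model.compression import compress_module  # assumption; verify against your version

fc_model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.float16
).cuda()
compress_module(fc_model, "cuda")  # assumed signature: (module, target_device)
with torch.no_grad():
    fc_logits = fc_model(ids.cuda()).logits[0, -1]

# The last-token logits differ between the two implementations
print("max abs logit difference:", (hf_logits.cpu() - fc_logits.cpu()).abs().max().item())
```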

Hope to hear from you soon.

abcbdf commented 1 year ago

I also want to know.