Open · Sissel-Wu opened this issue 1 year ago
Hi all, thanks a lot for the nice work introducing Vicuna and FastChat.
I am a beginner in NLP (so correct me if I am wrong) and use GPUs with limited memory, so I would like to train and run inference with 8-bit quantization.
I learned that Hugging Face `transformers` ships built-in 8-bit quantization, which can be enabled simply by setting `load_in_8bit=True` (https://huggingface.co/docs/transformers/v4.28.1/main_classes/quantization). It claims nearly zero performance degradation (https://huggingface.co/blog/hf-bitsandbytes-integration), which sounds great for users with limited resources.
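For reference, a minimal sketch of what that built-in path looks like. This is an assumption-laden illustration, not FastChat's code: the helper function and the checkpoint name in the usage comment are mine, and it assumes `transformers` with `bitsandbytes` support and a CUDA GPU are available.

```python
def load_8bit(model_name: str):
    """Load a causal LM with Hugging Face's built-in 8-bit quantization.

    Assumes `transformers` and `bitsandbytes` are installed; the import is
    deferred so this sketch can be read/defined without them.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        load_in_8bit=True,   # enables the bitsandbytes LLM.int8() path
        device_map="auto",   # spreads layers across available GPUs/CPU
    )
    return tokenizer, model

# usage (checkpoint name is illustrative):
# tokenizer, model = load_8bit("lmsys/vicuna-7b-v1.5")
```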
However, I noticed that FastChat implements its own quantization. Is there a reason for this choice? I have compared your implementation with Hugging Face's, and the computation results differ.
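To illustrate why two correct 8-bit implementations can still disagree numerically, here is a toy, pure-Python absmax quantizer (my own sketch, unrelated to either codebase): different scaling granularity (per-tensor vs. per-row) and rounding choices shift the small quantization errors differently, so outputs need not match bit-for-bit.

```python
def quantize_absmax(weights):
    """Symmetric per-tensor absmax quantization to int8 (toy example)."""
    scale = max(abs(w) for w in weights) / 127.0  # map largest |w| to 127
    q = [round(w / scale) for w in weights]       # rounding loses information
    return q, scale

def dequantize(q, scale):
    """Map int8 values back to (approximate) floats."""
    return [x * scale for x in q]

weights = [0.12, -0.5, 0.33, 1.0]
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)
# `restored` is close to, but not exactly equal to, `weights`; a different
# scheme (e.g. per-row scales or another rounding rule) would land on
# slightly different values, hence differing computation results.
```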
Hope to hear from you soon.
I'd also like to know.