intel / auto-round

Advanced quantization algorithm for LLMs/VLMs. This is the official implementation of "Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs"
https://arxiv.org/abs/2309.05516
Apache License 2.0

Some models like Qwen overflow/underflow with the CUDA kernel #323

Closed wenhuach21 closed 1 week ago

wenhuach21 commented 2 weeks ago

The CUDA kernel only supports FP16, while the maximum values in some Qwen layers are very large and exceed the FP16 range.
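The FP16 range limit can be illustrated with a small sketch (using NumPy here for simplicity; the same applies to FP16 tensors on CUDA): any magnitude above roughly 65504 overflows to infinity when cast to half precision.

```python
import numpy as np

# FP16 can represent magnitudes only up to ~65504.
fp16_max = np.finfo(np.float16).max

# Values beyond that range overflow to +/-inf when cast down from FP32.
large = np.array([1.0e5, -1.2e5], dtype=np.float32)
cast = large.astype(np.float16)

print(fp16_max)  # 65504.0
print(cast)      # [ inf -inf]
```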

wenhuach21 commented 2 weeks ago

Workaround: try different quantization configs and enable use_clip in the AutoRound kernel.

wenhuach21 commented 1 week ago

Resolved by falling back these layers (e.g., keeping them in higher precision instead of running them through the FP16 kernel).
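One way to decide which layers need such a fallback is to scan the weights for magnitudes that would not survive the FP16 cast. A minimal sketch, with hypothetical layer names and a made-up `layers_to_fall_back` helper (not part of the auto-round API):

```python
import numpy as np

FP16_MAX = float(np.finfo(np.float16).max)  # ~65504

def layers_to_fall_back(named_weights, margin=1.0):
    """Return names of layers whose max weight magnitude would overflow FP16."""
    risky = []
    for name, w in named_weights.items():
        if float(np.abs(w).max()) * margin > FP16_MAX:
            risky.append(name)
    return risky

# Hypothetical weights: one layer with an extreme value, one well-behaved.
weights = {
    "model.layers.0.mlp.down_proj": np.array([7.0e4, -1.0], dtype=np.float32),
    "model.layers.1.mlp.down_proj": np.array([0.5, -0.3], dtype=np.float32),
}
print(layers_to_fall_back(weights))  # ['model.layers.0.mlp.down_proj']
```

In practice the same scan would run over a model's state dict before quantization, and any flagged layers would be excluded from the FP16 CUDA kernel.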