guillemram97 opened this issue 1 week ago
Hi @guillemram97, thanks for reporting this issue 😊. Indeed, it looks like a bug in how we load quantized models on the accelerate side. We are currently working on a fix to improve these edge cases. You can refer to the PR linked to this issue if you want to understand the details.
System Info
Hardware: Amazon Linux EC2 instance with 8× NVIDIA A10G GPUs (23 GB each)
Who can help?
@muellerz @SunMarc @MekkCyber
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
However, if I load the model without the quantization_config, there is no issue at all:
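The reproduction snippet itself did not survive extraction. Below is a hedged sketch of what the two loads likely look like; the checkpoint name and the 4-bit bitsandbytes settings are assumptions, not from the original report, and running it requires the multi-GPU machine described above:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Placeholder checkpoint; the actual model is not named in this extract.
checkpoint = "meta-llama/Llama-2-7b-hf"

# Failing case: with a quantization_config, the weights reportedly end up
# (mostly) on the last of the 8 GPUs instead of being balanced.
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    device_map="auto",
    quantization_config=bnb_config,
)
print(model.hf_device_map)  # shows which device each module landed on

# Working case: without quantization_config, device_map="auto" spreads
# the model across the GPUs as expected.
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    device_map="auto",
    torch_dtype=torch.float16,
)
print(model.hf_device_map)
```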
Expected behavior
The model is (mostly) loaded onto the last GPU, whereas I would expect it to be spread across all the GPUs. Moreover, infer_auto_device_map does not seem to be working. I have run into this same issue on different hardware.