huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

Add ignore_unexpected_keys arg to load_checkpoint_in_model() #2880

Closed Qubitium closed 3 months ago

Qubitium commented 3 months ago

What does this PR do?

Avoid flooding the console with warnings when loading a quantized model whose layers have been replaced by QuantLinear, as in the GPTQModel library (a fork of AutoGPTQ). AutoGPTQ has the same problem.

This is not a bug fix but an end-user usability fix. For example, when GPTQModel.from_quantized() loads a quantized Qwen2MoE model, which has a massive number of layers and experts, hundreds to thousands of lines of unexpected_keys warnings are pushed to the console/log.

To fix this, I added an ignore_unexpected_keys argument to the loader method (see the sketch below). I'm not sure this was the best way to do it; let me know if there is a better approach.
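
For context, here is a minimal sketch of the gating this flag adds, assuming a much-simplified loader; the real load_checkpoint_in_model() in accelerate.utils.modeling takes many more arguments (device_map, dtype, offload options) and gathers unexpected keys across checkpoint shards:

```python
import logging

import torch
import torch.nn as nn

logger = logging.getLogger(__name__)


def load_checkpoint_sketch(model, state_dict, ignore_unexpected_keys=False):
    """Simplified stand-in for accelerate's load_checkpoint_in_model()."""
    model_keys = set(model.state_dict().keys())
    unexpected_keys = [k for k in state_dict if k not in model_keys]
    if unexpected_keys and not ignore_unexpected_keys:
        # This is the kind of warning that repeats for every replaced layer
        # of a quantized MoE model, flooding the console.
        logger.warning(
            "Some weights of the checkpoint were not used: %s", unexpected_keys
        )
    # Load only the tensors the model skeleton actually declares.
    model.load_state_dict(
        {k: v for k, v in state_dict.items() if k in model_keys}, strict=False
    )


# Demo: an extra quantization tensor (a qweight/qzeros/scales-style key)
# produces no warning when the flag is set.
model = nn.Linear(4, 4)
ckpt = dict(model.state_dict(), **{"qzeros": torch.zeros(1)})
load_checkpoint_sketch(model, ckpt, ignore_unexpected_keys=True)
```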

This needs to be fixed because users think it is a bug. It is not, but the warning verbosity is so extreme that it becomes a bug from the user's perspective. Imagine being a user presented with several hundred lines of warnings in your terminal.

For a quantized model, the warnings should not be there in the first place and should only be printed in debug mode. This toggle allows that manual control.
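
Hypothetically, if this PR were merged, a quantization library could forward the flag like so; the checkpoint path below is a placeholder, and ignore_unexpected_keys is the proposed argument, not part of accelerate's actual API:

```python
import torch.nn as nn
from accelerate.utils import load_checkpoint_in_model

model = nn.Linear(8, 8)  # stand-in for the skeleton with QuantLinear layers

load_checkpoint_in_model(
    model,
    checkpoint="qwen2-moe-gptq/model.safetensors",  # placeholder path
    device_map={"": "cpu"},
    ignore_unexpected_keys=True,  # proposed flag from this PR (never merged)
)
```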

TEST

@muellerzr @BenjaminBossan @SunMarc

Qubitium commented 3 months ago

@SunMarc

Sorry. It turns out a bug on our end was triggering these warnings; the warning flow itself is correct. Closing this as unneeded.