huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

TypeError: Bnb4BitHfQuantizer.create_quantized_param() missing 1 required positional argument: 'unexpected_keys' #29193

Closed seungwoos closed 4 months ago

seungwoos commented 5 months ago

System Info

Who can help?

No response

Information

Tasks

Reproduction

I'm using the Mistral-7B-v0.1 model to fine-tune on my dataset. When I tried to work with QLoRA, I ran into problems with quantization:

  1. When using bitsandbytes quantization from the latest release, it raises a TypeError. I suspect the following lines should pass the unexpected_keys argument (a minimal sketch of the failing call follows after this list). https://github.com/huggingface/transformers/blob/2a9b1f80c45cab19b542bc7cc004937d39d6f6fb/src/transformers/modeling_utils.py#L3737 https://github.com/huggingface/transformers/blob/1a77f07f6556b1482bd5e5f8399aa528727d1b47/src/transformers/modeling_utils.py#L3924

  2. After I manually changed the function to take the argument, I observed that all the weights get quantized, including the embedding layer. These lines do not seem to check whether the weights are quantizable https://github.com/huggingface/transformers/blob/2a9b1f80c45cab19b542bc7cc004937d39d6f6fb/src/transformers/modeling_utils.py#L3916-L3924 the way transformers.integrations.bitsandbytes.py does. https://github.com/huggingface/transformers/blob/2a9b1f80c45cab19b542bc7cc004937d39d6f6fb/src/transformers/integrations/bitsandbytes.py#L66-L67 Would it be necessary to add another condition using the following function? https://github.com/huggingface/transformers/blob/2a9b1f80c45cab19b542bc7cc004937d39d6f6fb/src/transformers/quantizers/quantizer_bnb_4bit.py#L118-L132
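
To make point 1 concrete, here is a minimal, self-contained sketch of the failure mode (the class and call are illustrative stand-ins, not the actual transformers code; only the argument mismatch matters):

class DummyBnb4BitQuantizer:
    # The real quantizer declares unexpected_keys as a required positional argument.
    def create_quantized_param(self, model, param_value, param_name,
                               target_device, state_dict, unexpected_keys):
        pass

quantizer = DummyBnb4BitQuantizer()
# The problematic call site passes one argument too few, so Python raises:
# TypeError: create_quantized_param() missing 1 required positional argument: 'unexpected_keys'
quantizer.create_quantized_param(None, None, "model.embed_tokens.weight", "cpu", {})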

Expected behavior

All the target (linear) layers should be quantized.
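
If it helps for checking this expected behavior, here is a small sketch (standard transformers / bitsandbytes usage; the model id is the one mentioned above) that prints which modules end up quantized, so you can verify the embedding keeps its original class:

import torch
import bitsandbytes as bnb
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)

# Only nn.Linear modules should have been replaced by bnb.nn.Linear4bit.
quantized = [n for n, m in model.named_modules() if isinstance(m, bnb.nn.Linear4bit)]
print(f"{len(quantized)} quantized linear layers")
# The embedding should keep its original class (torch.nn.Embedding):
print(type(model.get_input_embeddings()))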

younesbelkada commented 5 months ago

Hi @seungwoos ! Thanks for the issue! Can you send me a small reproducer so I can repro the issue? 🙏 I'll have a look

seungwoos commented 4 months ago

@younesbelkada Thanks for your reply. I'm working with axolotl, and this example raised the same issue. While working in a new virtual environment with the latest transformers package, where quantized parameter creation is now handled by hf_quantizer, the error was raised.

Would you mind checking the repository and running the QLoRA example as follows?

accelerate launch -m axolotl.cli.train examples/mistral/qlora.yml --deepspeed deepspeed_configs/zero2.json
seungwoos commented 4 months ago

Hmm, I guess the problem arose when applying quantization on multi-GPU with mixed accelerate configs.

chenmoneygithub commented 4 months ago

@seungwoos I am seeing a similar issue when using FSDP. After some debugging, this is the problematic code: link, which omits the required argument unexpected_keys as defined here. My accelerate config is just a normal FSDP config:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: true
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Could you help take another look? Every FSDP workflow is now broken on my side when launched with accelerate in a multi-GPU scenario, thank you!

Transformers version: 4.38

seungwoos commented 4 months ago

@chenmoneygithub Hi, I guess the current implementation in hf_quantizer does not properly support FSDP yet. Not only is the unexpected_keys argument missing from the hf_quantizer.create_quantized_param() call, there also seems to be no condition checking whether the parameter belongs to a bnb linear component.

If it is an urgent case, I confirmed that changing this line to

from .integrations import set_module_quantized_tensor_to_device
# Pre-hf_quantizer behaviour: materialize an empty placeholder for this key on CPU
set_module_quantized_tensor_to_device(model_to_load, key, "cpu", torch.empty(*param.size(), dtype=dtype))

works as the previous transformers versions did.

Or, just changing the accelerate config to fsdp_cpu_ram_efficient_loading: false also works in my environment.

Hope this helps :)

seungwoos commented 4 months ago

@younesbelkada Currently, the hf_quantizer.create_quantized_param() calls on the lines below should also be passed the unexpected_keys argument. https://github.com/huggingface/transformers/blob/831bc25d8fdb85768402f772cf65cc3d7872b211/src/transformers/modeling_utils.py#L3930 https://github.com/huggingface/transformers/blob/831bc25d8fdb85768402f772cf65cc3d7872b211/src/transformers/modeling_utils.py#L3743
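
For concreteness, a rough sketch of what forwarding the argument at those call sites would look like (variable names approximate what modeling_utils.py has in scope there; this is not an exact diff):

# current call, which raises the TypeError because the bnb quantizer requires the argument:
hf_quantizer.create_quantized_param(model, param, param_name, param_device, state_dict)

# suggested call: also forward the unexpected_keys list that the loading loop already tracks
hf_quantizer.create_quantized_param(model, param, param_name, param_device, state_dict, unexpected_keys)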

Also, there seem to be some discrepancies in the bnb quantizer between transformers.integrations.bitsandbytes.py and transformers.quantizers.quantizer_bnb_{4,8}bit.py. Would you mind checking a PR later, after I've spent some time working on it?

Thanks :)

younesbelkada commented 4 months ago

Hi @seungwoos I made #29420 - let me know if this fixes the issue ! I think we should make the fix as simple as possible by just making unexpected_keys optional
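
For reference, one way that approach could look (a simplified sketch based on the signature implied by the error, not the actual diff in the PR):

from typing import Any, Dict, List, Optional

def create_quantized_param(
    self,
    model,
    param_value,
    param_name: str,
    target_device,
    state_dict: Dict[str, Any],
    unexpected_keys: Optional[List[str]] = None,  # optional, so call sites that don't track it keep working
):
    # ... create the bnb 4-bit parameter on target_device as before ...
    # Guard the bookkeeping so a missing list is simply skipped (illustrative):
    if unexpected_keys is not None and param_name in unexpected_keys:
        unexpected_keys.remove(param_name)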

seungwoos commented 4 months ago

@younesbelkada Thanks for your quick response. Actually, I found that setting fsdp_cpu_ram_efficient_loading: false in the accelerate config makes everything work; there is no need to pass unexpected_keys. I guess it might depend on the system (mine ran out of RAM), but your PR should prevent unexpected problems for other users.

Thanks !

younesbelkada commented 4 months ago

OK makes sense, thanks !