huggingface / peft

🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.
https://huggingface.co/docs/peft
Apache License 2.0

Gradient not applicable to 4-bit quantization (sft-qlora-fsdp) #1923

Closed NotTheStallion closed 1 month ago

NotTheStallion commented 2 months ago

System Info

accelerate 0.31.0
peft 0.11.1
transformers 4.42.4
bitsandbytes 0.41.1

The following packages are not directly related to the error but are listed for reproducibility.

torch 2.3.1
nvidia-nccl-cu12 2.20.5  # used as an accelerate backend

Who can help?

@BenjaminBossan @sayakpaul

Information

Tasks

Reproduction

Reproducibility

To reproduce the error, run the bash script at `peft/examples/sft/run_peft_qlora_fsdp.sh` with the same config and code.

Error

The following is the error I get when I try fine-tuning with QLoRA.

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/ubuntu/PentestLLM/_local_cuisine/train.py", line 163, in <module>
[rank0]:     main(model_args, data_args, training_args)
[rank0]:   File "/home/ubuntu/PentestLLM/_local_cuisine/train.py", line 107, in main
[rank0]:     model, peft_config, tokenizer = create_and_prepare_model(model_args, data_args, training_args)
[rank0]:   File "/home/ubuntu/PentestLLM/_local_cuisine/utils.py", line 131, in create_and_prepare_model
[rank0]:     model = AutoModelForCausalLM.from_pretrained(
[rank0]:   File "/home/ubuntu/Myenv/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
[rank0]:     return model_class.from_pretrained(
[rank0]:   File "/home/ubuntu/Myenv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3838, in from_pretrained
[rank0]:     ) = cls._load_pretrained_model(
[rank0]:   File "/home/ubuntu/Myenv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4298, in _load_pretrained_model
[rank0]:     new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
[rank0]:   File "/home/ubuntu/Myenv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 904, in _load_state_dict_into_meta_model
[rank0]:     value = type(value)(value.data.to("cpu"), **value.__dict__)
[rank0]:   File "/home/ubuntu/Myenv/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 147, in __new__
[rank0]:     self = torch.Tensor._make_subclass(cls, data, requires_grad)
[rank0]: RuntimeError: Only Tensors of floating point and complex dtype can require gradients
E0712 21:00:45.048000 136025371594752 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 37026) of binary: /home/ubuntu/Myenv/bin/python3
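
For context (this note is not part of the original report): the RuntimeError at the bottom of the traceback is PyTorch's generic check that only floating point and complex tensors can require gradients, while 4-bit quantized weights are packed into integer storage. A minimal sketch that reproduces the same error in isolation:

import torch

# 4-bit quantized weights are packed into integer storage (uint8);
# asking such a tensor to track gradients raises the same RuntimeError
# as the one shown in the traceback above.
data = torch.zeros(10, dtype=torch.uint8)
try:
    data.requires_grad_(True)
except RuntimeError as err:
    print(err)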

Attempted solutions

I tried:

Expected behavior

I expected my code to run without any error :stuck_out_tongue_winking_eye: and to do PEFT SFT-QLoRA-FSDP fine-tuning on any model given to it.

BenjaminBossan commented 2 months ago

I just tried this script with some minor modifications and could run it successfully (2x 4090, same package versions):

3c3
< --model_name_or_path "meta-llama/Llama-2-70b-hf" \
---
> --model_name_or_path "meta-llama/Llama-2-7b-hf" \
9c9
< --max_seq_len 2048 \
---
> --max_seq_len 256 \
16,18d15
< --push_to_hub \
< --hub_private_repo True \
< --hub_strategy "every_save" \
26,28c23,25
< --output_dir "llama-sft-qlora-fsdp" \
< --per_device_train_batch_size 2 \
< --per_device_eval_batch_size 2 \
---
> --output_dir "/tmp/peft/llama-sft-qlora-fsdp" \
> --per_device_train_batch_size 1 \
> --per_device_eval_batch_size 1 \

Could you check if it works for you?

Also, please give more details: What is your accelerate env, what machine are you using?

NotTheStallion commented 2 months ago

The proposed modifications didn't solve the problem.

Accelerate env output:

  • Accelerate version: 0.31.0
  • Platform: Linux-6.5.0-41-generic-x86_64-with-glibc2.35
  • accelerate bash location: /home/ubuntu/Myenv/bin/accelerate
  • Python version: 3.10.12
  • Numpy version: 1.26.4
  • PyTorch version (GPU?): 2.3.1+cu121 (True)
  • PyTorch XPU available: False
  • PyTorch NPU available: False
  • PyTorch MLU available: False
  • System RAM: 31.08 GB
  • GPU type: NVIDIA GeForce RTX 3060
  • Accelerate default config:
    • compute_environment: LOCAL_MACHINE
    • distributed_type: FSDP
    • mixed_precision: no
    • use_cpu: False
    • debug: False
    • num_processes: 1
    • machine_rank: 0
    • num_machines: 1
    • main_process_ip: main-node
    • main_process_port: 5000
    • rdzv_backend: static
    • same_network: True
    • main_training_function: main
    • enable_cpu_affinity: False
    • fsdp_config: {'fsdp_auto_wrap_policy': 'TRANSFORMER_BASED_WRAP', 'fsdp_backward_prefetch': 'BACKWARD_PRE', 'fsdp_cpu_ram_efficient_loading': True, 'fsdp_forward_prefetch': False, 'fsdp_offload_params': True, 'fsdp_sharding_strategy': 'FULL_SHARD', 'fsdp_state_dict_type': 'SHARDED_STATE_DICT', 'fsdp_sync_module_states': True, 'fsdp_use_orig_params': False}
    • downcast_bf16: no
    • tpu_use_cluster: False
    • tpu_use_sudo: False
    • tpu_env: []

The machine I am using has 32 GB of RAM and the following system information (output of `sudo dmidecode -t system`):

dmidecode 3.3

Getting SMBIOS data from sysfs. SMBIOS 3.4 present.

Handle 0x0100, DMI type 1, 27 bytes
System Information
    Manufacturer: Dell Inc.
    Product Name: Precision 3650 Tower
    Version: Not Specified
    Serial Number: DMGVXK3
    UUID: 4c4c4544-004d-4710-8056-c4c04f584b33
    Wake-up Type: Other
    SKU Number: 0A58
    Family: Precision

Handle 0x0C00, DMI type 12, 5 bytes
System Configuration Options
    Option 1: J6H1:1-X Boot with Default; J8H1:1-X BIOS RECOVERY

Handle 0x2000, DMI type 32, 11 bytes
System Boot Information
    Status: No errors detected

Additional information:

I would like to add that when I test the code, I know the LLM won't fit in my GPU; however, from testing `run_peft_fsdp.sh` I know that the CUDA out of memory error appears only after a few shards of the model have been loaded successfully. (This information is given to justify my choice of testing on one machine first.)

When I am testing on one machine with a single GPU, I change the following lines in the config.

num_machines: 1  # one machine
num_processes: 1  # one GPU

BenjaminBossan commented 1 month ago

Hmm, strange that it works on my machine and not yours. Your accelerate env looks fine and is very similar to what I have. Could you try what happens if you deactivate PEFT, i.e. set --use_peft_lora False? Of course, we cannot train a quantized model without PEFT but since you mentioned that this is only about the loading step, it should be okay.

NotTheStallion commented 1 month ago

I set it to False, but that didn't change the error message. The problem is that I am not sure where the error is coming from: is it really a torch problem, or a silent error in the Auto class of transformers?

BenjaminBossan commented 1 month ago

Thanks for testing this; I suspected it was not PEFT-related. Ideally, you could try to remove all unnecessary code, i.e. everything not related to the model config and loading. That way, you can check whether it's really an issue with transformers/bitsandbytes. If you can do that, that would be great. Please share the code and I'll also give it a try.
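
For reference, a minimal sketch of what such an isolated script could look like (this is not the reporter's code; the model name, dtypes, and quantization arguments below are assumptions modeled on the QLoRA + FSDP example), launched with the same accelerate config as before:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Keep only the quantization config and the loading step, nothing else.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_storage=torch.bfloat16,  # storage dtype used by the QLoRA + FSDP example
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # assumed model; any causal LM should do
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)
print("model loaded")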

NotTheStallion commented 1 month ago

I found the solution to my error:

- bitsandbytes 0.41.1
+ bitsandbytes 0.43.1

I am not sure why it worked for you even though the version I specified earlier was the problematic one. It turns out I just had an incompatibility between libraries.
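
As a side note (not from the original thread), a quick sketch to confirm which versions are actually installed in the active environment:

import accelerate, bitsandbytes, peft, transformers

# Print the version of each library involved in the stack.
for mod in (accelerate, bitsandbytes, peft, transformers):
    print(mod.__name__, mod.__version__)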

Thank you for your assistance.

BenjaminBossan commented 1 month ago

Great that you could find the issue. Strange that I had no problem. Anyway, I'll close the issue for now; feel free to re-open if something else comes up.