I just tried this script with some minor modifications and could run it successfully (2x 4090, same package versions):
```diff
3c3
< --model_name_or_path "meta-llama/Llama-2-70b-hf" \
---
> --model_name_or_path "meta-llama/Llama-2-7b-hf" \
9c9
< --max_seq_len 2048 \
---
> --max_seq_len 256 \
16,18d15
< --push_to_hub \
< --hub_private_repo True \
< --hub_strategy "every_save" \
26,28c23,25
< --output_dir "llama-sft-qlora-fsdp" \
< --per_device_train_batch_size 2 \
< --per_device_eval_batch_size 2 \
---
> --output_dir "/tmp/peft/llama-sft-qlora-fsdp" \
> --per_device_train_batch_size 1 \
> --per_device_eval_batch_size 1 \
```
Could you check if it works for you?
Also, please give more details: what is your `accelerate env` output, and what machine are you using?
The proposed modification didn't solve the problem.
`accelerate env` output:
- `Accelerate` version: 0.31.0
- Platform: Linux-6.5.0-41-generic-x86_64-with-glibc2.35
- `accelerate` bash location: /home/ubuntu/Myenv/bin/accelerate
- Python version: 3.10.12
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.3.1+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 31.08 GB
- GPU type: NVIDIA GeForce RTX 3060
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: FSDP
- mixed_precision: no
- use_cpu: False
- debug: False
- num_processes: 1
- machine_rank: 0
- num_machines: 1
- main_process_ip: main-node
- main_process_port: 5000
- rdzv_backend: static
- same_network: True
- main_training_function: main
- enable_cpu_affinity: False
- fsdp_config: {'fsdp_auto_wrap_policy': 'TRANSFORMER_BASED_WRAP', 'fsdp_backward_prefetch': 'BACKWARD_PRE', 'fsdp_cpu_ram_efficient_loading': True, 'fsdp_forward_prefetch': False, 'fsdp_offload_params': True, 'fsdp_sharding_strategy': 'FULL_SHARD', 'fsdp_state_dict_type': 'SHARDED_STATE_DICT', 'fsdp_sync_module_states': True, 'fsdp_use_orig_params': False}
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
The machine I am using has 32 GB of RAM and the following system information (`sudo dmidecode -t system`):
```
# dmidecode 3.3
Getting SMBIOS data from sysfs.
SMBIOS 3.4 present.

Handle 0x0100, DMI type 1, 27 bytes
System Information
	Manufacturer: Dell Inc.
	Product Name: Precision 3650 Tower
	Version: Not Specified
	Serial Number: DMGVXK3
	UUID: 4c4c4544-004d-4710-8056-c4c04f584b33
	Wake-up Type: Other
	SKU Number: 0A58
	Family: Precision

Handle 0x0C00, DMI type 12, 5 bytes
System Configuration Options
	Option 1: J6H1:1-X Boot with Default; J8H1:1-X BIOS RECOVERY

Handle 0x2000, DMI type 32, 11 bytes
System Boot Information
	Status: No errors detected
```
I would like to add that when I am testing the code, I know the LLM won't fit in my GPU, but from testing `run_peft_fsdp.sh` I know that the `CUDA out of memory` error appears only after a few shards of the model have been loaded successfully. (This information is given to justify my choice of testing on one machine first.)
When I am testing on one machine with a single GPU, I change the following lines in the config:
```yaml
num_machines: 1  # one machine
num_processes: 1  # one GPU
```
Hmm, strange that it works on my machine and not yours. Your `accelerate env` looks fine and is very similar to mine. Could you try what happens if you deactivate PEFT, i.e. set `--use_peft_lora False`? Of course, we cannot train a quantized model without PEFT, but since you mentioned that this is only about the loading step, it should be okay.
I set it to `False`, but that didn't change the error message. The problem is that I am not sure where the error is coming from: is it really a `torch` problem, or a silent error in the `Auto` class of `transformers`?
Thanks for testing this; I suspected it was not PEFT-related. Ideally, you could try to remove all unnecessary code, i.e. everything not related to the model config and loading. That way, you could check whether it's really an issue with transformers/bitsandbytes. If you can do that and check whether it works, that would be great. Please share the code and I'll also give it a try.
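For reference, a stripped-down, loading-only script could look something like the sketch below; the model name and 4-bit settings are just assumptions based on the modified script above, not necessarily the exact failing configuration:

```python
# Minimal reproduction sketch: only the quantization config and the model load,
# no PEFT, no trainer. The model name and 4-bit settings below are assumptions
# based on the modified script above -- adjust them to match the failing run.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)
print(model.config)  # reaching this line means the loading step succeeded
```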
I found the solution to my error:
```diff
- bitsandbytes 0.41.1
+ bitsandbytes 0.43.1
```
I am not sure why it worked for you even though the version I specified earlier was wrong. It turns out I just had a compatibility mismatch between libraries.
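For anyone who runs into the same thing, here is a quick sketch to print the versions of the libraries that have to play well together (the package list simply mirrors what is reported in this thread):

```python
# Print the installed versions of the packages involved in QLoRA + FSDP training,
# so that incompatible combinations (like the bitsandbytes mismatch above) are
# easy to spot before launching a run.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("torch", "transformers", "accelerate", "peft", "bitsandbytes"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg} is not installed")
```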
Thank you for your assistance.
Great that you could find the issue. Strange that I had no problem, though. Anyway, I'll close the PR for now; feel free to re-open if something else comes up.
System Info
```
accelerate    0.31.0
peft          0.11.1
transformers  4.42.4
bitsandbytes  0.41.1
```
The following packages are not directly related to the error but are listed for reproducibility:
```
torch             2.3.1
nvidia-nccl-cu12  2.20.5  # used as the accelerate backend
```
Who can help?
@BenjaminBossan @sayakpaul
Information
Tasks
An officially supported task in the `examples` folder

Reproduction
Reproducibility
To reproduce the error, run the bash script at `peft/examples/sft/run_peft_qlora_fsdp.sh` with the same config and code.
Error
The following error is what I get when I try fine-tuning with QLoRA.
Attempted solutions
I tried:
Expected behavior
I expected my code not to show any error :stuck_out_tongue_winking_eye: and to do PEFT SFT QLoRA FSDP fine-tuning on any model given to it.