yufengzhe1 opened this issue 1 year ago
Traceback (most recent call last):
File "/data/falcontune-main/falcontune/run.py", line 93, in
@rmihaylov
I am running into the same issue when trying to finetune with LoRA on multiple GPUs. It works fine if I apply LoRA only to target_modules = ["query_key_value"], but as soon as I apply it to other layers as well, the same error appears.
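For reference, extending LoRA beyond the attention projection looks roughly like the sketch below in PEFT terms. The module names are the standard Falcon linear layers; whether falcontune builds its config exactly this way is an assumption.

```python
# Minimal sketch (not falcontune's internal code) of a PEFT LoraConfig that
# targets more than the fused attention projection on Falcon.
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    # "query_key_value" is the fused attention projection; "dense" is the
    # attention output, and "dense_h_to_4h" / "dense_4h_to_h" are the MLP
    # projections exposed by the Falcon architecture.
    target_modules=["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"],
)
```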
I have a multi-GPU setup with A100 40GB cards and I am getting the same problem. Here is the command I am using:
falcontune finetune --model=falcon-40b --weights=tiiuae/falcon-40b --dataset=./alpaca_data_cleaned.json --data_type=alpaca --lora_out_dir=./falcon-40b-alpaca/ --mbatch_size=1 --batch_size=16 --epochs=3 --lr=3e-4 --cutoff_len=256 --lora_r=8 --lora_alpha=16 --lora_dropout=0.05 --warmup_steps=5 --save_steps=100 --save_total_limit=1 --logging_steps=5 --target_modules='["query_key_value"]'
I have set WORLD_SIZE=8 as an environment variable.
How do we solve this? It is preventing me from using this library for fine-tuning.
I tried to run it using torchrun, as mentioned here. The command I tried is the following:
OMP_NUM_THREADS=8 WORLD_SIZE=8 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/lib64:/usr/lib/x86_64-linux-gnu torchrun --nproc_per_node=8 --master_port=1234 falcontune/run.py finetune --model=falcon-40b --weights=tiiuae/falcon-40b --dataset=./alpaca_data_cleaned.json --data_type=alpaca --lora_out_dir=./falcon-40b-alpaca/ --mbatch_size=1 --batch_size=16 --epochs=3 --lr=3e-4 --cutoff_len=256 --lora_r=8 --lora_alpha=16 --lora_dropout=0.05 --warmup_steps=5 --save_steps=100 --save_total_limit=1 --logging_steps=5 --target_modules='["query_key_value"]'
This throws a CUDA OOM error. How can I run it in a distributed setting?
Please help.
Reduce the batch size.
However, is the multi-GPU setting actually working?
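For context on "reduce the batch size": in alpaca-lora-style scripts, --mbatch_size is the per-device micro-batch and --batch_size the effective batch per optimizer step, with the ratio turned into gradient accumulation steps. The sketch below shows that arithmetic; assuming falcontune follows the same convention, lowering --mbatch_size (or --cutoff_len) is what actually reduces peak memory.

```python
# Hypothetical illustration of the usual mbatch_size / batch_size mapping in
# alpaca-lora-style trainers; falcontune may differ in detail.
import os

batch_size = 16    # --batch_size: effective batch per optimizer step
mbatch_size = 1    # --mbatch_size: micro-batch kept in GPU memory at once

grad_accum = batch_size // mbatch_size
world_size = int(os.environ.get("WORLD_SIZE", 1))
if world_size > 1:
    # Under DDP each rank already contributes mbatch_size samples per step,
    # so accumulation is divided by the number of ranks.
    grad_accum = max(grad_accum // world_size, 1)

print(f"per_device_train_batch_size={mbatch_size}, "
      f"gradient_accumulation_steps={grad_accum}")
```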
Will multiple GPUs work? Has anyone been able to use this with 2 GPUs? I ask because, if the 40B model only requires 40GB of VRAM, I would assume (but could be wrong) that 2x 3090s or 2x 4090s should work.
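On the 2x 24 GB question: if the quantized 40B model needs roughly 40 GB of VRAM in total, a single 3090/4090 will not hold it, so the weights would have to be sharded across both cards with a device map. A rough sketch using transformers/accelerate is below; the memory caps and 4-bit loading are assumptions, and falcontune has its own loader, so treat this only as an illustration of device-map sharding.

```python
# Rough sketch (not falcontune's loader): shard Falcon-40B across two 24 GB
# GPUs using accelerate's automatic device map with 4-bit weights.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-40b",
    device_map="auto",                    # let accelerate place layers per GPU
    max_memory={0: "22GiB", 1: "22GiB"},  # leave headroom below the 24 GB limit
    load_in_4bit=True,                    # bitsandbytes 4-bit quantization
    trust_remote_code=True,               # Falcon shipped custom modeling code
)
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-40b")
```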
How do we solve this?