rmihaylov / falcontune

Tune any FALCON in 4-bit
Apache License 2.0

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! #30

yufengzhe1 opened this issue 1 year ago

yufengzhe1 commented 1 year ago

How do I solve this?

yufengzhe1 commented 1 year ago

Traceback (most recent call last):
  File "/data/falcontune-main/falcontune/run.py", line 93, in <module>
    main()
  File "/data/falcontune-main/falcontune/run.py", line 89, in main
    args.func(args)
  File "/data/falcontune-main/falcontune/finetune.py", line 162, in finetune
    trainer.train()
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1664, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1940, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2735, in training_step
    loss = self.compute_loss(model, inputs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2767, in compute_loss
    outputs = model(**inputs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/peft/peft_model.py", line 827, in forward
    return self.base_model(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/data/falcontune-main/falcontune/model/falcon/model.py", line 1070, in forward
    transformer_outputs = self.transformer(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/falcontune-main/falcontune/model/falcon/model.py", line 965, in forward
    outputs = block(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/falcontune-main/falcontune/model/falcon/model.py", line 652, in forward
    mlp_output + attention_output, residual, self.config.hidden_dropout, training=self.training
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!
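The accelerate hooks in the traceback suggest the model is being sharded across GPUs with a device map, and the residual add at model.py line 652 is receiving activations that live on different cards. A minimal sketch for checking where the split landed, assuming the model was loaded with a device_map so that transformers populates hf_device_map (whether falcontune's loader sets that attribute is an assumption):

```python
# Minimal sketch: list which device each submodule was assigned to.
# Assumes the model was loaded with device_map="auto" (or similar), so
# transformers/accelerate attach an `hf_device_map` dict; if falcontune's
# loader does not set it, adapt this to iterate over model.named_parameters().
from collections import defaultdict

def summarize_device_map(model):
    per_device = defaultdict(list)
    for module_name, device in getattr(model, "hf_device_map", {}).items():
        per_device[str(device)].append(module_name)
    for device, modules in sorted(per_device.items()):
        print(f"{device}: {len(modules)} modules, first few: {modules[:3]}")

# summarize_device_map(model)  # call after loading, before trainer.train()
```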

yufengzhe1 commented 1 year ago

@rmihaylov

clechristophe commented 1 year ago

I am running into the same issue when trying to finetune with LoRA on multiple GPUs. It works well if I apply LoRA only to target_modules = ["query_key_value"], but as soon as I apply it to other layers as well, I hit the same error.
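For reference, a minimal sketch of what such a wider LoRA config might look like with PEFT; the non-attention module names are assumptions based on the Falcon block layout, not something confirmed by this repo, so check them against model.named_modules():

```python
# Minimal sketch of a LoraConfig that targets more than query_key_value.
# The non-attention module names are assumptions about the Falcon block
# layout; verify them with `print([n for n, _ in model.named_modules()])`.
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "query_key_value",  # attention QKV projection (the default target)
        "dense",            # attention output projection (assumed name)
        "dense_h_to_4h",    # MLP up-projection (assumed name)
        "dense_4h_to_h",    # MLP down-projection (assumed name)
    ],
)
```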

rcshubhadeep commented 1 year ago

I have a multi-GPU setup with A100 40GB cards and I am getting the same problem. Here is the command I am using:

falcontune finetune --model=falcon-40b --weights=tiiuae/falcon-40b --dataset=./alpaca_data_cleaned.json --data_type=alpaca --lora_out_dir=./falcon-40b-alpaca/ --mbatch_size=1 --batch_size=16 --epochs=3 --lr=3e-4 --cutoff_len=256 --lora_r=8 --lora_alpha=16 --lora_dropout=0.05 --warmup_steps=5 --save_steps=100 --save_total_limit=1 --logging_steps=5 --target_modules='["query_key_value"]'

I have set WORLD_SIZE=8 as an environment variable.

How do we solve this? It is preventing me from using this library to fine-tune anything.

I also tried running with torchrun as mentioned here; the command I used is the following:

OMP_NUM_THREADS=8 WORLD_SIZE=8 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/lib64:/usr/lib/x86_64-linux-gnu torchrun --nproc_per_node=8 --master_port=1234 falcontune/run.py finetune --model=falcon-40b --weights=tiiuae/falcon-40b --dataset=./alpaca_data_cleaned.json --data_type=alpaca --lora_out_dir=./falcon-40b-alpaca/ --mbatch_size=1 --batch_size=16 --epochs=3 --lr=3e-4 --cutoff_len=256 --lora_r=8 --lora_alpha=16 --lora_dropout=0.05 --warmup_steps=5 --save_steps=100 --save_total_limit=1 --logging_steps=5 --target_modules='["query_key_value"]'

This throws a CUDA OOM error. How can I run it in a distributed setting?

Please help
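One likely reason for the OOM: torchrun with --nproc_per_node=8 starts eight processes, and if each of them tries to load or shard the 40B model across the same cards, memory runs out quickly; the traceback above suggests falcontune's multi-GPU path relies on accelerate-style model sharding within a single process rather than a data-parallel launch. If you want to control how that single process spreads the model, a sketch along these lines can cap per-GPU usage (max_memory is a standard transformers/accelerate loading option, not a falcontune CLI flag, so treat this as an assumption about how you would wire it in):

```python
# Minimal sketch (not a falcontune CLI option): build an explicit max_memory
# budget so accelerate's device_map="auto" spreads the quantized model more
# evenly and leaves headroom for activations on every GPU.
import torch

def make_max_memory(fraction=0.8):
    budget = {}
    for i in range(torch.cuda.device_count()):
        total_gib = torch.cuda.get_device_properties(i).total_memory / 2**30
        budget[i] = f"{int(total_gib * fraction)}GiB"
    return budget

# e.g. AutoModelForCausalLM.from_pretrained(..., device_map="auto",
#                                           max_memory=make_max_memory())
```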

zepmck commented 1 year ago

Reduce the batch size.

However, is the multi-GPU setting working for you?
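As a rough guide to which flag actually saves memory (assumptions about how the flags map to training, not falcontune internals): --mbatch_size is already 1 in the command above, so the remaining levers are the sequence length and, to a lesser extent, the accumulation target.

```python
# Back-of-envelope (assumed semantics of the flags, not falcontune internals):
# per-step activation memory scales with the micro-batch and sequence length,
# while --batch_size is usually reached via gradient accumulation and adds
# little peak memory.
mbatch_size = 1          # --mbatch_size: sequences per forward pass
batch_size = 16          # --batch_size: effective batch via accumulation
cutoff_len = 256         # --cutoff_len: max tokens per sequence
grad_accum_steps = batch_size // mbatch_size
tokens_per_step = mbatch_size * cutoff_len
print(f"accumulation steps: {grad_accum_steps}, tokens per forward pass: {tokens_per_step}")
# Halving cutoff_len cuts activation memory roughly in half; lowering
# batch_size mainly reduces accumulation steps, not peak memory.
```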

RYANSTOBBE commented 1 year ago

Will multiple GPUs work? Has anyone been able to use this with 2 GPUs? I ask because if 40B only requires 40GB of VRAM, I would assume (but could be wrong) that 2x 3090s or 2x 4090s should work.
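A rough back-of-envelope, with all numbers assumed rather than measured: 40B parameters at 4 bits is roughly 19 GiB of frozen weights, but activations, the attention cache, LoRA parameters and optimizer state add on top, and the sharding also has to keep each individual card under its own limit.

```python
# Back-of-envelope VRAM estimate for 4-bit Falcon-40B (all figures assumed).
params = 40e9                              # 40B parameters
weights_gib = params * 0.5 / 2**30         # 4 bits = 0.5 bytes per parameter
overhead_gib = 10                          # rough guess: activations, LoRA, optimizer state
total_gib = weights_gib + overhead_gib
print(f"weights ~{weights_gib:.1f} GiB, total ~{total_gib:.1f} GiB")
# Two 24 GiB cards (2x3090 / 2x4090) give 48 GiB combined, so it may fit in
# principle, provided the split keeps each card below its own 24 GiB.
```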