rmihaylov / falcontune

Tune any FALCON in 4-bit
Apache License 2.0
468 stars 52 forks

TypeError: Input tensors need to be on the same GPU, but found the following tensor and device combinations #7

Open 631068264 opened 1 year ago

631068264 commented 1 year ago
Parameters:
-------config-------
dataset='./alpaca_data_zh_51k.json'
data_type='alpaca'
lora_out_dir='./falcon-40b-alpaca/'
lora_apply_dir=None
weights='tiiuae/falcon-40b'
target_modules=['query_key_value']

------training------
mbatch_size=1
batch_size=2
gradient_accumulation_steps=2
epochs=3
lr=0.0003
cutoff_len=256
lora_r=8
lora_alpha=16
lora_dropout=0.05
val_set_size=0.2
gradient_checkpointing=False
gradient_checkpointing_ratio=1
warmup_steps=5
save_steps=50
save_total_limit=3
logging_steps=5
checkpoint=False
skip=False
world_size=1
ddp=False
device_map='auto'
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: Tracking run with wandb version 0.15.3
wandb: W&B syncing is set to `offline` in this directory.  
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
The following columns in the training set don't have a corresponding argument in `PeftModelForCausalLM.forward` and have been ignored: token_type_ids, output, input, instruction. If token_type_ids, output, input, instruction are not expected by `PeftModelForCausalLM.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 40,943
  Num Epochs = 3
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 2
  Gradient Accumulation steps = 2
  Total optimization steps = 61,413
  Number of trainable parameters = 8,355,840
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
  0%|                                                                                                                                    | 0/61413 [00:00<?, ?it/s]You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py:318: UserWarning: MatMul8bitLt: inputs will be cast from torch.float32 to float16 during quantization
  warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")
wandb: Waiting for W&B process to finish... (failed 1).
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /data/home/yaokj5/dl/apps/falcon/wandb/offline-run-20230602_104725-mpuuxpn6
wandb: Find logs at: ./wandb/offline-run-20230602_104725-mpuuxpn6/logs
Traceback (most recent call last):
  File "/data/home/yaokj5/anaconda3/envs/falcon/bin/falcontune", line 33, in <module>
    sys.exit(load_entry_point('falcontune==0.1.0', 'console_scripts', 'falcontune')())
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/run.py", line 87, in main
    args.func(args)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/finetune.py", line 162, in finetune
    trainer.train()
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/transformers/trainer.py", line 1664, in train
    return inner_training_loop(
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/transformers/trainer.py", line 1940, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/transformers/trainer.py", line 2735, in training_step
    loss = self.compute_loss(model, inputs)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/transformers/trainer.py", line 2767, in compute_loss
    outputs = model(**inputs)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/peft/peft_model.py", line 678, in forward
    return self.base_model(
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/model/falcon/model.py", line 1076, in forward
    transformer_outputs = self.transformer(
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/model/falcon/model.py", line 971, in forward
    outputs = block(
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/model/falcon/model.py", line 639, in forward
    attn_outputs = self.self_attention(
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/model/falcon/model.py", line 491, in forward
    fused_qkv = self.query_key_value(hidden_states)  # [batch_size, seq_length, 3 x hidden_size]
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/peft/tuners/lora.py", line 698, in forward
    result = super().forward(x)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 388, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 559, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 397, in forward
    out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1759, in igemmlt
    is_on_gpu([A, B, out])
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/bitsandbytes/functional.py", line 390, in is_on_gpu
    raise TypeError(f'Input tensors need to be on the same GPU, but found the following tensor and device combinations:\n {[(t.shape, t.device) for t in tensors]}')
TypeError: Input tensors need to be on the same GPU, but found the following tensor and device combinations:
 [(torch.Size([256, 8192]), device(type='cuda', index=0)), (torch.Size([9216, 8192]), device(type='cuda', index=1)), (torch.Size([256, 9216]), device(type='cuda', index=0))]
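For context, the check that raises here (`is_on_gpu` in `bitsandbytes/functional.py`, visible in the last frames of the traceback) simply verifies that every tensor passed to the matmul lives on one device. A simplified sketch of that check, using stand-in tensor objects instead of real CUDA tensors so it runs anywhere:

```python
from dataclasses import dataclass


@dataclass
class FakeTensor:
    """Stand-in for a torch.Tensor, carrying only shape and device."""
    shape: tuple
    device: str


def check_same_device(tensors):
    """Simplified sketch of bitsandbytes' is_on_gpu() check:
    all input tensors must share a single device."""
    devices = {t.device for t in tensors}
    if len(devices) > 1:
        raise TypeError(
            "Input tensors need to be on the same GPU, but found the "
            "following tensor and device combinations: "
            f"{[(t.shape, t.device) for t in tensors]}"
        )


# The failing call above mixes cuda:0 activations with a weight shard
# that device_map='auto' placed on cuda:1:
tensors = [
    FakeTensor((256, 8192), "cuda:0"),   # hidden_states
    FakeTensor((9216, 8192), "cuda:1"),  # query_key_value weight
    FakeTensor((256, 9216), "cuda:0"),   # output buffer
]
try:
    check_same_device(tensors)
except TypeError as e:
    print(e)
```

With `device_map='auto'`, accelerate shards the 40B model's blocks across all visible GPUs, so an activation produced on one card can meet a weight stored on another, which this check rejects.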
codybum commented 1 year ago

The code does not appear to play well with multi-GPU setups. One way to work around this is to set the environment variable `CUDA_VISIBLE_DEVICES=<device number>` to the GPU you want to use.
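If set before CUDA is initialized, e.g. at the very top of the training script or in the shell before launching `falcontune`, only that card is visible to the process (a minimal sketch; the GPU id `0` is just an example):

```python
import os

# Must run before torch (or anything else that initializes CUDA) is
# imported. Only GPU 0 is then visible to the process, and it is
# renumbered as cuda:0, so device_map='auto' places everything on
# one device and the cross-GPU TypeError cannot occur.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # pick the GPU id you want
```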

I don't know what complications would be involved in making this code work properly with multiple GPUs.

Cody