huggingface / peft

🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.
https://huggingface.co/docs/peft
Apache License 2.0

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm) #1831

Closed. kevalshah90 closed this issue 4 months ago.

kevalshah90 commented 5 months ago

System Info

transformers 4.41.2
peft 0.11.1

Single T4 GPU.

I am implementing QLoRA to fine-tune Mistral-7B on a T4 GPU. I loaded the model with the quantized configuration; however, when I attempt to test the model I keep getting a runtime error related to device placement, even though I ensured that the model and inputs are on CUDA.

peft_model.device
device(type='cuda', index=0)

# Test
inputs = tokenizer("Do you have time", return_tensors="pt").input_ids.to(device)
print("Inputs:", inputs)
Inputs: tensor([[   1, 2378,  368,  506,  727]], device='cuda:0')
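
A quick way to check whether any submodule was silently offloaded (a diagnostic sketch, not part of the original report; peft_model.device only reflects the device of the first parameter):

# Diagnostic sketch (hypothetical): collect every device that the PEFT model's
# parameters and buffers actually live on.
devices = {p.device for p in peft_model.parameters()} | {b.device for b in peft_model.buffers()}
print(devices)  # anything besides {device(type='cuda', index=0)} means some weights were offloaded

# List modules whose weights are not on the GPU, if any.
offloaded = [name for name, p in peft_model.named_parameters() if p.device.type != "cuda"]
print(offloaded[:10])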

Who can help?

@sayakpaul @younesbelkada @BenjaminBossan

Reproduction

import torch
import bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_storage=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
)

# LoRA
config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.01,
    target_modules=["q_proj", "k_proj", "v_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(model, config)

inputs = tokenizer("Do you have time", return_tensors="pt").to("cuda:0")
print("Inputs:", inputs)

{'input_ids': tensor([[   1, 2378,  368,  506,  727]], device='cuda:0'),
 'attention_mask': tensor([[1, 1, 1, 1, 1]], device='cuda:0')}

with torch.no_grad():
    outputs = peft_model(**inputs)

print("Outputs:\n", outputs.logits)

Error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[44], line 2
      1 with torch.no_grad():
----> 2     outputs = peft_model(**inputs)
      4 print("Outputs:\n", outputs.logits)
      5 print("Outputs dimensions:", outputs.logits.shape) # shape: (batch_size, num_tokens, num_classes)

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/torch/nn/modules/module.py:1532, in Module._wrapped_call_impl(self, *args, **kwargs)
   1530     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1531 else:
-> 1532     return self._call_impl(*args, **kwargs)

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/torch/nn/modules/module.py:1541, in Module._call_impl(self, *args, **kwargs)
   1536 # If we don't have any hooks, we want to skip the rest of the logic in
   1537 # this function, and just call forward.
   1538 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1539         or _global_backward_pre_hooks or _global_backward_hooks
   1540         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1541     return forward_call(*args, **kwargs)
   1543 try:
   1544     result = None

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/peft/peft_model.py:1430, in PeftModelForCausalLM.forward(self, input_ids, attention_mask, inputs_embeds, labels, output_attentions, output_hidden_states, return_dict, task_ids, **kwargs)
   1428     with self._enable_peft_forward_hooks(**kwargs):
   1429         kwargs = {k: v for k, v in kwargs.items() if k not in self.special_peft_forward_args}
-> 1430         return self.base_model(
   1431             input_ids=input_ids,
   1432             attention_mask=attention_mask,
   1433             inputs_embeds=inputs_embeds,
   1434             labels=labels,
   1435             output_attentions=output_attentions,
   1436             output_hidden_states=output_hidden_states,
   1437             return_dict=return_dict,
   1438             **kwargs,
   1439         )
   1441 batch_size = _get_batch_size(input_ids, inputs_embeds)
   1442 if attention_mask is not None:
   1443     # concat prompt attention mask

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/torch/nn/modules/module.py:1532, in Module._wrapped_call_impl(self, *args, **kwargs)
   1530     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1531 else:
-> 1532     return self._call_impl(*args, **kwargs)

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/torch/nn/modules/module.py:1541, in Module._call_impl(self, *args, **kwargs)
   1536 # If we don't have any hooks, we want to skip the rest of the logic in
   1537 # this function, and just call forward.
   1538 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1539         or _global_backward_pre_hooks or _global_backward_hooks
   1540         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1541     return forward_call(*args, **kwargs)
   1543 try:
   1544     result = None

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/peft/tuners/tuners_utils.py:179, in BaseTuner.forward(self, *args, **kwargs)
    178 def forward(self, *args: Any, **kwargs: Any):
--> 179     return self.model.forward(*args, **kwargs)

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/accelerate/hooks.py:166, in add_hook_to_module.<locals>.new_forward(module, *args, **kwargs)
    164         output = module._old_forward(*args, **kwargs)
    165 else:
--> 166     output = module._old_forward(*args, **kwargs)
    167 return module._hf_hook.post_forward(module, output)

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py:1152, in MistralForCausalLM.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
   1139 outputs = self.model(
   1140     input_ids=input_ids,
   1141     attention_mask=attention_mask,
   (...)
   1148     return_dict=return_dict,
   1149 )
   1151 hidden_states = outputs[0]
-> 1152 logits = self.lm_head(hidden_states)
   1153 logits = logits.float()
   1155 loss = None

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/torch/nn/modules/module.py:1532, in Module._wrapped_call_impl(self, *args, **kwargs)
   1530     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1531 else:
-> 1532     return self._call_impl(*args, **kwargs)

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/torch/nn/modules/module.py:1541, in Module._call_impl(self, *args, **kwargs)
   1536 # If we don't have any hooks, we want to skip the rest of the logic in
   1537 # this function, and just call forward.
   1538 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1539         or _global_backward_pre_hooks or _global_backward_hooks
   1540         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1541     return forward_call(*args, **kwargs)
   1543 try:
   1544     result = None

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/torch/nn/modules/linear.py:116, in Linear.forward(self, input)
    115 def forward(self, input: Tensor) -> Tensor:
--> 116     return F.linear(input, self.weight, self.bias)

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)

Expected behavior

Output logits.

younesbelkada commented 5 months ago

Hi @kevalshah90, thanks for the issue! You might be hitting a corner case here. Can you try to force-set the model on a single GPU by passing device_map="cuda:0" to from_pretrained?

kevalshah90 commented 5 months ago

@younesbelkada I am still getting the same error.

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)

I added device_map:

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="cuda:0",
)

model

For reference, here is my bnb_config:

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_storage=torch.bfloat16,
)

bnb_config

kevalshah90 commented 5 months ago

@younesbelkada I tried downgrading accelerate from 0.30.0 to 0.29.0 and I am still getting the same error.

Any ideas or suggestions for debugging this issue? I am not sure why it would split the model across the GPU and CPU.

younesbelkada commented 5 months ago

@kevalshah90 Can you double-check that the inputs are on the GPU? Maybe doing inputs = tokenizer("Do you have time", return_tensors="pt").to(0) would help.

kevalshah90 commented 5 months ago

@younesbelkada Yes, I tried .to(0) from the demo notebook here.

I still get the same error. Before passing the input to the model, I am also printing the input and the model device:

peft_model.device
device(type='cuda', index=0)

# Test
inputs = tokenizer("Do you have time", return_tensors="pt").input_ids.to(0)
print("Inputs:", inputs)

Inputs: tensor([[   1, 2378,  368,  506,  727]], device='cuda:0')

with torch.no_grad():
    outputs = peft_model(inputs)

print("Outputs:\n", outputs.logits)
print("Outputs dimensions:", outputs.logits.shape)

Error:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)

The Accelerate library seems to be splitting the model between the GPU and CPU under the hood.
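
If that is the case, the dispatch decisions are recorded on the model itself; a diagnostic sketch (assuming hf_device_map was populated by from_pretrained):

# Hypothetical check: hf_device_map records where Accelerate placed each module.
# Any entry mapped to "cpu" or "disk" would explain the mixed-device error.
device_map = getattr(model, "hf_device_map", None)
print(device_map)

if device_map is not None:
    offloaded = {name: dev for name, dev in device_map.items() if dev in ("cpu", "disk")}
    print("Modules placed off the GPU:", offloaded)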

kevalshah90 commented 5 months ago

https://huggingface.co/docs/accelerate/concept_guides/big_model_inference#loading-weights

I am thinking maybe there is something in here:

- [infer_auto_device_map()](https://huggingface.co/docs/accelerate/v0.30.1/en/package_reference/big_modeling#accelerate.infer_auto_device_map) (or device_map="auto" in [load_checkpoint_and_dispatch()](https://huggingface.co/docs/accelerate/v0.30.1/en/package_reference/big_modeling#accelerate.load_checkpoint_and_dispatch)) tries to maximize the GPU and CPU RAM it sees available when you execute it. While PyTorch is very good at managing GPU RAM efficiently (and giving it back when not needed), it's not entirely true with Python and CPU RAM. Therefore, an automatically computed device map might be too intense on the CPU. Move a few modules to the disk device if you get crashes due to a lack of RAM.
- [infer_auto_device_map()](https://huggingface.co/docs/accelerate/v0.30.1/en/package_reference/big_modeling#accelerate.infer_auto_device_map) (or device_map="auto" in [load_checkpoint_and_dispatch()](https://huggingface.co/docs/accelerate/v0.30.1/en/package_reference/big_modeling#accelerate.load_checkpoint_and_dispatch)) attributes devices sequentially (to avoid moving things back and forth), so if your first layer is bigger than the size of the GPU you have, it will end up with everything on the CPU/disk.
- [load_checkpoint_and_dispatch()](https://huggingface.co/docs/accelerate/v0.30.1/en/package_reference/big_modeling#accelerate.load_checkpoint_and_dispatch) and [load_checkpoint_in_model()](https://huggingface.co/docs/accelerate/v0.30.1/en/package_reference/big_modeling#accelerate.load_checkpoint_in_model) do not perform any check on the correctness of your state dict compared to your model at the moment (this will be fixed in a future version), so you may get some weird errors if trying to load a checkpoint with mismatched or missing keys.
- The model parallelism used when your model is split on several GPUs is naive and not optimized, meaning that only one GPU works at a given time while the others sit idle.
- When weights are offloaded to the CPU/hard drive, there is no pre-fetching (yet; we will work on this for future versions), which means the weights are put on the GPU when they are needed and not before.
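
If the automatically inferred map is the culprit, a common workaround is to skip map inference entirely and pin the whole model to a single GPU; a sketch reusing the bnb_config from above:

# Sketch: device_map={"": 0} maps the root module, and therefore every submodule,
# to GPU 0, so Accelerate cannot offload anything to CPU or disk. Loading fails
# outright (CUDA OOM) if the quantized model does not fit, instead of silently spilling over.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map={"": 0},
)
print(model.hf_device_map)  # expected: {'': 0}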
kevalshah90 commented 5 months ago

@younesbelkada

For context, we are on aws ml.g4dn.xlarge instance which is a NVIDIA T4 GPU with 16 GiB memory.
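
A quick check of how close we are to that 16 GiB limit after loading the model (a diagnostic sketch, not part of the original report):

import torch

# Report total, allocated, and reserved GPU memory on device 0, in GiB.
props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 1024**3:.1f} GiB total")
print(f"allocated: {torch.cuda.memory_allocated(0) / 1024**3:.1f} GiB")
print(f"reserved:  {torch.cuda.memory_reserved(0) / 1024**3:.1f} GiB")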

kevalshah90 commented 5 months ago

@younesbelkada

Here's how it seems to have been fixed; I added the few lines of code below:

# Set the device (replace 'cuda:0' with the appropriate GPU if you have multiple GPUs)
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
# Set the device for PyTorch
torch.cuda.set_device(device)

peft_model = peft_model.to(device)
peft_model.hf_device_map

{'': device(type='cuda', index=0)}
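
With every module reported on cuda:0, the earlier test runs end to end; a sketch reusing the tokenizer and inputs from above:

# Re-run the previously failing forward pass now that the whole PEFT model is on cuda:0.
inputs = tokenizer("Do you have time", return_tensors="pt").to(device)

with torch.no_grad():
    outputs = peft_model(**inputs)

print("Outputs:\n", outputs.logits)
print("Outputs dimensions:", outputs.logits.shape)  # (batch_size, num_tokens, vocab_size)
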
younesbelkada commented 4 months ago

Thanks for explaining the fix! This might be related too: https://github.com/huggingface/peft/issues/1840