huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

RuntimeError when using device_map with GPT2Model.from_pretrained across multiple CUDA devices #29268

Closed Hongjie1Chu closed 5 months ago

Hongjie1Chu commented 7 months ago

System Info

Who can help?

@ArthurZucker @younesbelkada

Issue Description:

I am using GPT2Model.from_pretrained to load a pre-trained GPT-2 model with a device_map specifying the device for each operator. The model loads without any issues:

from transformers import GPT2Model

device_map = { ... }  # Detailed device_map specifying device for each operator
model = GPT2Model.from_pretrained('gpt2', device_map=device_map)

However, when I attempt to get the output from the model, I encounter a RuntimeError indicating that tensors are found on at least two different devices:

RuntimeError                              Traceback (most recent call last)
Cell In[294], line 5
      3 print()
      4 encoded_input=encoded_input.to("cuda")
----> 5 print(model(**encoded_input))

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

File /opt/conda/lib/python3.10/site-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(module, *args, **kwargs)
    163         output = module._old_forward(*args, **kwargs)
    164 else:
--> 165     output = module._old_forward(*args, **kwargs)
    166 return module._hf_hook.post_forward(module, output)

File /opt/conda/lib/python3.10/site-packages/transformers/models/gpt2/modeling_gpt2.py:837, in GPT2Model.forward(self, input_ids, past_key_values, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, use_cache, output_attentions, output_hidden_states, return_dict)
    834 head_mask = self.get_head_mask(head_mask, self.config.n_layer)
    836 if inputs_embeds is None:
--> 837     inputs_embeds = self.wte(input_ids)
    838 position_embeds = self.wpe(position_ids)
    839 hidden_states = inputs_embeds + position_embeds

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

File /opt/conda/lib/python3.10/site-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(module, *args, **kwargs)
    163         output = module._old_forward(*args, **kwargs)
    164 else:
--> 165     output = module._old_forward(*args, **kwargs)
    166 return module._hf_hook.post_forward(module, output)

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/sparse.py:162, in Embedding.forward(self, input)
    161 def forward(self, input: Tensor) -> Tensor:
--> 162     return F.embedding(
    163         input, self.weight, self.padding_idx, self.max_norm,
    164         self.norm_type, self.scale_grad_by_freq, self.sparse)

File /opt/conda/lib/python3.10/site-packages/torch/nn/functional.py:2233, in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   2227     # Note [embedding_renorm set_grad_enabled]
   2228     # XXX: equivalent to
   2229     # with torch.no_grad():
   2230     #   torch.embedding_renorm_
   2231     # remove once script supports set_grad_enabled
   2232     _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 2233 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument index in method wrapper_CUDA__index_select)

This error does not occur when I assign an entire block to a single device. Is it possible that the error is caused by the granularity of my device_map?
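For illustration, a coarser map that keeps every block on a single device works without the error (an abridged sketch, not the exact map I used):

from transformers import GPT2Model

# Block-level device_map: each h.N block is assigned to one device as a whole
block_device_map = {
    'wte': 'cuda:0',
    'wpe': 'cuda:0',
    'drop': 'cuda:0',
    'h.0': 'cuda:0',
    'h.1': 'cuda:0',
    # ... remaining blocks ...
    'h.10': 'cuda:1',
    'h.11': 'cuda:1',
    'ln_f': 'cuda:1',
}
model = GPT2Model.from_pretrained('gpt2', device_map=block_device_map)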

What I Have Tried:

Questions:

Any insights or suggestions to resolve this error would be greatly appreciated.

Information

Tasks

Reproduction

from transformers import GPT2Tokenizer, GPT2Model, GPT2LMHeadModel
from transformers.utils.fx import symbolic_trace
import json

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')
print(model)
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')

new_device_map = {'wte': 'cuda:1', 'wpe': 'cuda:1', 'drop': 'cuda:1', 'h.0.ln_1': 'cuda:1', 'h.0.attn.c_attn.bias': 'cuda:1', 'h.0.attn.c_attn.weight': 'cuda:1', 'h.0.attn.c_proj.bias': 'cuda:1', 'h.0.attn.c_proj.weight': 'cuda:1', 'h.0.attn.attn_dropout': 'cuda:1', 'h.0.attn.resid_dropout': 'cuda:1', 'h.0.ln_2': 'cuda:1', 'h.0.mlp.c_fc.bias': 'cuda:1', 'h.0.mlp.c_fc.weight': 'cuda:1', 'h.0.mlp.c_proj.bias': 'cuda:1', 'h.0.mlp.c_proj.weight': 'cuda:1', 'h.0.mlp.dropout': 'cuda:1', 'h.1.ln_1': 'cuda:1', 'h.1.attn.c_attn.bias': 'cuda:1', 'h.1.attn.c_attn.weight': 'cuda:1', 'h.1.attn.c_proj.bias': 'cuda:1', 'h.1.attn.c_proj.weight': 'cuda:1', 'h.1.attn.attn_dropout': 'cuda:1', 'h.1.attn.resid_dropout': 'cuda:1', 'h.1.ln_2': 'cuda:1', 'h.1.mlp.c_fc.bias': 'cuda:1', 'h.1.mlp.c_fc.weight': 'cuda:1', 'h.1.mlp.c_proj.bias': 'cuda:1', 'h.1.mlp.c_proj.weight': 'cuda:1', 'h.1.mlp.dropout': 'cuda:1', 'h.2.ln_1': 'cuda:1', 'h.2.attn.c_attn.bias': 'cuda:1', 'h.2.attn.c_attn.weight': 'cuda:1', 'h.2.attn.c_proj.bias': 'cuda:1', 'h.2.attn.c_proj.weight': 'cuda:1', 'h.2.attn.attn_dropout': 'cuda:1', 'h.2.attn.resid_dropout': 'cuda:1', 'h.2.ln_2': 'cuda:1', 'h.2.mlp.c_fc.bias': 'cuda:1', 'h.2.mlp.c_fc.weight': 'cuda:1', 'h.2.mlp.c_proj.bias': 'cuda:1', 'h.2.mlp.c_proj.weight': 'cuda:1', 'h.2.mlp.dropout': 'cuda:1', 'h.3.ln_1': 'cuda:1', 'h.3.attn.c_attn.bias': 'cuda:1', 'h.3.attn.c_attn.weight': 'cuda:1', 'h.3.attn.c_proj.bias': 'cuda:1', 'h.3.attn.c_proj.weight': 'cuda:1', 'h.3.attn.attn_dropout': 'cuda:1', 'h.3.attn.resid_dropout': 'cuda:1', 'h.3.ln_2': 'cuda:1', 'h.3.mlp.c_fc.bias': 'cuda:1', 'h.3.mlp.c_fc.weight': 'cuda:1', 'h.3.mlp.c_proj.bias': 'cuda:1', 'h.3.mlp.c_proj.weight': 'cuda:1', 'h.3.mlp.dropout': 'cuda:1', 'h.4.ln_1': 'cuda:1', 'h.4.attn.c_attn.bias': 'cuda:1', 'h.4.attn.c_attn.weight': 'cuda:1', 'h.4.attn.c_proj.bias': 'cuda:1', 'h.4.attn.c_proj.weight': 'cuda:1', 'h.4.attn.attn_dropout': 'cuda:0', 'h.4.attn.resid_dropout': 'cuda:0', 'h.4.ln_2': 'cuda:0', 'h.4.mlp.c_fc.bias': 'cuda:1', 'h.4.mlp.c_fc.weight': 'cuda:1', 'h.4.mlp.c_proj.bias': 'cuda:1', 'h.4.mlp.c_proj.weight': 'cuda:1', 'h.4.mlp.dropout': 'cuda:0', 'h.5.ln_1': 'cuda:0', 'h.5.attn.c_attn.bias': 'cuda:1', 'h.5.attn.c_attn.weight': 'cuda:1', 'h.5.attn.c_proj.bias': 'cuda:1', 'h.5.attn.c_proj.weight': 'cuda:1', 'h.5.attn.attn_dropout': 'cuda:0', 'h.5.attn.resid_dropout': 'cuda:0', 'h.5.ln_2': 'cuda:0', 'h.5.mlp.c_fc.bias': 'cuda:1', 'h.5.mlp.c_fc.weight': 'cuda:1', 'h.5.mlp.c_proj.bias': 'cuda:1', 'h.5.mlp.c_proj.weight': 'cuda:1', 'h.5.mlp.dropout': 'cuda:0', 'h.6.ln_1': 'cuda:0', 'h.6.attn.c_attn.bias': 'cuda:1', 'h.6.attn.c_attn.weight': 'cuda:1', 'h.6.attn.c_proj.bias': 'cuda:1', 'h.6.attn.c_proj.weight': 'cuda:1', 'h.6.attn.attn_dropout': 'cuda:0', 'h.6.attn.resid_dropout': 'cuda:0', 'h.6.ln_2': 'cuda:0', 'h.6.mlp.c_fc.bias': 'cuda:1', 'h.6.mlp.c_fc.weight': 'cuda:1', 'h.6.mlp.c_proj.bias': 'cuda:1', 'h.6.mlp.c_proj.weight': 'cuda:1', 'h.6.mlp.dropout': 'cuda:0', 'h.7.ln_1': 'cuda:0', 'h.7.attn.c_attn.bias': 'cuda:1', 'h.7.attn.c_attn.weight': 'cuda:1', 'h.7.attn.c_proj.bias': 'cuda:1', 'h.7.attn.c_proj.weight': 'cuda:1', 'h.7.attn.attn_dropout': 'cuda:0', 'h.7.attn.resid_dropout': 'cuda:0', 'h.7.ln_2': 'cuda:0', 'h.7.mlp.c_fc.bias': 'cuda:1', 'h.7.mlp.c_fc.weight': 'cuda:1', 'h.7.mlp.c_proj.bias': 'cuda:1', 'h.7.mlp.c_proj.weight': 'cuda:1', 'h.7.mlp.dropout': 'cuda:0', 'h.8.ln_1': 'cuda:0', 'h.8.attn.c_attn.bias': 'cuda:1', 'h.8.attn.c_attn.weight': 'cuda:1', 'h.8.attn.c_proj.bias': 'cuda:1', 'h.8.attn.c_proj.weight': 
'cuda:1', 'h.8.attn.attn_dropout': 'cuda:0', 'h.8.attn.resid_dropout': 'cuda:0', 'h.8.ln_2': 'cuda:0', 'h.8.mlp.c_fc.bias': 'cuda:1', 'h.8.mlp.c_fc.weight': 'cuda:1', 'h.8.mlp.c_proj.bias': 'cuda:1', 'h.8.mlp.c_proj.weight': 'cuda:1', 'h.8.mlp.dropout': 'cuda:0', 'h.9.ln_1': 'cuda:0', 'h.9.attn.c_attn.bias': 'cuda:1', 'h.9.attn.c_attn.weight': 'cuda:1', 'h.9.attn.c_proj.bias': 'cuda:1', 'h.9.attn.c_proj.weight': 'cuda:1', 'h.9.attn.attn_dropout': 'cuda:0', 'h.9.attn.resid_dropout': 'cuda:0', 'h.9.ln_2': 'cuda:0', 'h.9.mlp.c_fc.bias': 'cuda:1', 'h.9.mlp.c_fc.weight': 'cuda:1', 'h.9.mlp.c_proj.bias': 'cuda:1', 'h.9.mlp.c_proj.weight': 'cuda:1', 'h.9.mlp.dropout': 'cuda:0', 'h.10.ln_1': 'cuda:0', 'h.10.attn.c_attn.bias': 'cuda:1', 'h.10.attn.c_attn.weight': 'cuda:1', 'h.10.attn.c_proj.bias': 'cuda:1', 'h.10.attn.c_proj.weight': 'cuda:1', 'h.10.attn.attn_dropout': 'cuda:0', 'h.10.attn.resid_dropout': 'cuda:0', 'h.10.ln_2': 'cuda:0', 'h.10.mlp.c_fc.bias': 'cuda:1', 'h.10.mlp.c_fc.weight': 'cuda:1', 'h.10.mlp.c_proj.bias': 'cuda:1', 'h.10.mlp.c_proj.weight': 'cuda:1', 'h.10.mlp.dropout': 'cuda:0', 'h.11.ln_1': 'cuda:0', 'h.11.attn.c_attn.bias': 'cuda:1', 'h.11.attn.c_attn.weight': 'cuda:1', 'h.11.attn.c_proj.bias': 'cuda:1', 'h.11.attn.c_proj.weight': 'cuda:1', 'h.11.attn.attn_dropout': 'cuda:0', 'h.11.attn.resid_dropout': 'cuda:0', 'h.11.ln_2': 'cuda:0', 'h.11.mlp.c_fc.bias': 'cuda:1', 'h.11.mlp.c_fc.weight': 'cuda:1', 'h.11.mlp.c_proj.bias': 'cuda:1', 'h.11.mlp.c_proj.weight': 'cuda:1', 'h.11.mlp.dropout': 'cuda:0', 'ln_f': 'cuda:0'}

model = GPT2Model.from_pretrained('gpt2', device_map=new_device_map)
model = model.to("cuda")
print()
encoded_input = encoded_input.to("cuda")
print(model(**encoded_input))

Expected behavior

I expect that by providing a detailed device_map to GPT2Model.from_pretrained, the model would distribute its operations across the specified devices without errors. The forward pass should execute smoothly, with the tensors for each operation residing on the device I designated in the device_map.

Specifically, with my device_map configuration, the model should generate outputs using the computational resources of multiple GPUs exactly as specified, without any RuntimeError related to device mismatches or tensor placement.

In short, I am looking for an explanation of why this detailed device mapping results in errors, and guidance on how to modify my device_map to achieve the intended multi-device execution.

ArthurZucker commented 6 months ago

Could you upgrade the version of transformers you are using? cc @SunMarc

SunMarc commented 6 months ago

Hi @Hongjie1Chu, please read the concept guide, in particular the section on how to design your device_map. You need to be careful to keep layers that are joined by a residual connection on the same device. What works in practice is to stop at the level of a whole layer. For example, let's take the h.7 layer. You want to change this:

{
'h.7.ln_1': 'cuda:0',
'h.7.attn.c_attn.bias': 'cuda:1',
'h.7.attn.c_attn.weight': 'cuda:1',
'h.7.attn.c_proj.bias': 'cuda:1',
'h.7.attn.c_proj.weight': 'cuda:1',
'h.7.attn.attn_dropout': 'cuda:0',
'h.7.attn.resid_dropout': 'cuda:0',
'h.7.ln_2': 'cuda:0'
}

to

{
'h.7':'cuda:0'
}

Otherwise, you will get an error. You can also set device_map = "auto" if you don't want to do the allocation manually.
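If you go the automatic route, something along these lines should work (a minimal sketch; the exact placement accelerate chooses depends on the memory available on your GPUs):

from transformers import GPT2Tokenizer, GPT2Model

# Let accelerate infer a module-to-device split that keeps each block whole
model = GPT2Model.from_pretrained('gpt2', device_map='auto')

# The placement that was actually used is stored on the model
print(model.hf_device_map)

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# Put the inputs on the device holding the embedding layer (device 0 with an auto map here)
encoded_input = tokenizer("Replace me by any text you'd like.", return_tensors='pt').to('cuda:0')
output = model(**encoded_input)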

Hongjie1Chu commented 6 months ago

OK, I understand, thank you. Now I have another question: does from_pretrained support cross-node inference? The code above runs inference on four GPUs in a single machine. If I want to run inference on two machines with four GPUs each, can I still use from_pretrained? Can DDP be used for multi-machine, multi-GPU inference? What methods or interfaces are there for running multi-machine, multi-GPU inference with a pretrained model? @SunMarc

github-actions[bot] commented 6 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.