Closed — Hongjie1Chu closed this issue 5 months ago
Could you upgrade the version of transformers you are using? cc @SunMarc
Hi @Hongjie1Chu, please read the following concept guide, particularly the section on how to design your device_map. You need to be careful to put layers with residual connections on the same device. What works in practice is to stop at the level of a whole layer. For example, take the h.7 layer. You want to change this:
{
'h.7.ln_1': 'cuda:0',
'h.7.attn.c_attn.bias': 'cuda:1',
'h.7.attn.c_attn.weight': 'cuda:1',
'h.7.attn.c_proj.bias': 'cuda:1',
'h.7.attn.c_proj.weight': 'cuda:1',
'h.7.attn.attn_dropout': 'cuda:0',
'h.7.attn.resid_dropout': 'cuda:0',
'h.7.ln_2': 'cuda:0'
}
to
{
'h.7':'cuda:0'
}
Otherwise, you will get an error. You can also set device_map = "auto"
if you don't want to do the allocation manually.
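The change suggested above can also be applied mechanically. Below is a minimal sketch (the helper name coarsen_device_map is hypothetical, not a transformers or accelerate API) that collapses a fine-grained device_map down to block granularity, keeping the device of the first submodule seen for each block:

```python
def coarsen_device_map(device_map, depth=2):
    """Collapse entries like 'h.7.attn.c_attn.weight' down to 'h.7'.

    Keeps the device of the first submodule seen for each block, so a
    block that was split across devices ends up on a single device.
    Top-level entries like 'wte' or 'ln_f' pass through unchanged.
    """
    coarse = {}
    for name, device in device_map.items():
        key = ".".join(name.split(".")[:depth])
        coarse.setdefault(key, device)  # first device seen wins
    return coarse

# Example: the split h.7 entries collapse to a single assignment.
fine = {
    'h.7.ln_1': 'cuda:0',
    'h.7.attn.c_attn.weight': 'cuda:1',
    'h.7.ln_2': 'cuda:0',
}
print(coarsen_device_map(fine))  # {'h.7': 'cuda:0'}
```

The resulting dict can be passed to from_pretrained as the device_map; note it simply keeps the first device it encounters per block, so memory balance across GPUs may need rechecking afterwards.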
OK, I understand, thank you. Now I have another question: does the from_pretrained interface support cross-node inference? The code above runs inference across four GPUs on a single machine. If I want to run inference on two machines with four GPUs each, can I still use from_pretrained? Can DDP implement multi-machine, multi-GPU inference? What methods or interfaces are available to run multi-machine, multi-GPU inference with a pre-trained model? @SunMarc
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
transformers version: 4.30.2

Who can help?
@ArthurZucker @younesbelkada
Issue Description:
I am using GPT2Model.from_pretrained to load a pre-trained GPT-2 model with a device_map specifying the device for each operator. The model loads without any issues. However, when I attempt to get the output from the model, I encounter a RuntimeError indicating that tensors are found on at least two different devices.

This error does not occur when I assign an entire block to a single device. Is it possible that the error is caused by the granularity of my device_map?

What I Have Tried:
- Verified that the device_map correctly assigns each operator to a CUDA device.
- Tried a simpler device_map, and it worked without errors by keeping all tensors on a single device.
- Adjusted the device_map but faced the same issue.

Questions:
- Is there a limit to how fine-grained the device_map can be when specifying devices for individual operators?

Any insights or suggestions to resolve this error would be greatly appreciated.
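As a diagnostic for the granularity question, one way to see where such a map goes wrong is to list the transformer blocks whose submodules span more than one device. This is a sketch; find_split_blocks is a hypothetical helper, not part of transformers:

```python
from collections import defaultdict

def find_split_blocks(device_map):
    """Return names of GPT-2 transformer blocks ('h.N') whose
    submodules are assigned to more than one device."""
    devices = defaultdict(set)
    for name, dev in device_map.items():
        parts = name.split(".")
        # only inspect entries nested under a block, e.g. 'h.4.attn...'
        if parts[0] == "h" and len(parts) > 2:
            devices[".".join(parts[:2])].add(dev)
    return sorted(b for b, d in devices.items() if len(d) > 1)

sample = {
    'h.4.attn.c_attn.weight': 'cuda:1',
    'h.4.attn.attn_dropout': 'cuda:0',
    'h.5.ln_1': 'cuda:0',
}
print(find_split_blocks(sample))  # ['h.4']
```

Any block this returns is one whose residual stream crosses devices mid-block, which matches the RuntimeError described above.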
Information

Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
from transformers import GPT2Tokenizer, GPT2Model, GPT2LMHeadModel
from transformers.utils.fx import symbolic_trace
import json

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')
print(model)
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
new_device_map = {'wte': 'cuda:1', 'wpe': 'cuda:1', 'drop': 'cuda:1', 'h.0.ln_1': 'cuda:1', 'h.0.attn.c_attn.bias': 'cuda:1', 'h.0.attn.c_attn.weight': 'cuda:1', 'h.0.attn.c_proj.bias': 'cuda:1', 'h.0.attn.c_proj.weight': 'cuda:1', 'h.0.attn.attn_dropout': 'cuda:1', 'h.0.attn.resid_dropout': 'cuda:1', 'h.0.ln_2': 'cuda:1', 'h.0.mlp.c_fc.bias': 'cuda:1', 'h.0.mlp.c_fc.weight': 'cuda:1', 'h.0.mlp.c_proj.bias': 'cuda:1', 'h.0.mlp.c_proj.weight': 'cuda:1', 'h.0.mlp.dropout': 'cuda:1', 'h.1.ln_1': 'cuda:1', 'h.1.attn.c_attn.bias': 'cuda:1', 'h.1.attn.c_attn.weight': 'cuda:1', 'h.1.attn.c_proj.bias': 'cuda:1', 'h.1.attn.c_proj.weight': 'cuda:1', 'h.1.attn.attn_dropout': 'cuda:1', 'h.1.attn.resid_dropout': 'cuda:1', 'h.1.ln_2': 'cuda:1', 'h.1.mlp.c_fc.bias': 'cuda:1', 'h.1.mlp.c_fc.weight': 'cuda:1', 'h.1.mlp.c_proj.bias': 'cuda:1', 'h.1.mlp.c_proj.weight': 'cuda:1', 'h.1.mlp.dropout': 'cuda:1', 'h.2.ln_1': 'cuda:1', 'h.2.attn.c_attn.bias': 'cuda:1', 'h.2.attn.c_attn.weight': 'cuda:1', 'h.2.attn.c_proj.bias': 'cuda:1', 'h.2.attn.c_proj.weight': 'cuda:1', 'h.2.attn.attn_dropout': 'cuda:1', 'h.2.attn.resid_dropout': 'cuda:1', 'h.2.ln_2': 'cuda:1', 'h.2.mlp.c_fc.bias': 'cuda:1', 'h.2.mlp.c_fc.weight': 'cuda:1', 'h.2.mlp.c_proj.bias': 'cuda:1', 'h.2.mlp.c_proj.weight': 'cuda:1', 'h.2.mlp.dropout': 'cuda:1', 'h.3.ln_1': 'cuda:1', 'h.3.attn.c_attn.bias': 'cuda:1', 'h.3.attn.c_attn.weight': 'cuda:1', 'h.3.attn.c_proj.bias': 'cuda:1', 'h.3.attn.c_proj.weight': 'cuda:1', 'h.3.attn.attn_dropout': 'cuda:1', 'h.3.attn.resid_dropout': 'cuda:1', 'h.3.ln_2': 'cuda:1', 'h.3.mlp.c_fc.bias': 'cuda:1', 'h.3.mlp.c_fc.weight': 'cuda:1', 'h.3.mlp.c_proj.bias': 'cuda:1', 'h.3.mlp.c_proj.weight': 'cuda:1', 'h.3.mlp.dropout': 'cuda:1', 'h.4.ln_1': 'cuda:1', 'h.4.attn.c_attn.bias': 'cuda:1', 'h.4.attn.c_attn.weight': 'cuda:1', 'h.4.attn.c_proj.bias': 'cuda:1', 'h.4.attn.c_proj.weight': 'cuda:1', 'h.4.attn.attn_dropout': 'cuda:0', 'h.4.attn.resid_dropout': 'cuda:0', 'h.4.ln_2': 'cuda:0', 
'h.4.mlp.c_fc.bias': 'cuda:1', 'h.4.mlp.c_fc.weight': 'cuda:1', 'h.4.mlp.c_proj.bias': 'cuda:1', 'h.4.mlp.c_proj.weight': 'cuda:1', 'h.4.mlp.dropout': 'cuda:0', 'h.5.ln_1': 'cuda:0', 'h.5.attn.c_attn.bias': 'cuda:1', 'h.5.attn.c_attn.weight': 'cuda:1', 'h.5.attn.c_proj.bias': 'cuda:1', 'h.5.attn.c_proj.weight': 'cuda:1', 'h.5.attn.attn_dropout': 'cuda:0', 'h.5.attn.resid_dropout': 'cuda:0', 'h.5.ln_2': 'cuda:0', 'h.5.mlp.c_fc.bias': 'cuda:1', 'h.5.mlp.c_fc.weight': 'cuda:1', 'h.5.mlp.c_proj.bias': 'cuda:1', 'h.5.mlp.c_proj.weight': 'cuda:1', 'h.5.mlp.dropout': 'cuda:0', 'h.6.ln_1': 'cuda:0', 'h.6.attn.c_attn.bias': 'cuda:1', 'h.6.attn.c_attn.weight': 'cuda:1', 'h.6.attn.c_proj.bias': 'cuda:1', 'h.6.attn.c_proj.weight': 'cuda:1', 'h.6.attn.attn_dropout': 'cuda:0', 'h.6.attn.resid_dropout': 'cuda:0', 'h.6.ln_2': 'cuda:0', 'h.6.mlp.c_fc.bias': 'cuda:1', 'h.6.mlp.c_fc.weight': 'cuda:1', 'h.6.mlp.c_proj.bias': 'cuda:1', 'h.6.mlp.c_proj.weight': 'cuda:1', 'h.6.mlp.dropout': 'cuda:0', 'h.7.ln_1': 'cuda:0', 'h.7.attn.c_attn.bias': 'cuda:1', 'h.7.attn.c_attn.weight': 'cuda:1', 'h.7.attn.c_proj.bias': 'cuda:1', 'h.7.attn.c_proj.weight': 'cuda:1', 'h.7.attn.attn_dropout': 'cuda:0', 'h.7.attn.resid_dropout': 'cuda:0', 'h.7.ln_2': 'cuda:0', 'h.7.mlp.c_fc.bias': 'cuda:1', 'h.7.mlp.c_fc.weight': 'cuda:1', 'h.7.mlp.c_proj.bias': 'cuda:1', 'h.7.mlp.c_proj.weight': 'cuda:1', 'h.7.mlp.dropout': 'cuda:0', 'h.8.ln_1': 'cuda:0', 'h.8.attn.c_attn.bias': 'cuda:1', 'h.8.attn.c_attn.weight': 'cuda:1', 'h.8.attn.c_proj.bias': 'cuda:1', 'h.8.attn.c_proj.weight': 'cuda:1', 'h.8.attn.attn_dropout': 'cuda:0', 'h.8.attn.resid_dropout': 'cuda:0', 'h.8.ln_2': 'cuda:0', 'h.8.mlp.c_fc.bias': 'cuda:1', 'h.8.mlp.c_fc.weight': 'cuda:1', 'h.8.mlp.c_proj.bias': 'cuda:1', 'h.8.mlp.c_proj.weight': 'cuda:1', 'h.8.mlp.dropout': 'cuda:0', 'h.9.ln_1': 'cuda:0', 'h.9.attn.c_attn.bias': 'cuda:1', 'h.9.attn.c_attn.weight': 'cuda:1', 'h.9.attn.c_proj.bias': 'cuda:1', 'h.9.attn.c_proj.weight': 'cuda:1', 
'h.9.attn.attn_dropout': 'cuda:0', 'h.9.attn.resid_dropout': 'cuda:0', 'h.9.ln_2': 'cuda:0', 'h.9.mlp.c_fc.bias': 'cuda:1', 'h.9.mlp.c_fc.weight': 'cuda:1', 'h.9.mlp.c_proj.bias': 'cuda:1', 'h.9.mlp.c_proj.weight': 'cuda:1', 'h.9.mlp.dropout': 'cuda:0', 'h.10.ln_1': 'cuda:0', 'h.10.attn.c_attn.bias': 'cuda:1', 'h.10.attn.c_attn.weight': 'cuda:1', 'h.10.attn.c_proj.bias': 'cuda:1', 'h.10.attn.c_proj.weight': 'cuda:1', 'h.10.attn.attn_dropout': 'cuda:0', 'h.10.attn.resid_dropout': 'cuda:0', 'h.10.ln_2': 'cuda:0', 'h.10.mlp.c_fc.bias': 'cuda:1', 'h.10.mlp.c_fc.weight': 'cuda:1', 'h.10.mlp.c_proj.bias': 'cuda:1', 'h.10.mlp.c_proj.weight': 'cuda:1', 'h.10.mlp.dropout': 'cuda:0', 'h.11.ln_1': 'cuda:0', 'h.11.attn.c_attn.bias': 'cuda:1', 'h.11.attn.c_attn.weight': 'cuda:1', 'h.11.attn.c_proj.bias': 'cuda:1', 'h.11.attn.c_proj.weight': 'cuda:1', 'h.11.attn.attn_dropout': 'cuda:0', 'h.11.attn.resid_dropout': 'cuda:0', 'h.11.ln_2': 'cuda:0', 'h.11.mlp.c_fc.bias': 'cuda:1', 'h.11.mlp.c_fc.weight': 'cuda:1', 'h.11.mlp.c_proj.bias': 'cuda:1', 'h.11.mlp.c_proj.weight': 'cuda:1', 'h.11.mlp.dropout': 'cuda:0', 'ln_f': 'cuda:0'}
model = GPT2Model.from_pretrained('gpt2', device_map=new_device_map)
model = model.to("cuda")
print()
encoded_input = encoded_input.to("cuda")
print(model(**encoded_input))
Expected behavior
I expect that by providing a detailed device_map to GPT2Model.from_pretrained, the model would distribute its operations across the specified devices without encountering errors. My expectation is that the forward pass would execute smoothly, with the tensors for each operation residing on the devices I designated in the device_map.
Specifically, with my device_map configuration, the expected outcome is that the model generates outputs efficiently, utilizing the computational resources of multiple GPUs exactly as specified, without any RuntimeError related to device mismatches or tensor placement.
In essence, I am seeking clarity on why the current detailed device mapping results in errors, and guidance on how to modify my device_map to achieve the intended multi-device execution without issues.
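For reference, a block-level map of the kind the maintainers recommend could look like the following sketch. The split point between cuda:0 and cuda:1 here is an arbitrary assumption; any boundary works as long as each h.N block (and its residual connections) stays on a single device:

```python
# Hypothetical block-granular device_map for the 12-layer GPT-2 base
# model: embeddings and the first six blocks on cuda:0, the remaining
# six blocks and the final layer norm on cuda:1.
block_device_map = {
    "wte": "cuda:0",
    "wpe": "cuda:0",
    "drop": "cuda:0",
    **{f"h.{i}": "cuda:0" for i in range(6)},
    **{f"h.{i}": "cuda:1" for i in range(6, 12)},
    "ln_f": "cuda:1",
}
print(len(block_device_map))  # 16
```

Passing a map like this to from_pretrained avoids splitting any block across devices, which should eliminate the "tensors on at least two different devices" RuntimeError (accelerate then handles the transfer between block 5 and block 6 automatically).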