Error running inference on CogVLM2 when distributing it on multiple GPUs: Expected all tensors to be on the same device, but found at least two devices

System Info

transformers: 4.40.2
platform: Amazon Sagemaker ml.g4dn.12xlarge
huggingface_hub: 0.23.0
accelerate: 0.21.0
torch: 2.3.0
torchvision: 0.15.2a0+ab7b3e6
einops: 0.8.0
xformers: 0.0.27.dev841

Who can help?

@ArthurZucker @amyeroberts @Narsil @muellerzr @SunMarc

Information

[ ] The official example scripts
[X] My own modified scripts

Tasks

[X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[ ] My own task or dataset (give details below)

Reproduction

import requests
import torch

from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer
from accelerate import init_empty_weights, load_checkpoint_and_dispatch, infer_auto_device_map

MODEL_PATH = "THUDM/cogvlm2-llama3-chat-19B"
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
# Use bfloat16 only on GPUs with compute capability >= 8; otherwise fall back to float16.
TORCH_TYPE = (
    torch.bfloat16
    if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8
    else torch.float16
)

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=TORCH_TYPE,
    trust_remote_code=True,
    device_map='auto'
).eval()

text_only_template = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {} ASSISTANT:"

while True:

    url = "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.jpg"

    image = Image.open(requests.get(url, stream=True).raw)
    history = []

    while True:
        query = input("Human:")
        if query == "clear":
            break

        input_by_model = model.build_conversation_input_ids(
                tokenizer,
                query=query,
                history=history,
                images=[image],
                template_version='chat'
            )
        # Inputs are placed on a single device; the model itself is sharded by device_map='auto'.
        inputs = {
            'input_ids': input_by_model['input_ids'].unsqueeze(0).to(DEVICE),
            'token_type_ids': input_by_model['token_type_ids'].unsqueeze(0).to(DEVICE),
            'attention_mask': input_by_model['attention_mask'].unsqueeze(0).to(DEVICE),
            'images': [[input_by_model['images'][0].to(DEVICE).to(TORCH_TYPE)]],
        }
        gen_kwargs = {
            "max_new_tokens": 2048,
            "pad_token_id": 128002,  
        }
        with torch.no_grad():
            outputs = model.generate(**inputs, **gen_kwargs)
            outputs = outputs[:, inputs['input_ids'].shape[1]:]
            response = tokenizer.decode(outputs[0])
            response = response.split("<|end_of_text|>")[0]
            print("\nCogVLM2:", response)
        history.append((query, response))

Human input when running the code: "please describe this image"

Expected behavior

The model should be distributed across multiple GPU cards and run inference with the input data placed on only one card, generating a caption for each human prompt. Instead, I get the error shown in the traceback below. I also tried defining my own device map instead of using 'auto' (similar to here), but it gives the same error.
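
One way to check whether a device map (automatic or hand-written) is the culprit is to look for layers whose submodules ended up on different devices. The traceback below fails on a residual add inside `TransformerLayer.forward`, which would be consistent with one vision layer being split between cuda:2 and cuda:3. The following is a minimal sketch, not part of the original script: `find_split_layers` is a hypothetical helper, and the example map is made up for illustration (`mlp` and `post_attention_layernorm` appear in the traceback; the other names, the layer indices, and the device assignments are assumptions).

```python
import re
from collections import defaultdict

# Matches names such as "model.vision.transformer.layers.61.mlp" and captures
# the enclosing indexed layer ("model.vision.transformer.layers.61").
_LAYER_CHILD = re.compile(r"^(.*?\.\d+)\..+$")


def find_split_layers(device_map):
    """Return {layer_name: [devices]} for layers whose submodules span devices.

    `device_map` maps module names to devices, in the same shape as the
    `hf_device_map` attribute of a model loaded with `device_map=...`.
    """
    devices_by_layer = defaultdict(set)
    for name, device in device_map.items():
        match = _LAYER_CHILD.match(name)
        if match:
            # The entry is a submodule of an indexed layer, so the layer
            # was not placed as a single unit.
            devices_by_layer[match.group(1)].add(device)
    return {
        layer: sorted(devices, key=str)
        for layer, devices in devices_by_layer.items()
        if len(devices) > 1
    }


# Hypothetical device map, for illustration only.
example_map = {
    "model.embed_tokens": 0,
    "model.vision.transformer.layers.61.input_layernorm": 2,
    "model.vision.transformer.layers.61.attention": 2,
    "model.vision.transformer.layers.61.mlp": 3,
    "model.vision.transformer.layers.61.post_attention_layernorm": 3,
    "model.vision.transformer.layers.62": 3,
    "model.layers.0": 3,
    "lm_head": 3,
}
split_layers = find_split_layers(example_map)
print(split_layers)
```

With a model loaded as in the reproduction script, the same check would be `find_split_layers(model.hf_device_map)`. An empty result would suggest the mismatch does not come from a layer being split at this level.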

> RuntimeError                              Traceback (most recent call last)
> Cell In[1], line 77
>      72 gen_kwargs = {
>      73     "max_new_tokens": 2048,
>      74     "pad_token_id": 128002,  
>      75 }
>      76 with torch.no_grad():
> ---> 77     outputs = model.generate(**inputs, **gen_kwargs)
>      78     outputs = outputs[:, inputs['input_ids'].shape[1]:]
>      79     response = tokenizer.decode(outputs[0])
> 
> File /opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
>     112 @functools.wraps(func)
>     113 def decorate_context(*args, **kwargs):
>     114     with ctx_factory():
> --> 115         return func(*args, **kwargs)
> 
> File /opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py:1622, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, **kwargs)
>    1614     input_ids, model_kwargs = self._expand_inputs_for_generation(
>    1615         input_ids=input_ids,
>    1616         expand_size=generation_config.num_return_sequences,
>    1617         is_encoder_decoder=self.config.is_encoder_decoder,
>    1618         **model_kwargs,
>    1619     )
>    1621     # 13. run sample
> -> 1622     result = self._sample(
>    1623         input_ids,
>    1624         logits_processor=prepared_logits_processor,
>    1625         logits_warper=logits_warper,
>    1626         stopping_criteria=prepared_stopping_criteria,
>    1627         pad_token_id=generation_config.pad_token_id,
>    1628         output_scores=generation_config.output_scores,
>    1629         output_logits=generation_config.output_logits,
>    1630         return_dict_in_generate=generation_config.return_dict_in_generate,
>    1631         synced_gpus=synced_gpus,
>    1632         streamer=streamer,
>    1633         **model_kwargs,
>    1634     )
>    1636 elif generation_mode == GenerationMode.BEAM_SEARCH:
>    1637     # 11. prepare beam search scorer
>    1638     beam_scorer = BeamSearchScorer(
>    1639         batch_size=batch_size,
>    1640         num_beams=generation_config.num_beams,
>    (...)
>    1645         max_length=generation_config.max_length,
>    1646     )
> 
> File /opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py:2791, in GenerationMixin._sample(self, input_ids, logits_processor, stopping_criteria, logits_warper, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, output_logits, return_dict_in_generate, synced_gpus, streamer, **model_kwargs)
>    2788 model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
>    2790 # forward pass to get next token
> -> 2791 outputs = self(
>    2792     **model_inputs,
>    2793     return_dict=True,
>    2794     output_attentions=output_attentions,
>    2795     output_hidden_states=output_hidden_states,
>    2796 )
>    2798 if synced_gpus and this_peer_finished:
>    2799     continue  # don't waste resources running the code we don't need
> 
> File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1532, in Module._wrapped_call_impl(self, *args, **kwargs)
>    1530     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
>    1531 else:
> -> 1532     return self._call_impl(*args, **kwargs)
> 
> File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1541, in Module._call_impl(self, *args, **kwargs)
>    1536 # If we don't have any hooks, we want to skip the rest of the logic in
>    1537 # this function, and just call forward.
>    1538 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
>    1539         or _global_backward_pre_hooks or _global_backward_hooks
>    1540         or _global_forward_hooks or _global_forward_pre_hooks):
> -> 1541     return forward_call(*args, **kwargs)
>    1543 try:
>    1544     result = None
> 
> File /opt/conda/lib/python3.10/site-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
>     163         output = old_forward(*args, **kwargs)
>     164 else:
> --> 165     output = old_forward(*args, **kwargs)
>     166 return module._hf_hook.post_forward(module, output)
> 
> File ~/.cache/huggingface/modules/transformers_modules/THUDM/cogvlm2-llama3-chat-19B/2bf7de6892877eb50142395af14847519ba95998/modeling_cogvlm.py:649, in CogVLMForCausalLM.forward(self, input_ids, images, token_type_ids, attention_mask, position_ids, past_key_values, inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict, labels)
>     646 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
>     648 # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
> --> 649 outputs = self.model(
>     650     input_ids=input_ids,
>     651     images=images,
>     652     token_type_ids=token_type_ids,
>     653     attention_mask=attention_mask,
>     654     position_ids=position_ids,
>     655     past_key_values=past_key_values,
>     656     inputs_embeds=inputs_embeds,
>     657     use_cache=use_cache,
>     658     output_attentions=output_attentions,
>     659     output_hidden_states=output_hidden_states,
>     660     return_dict=return_dict,
>     661 )
>     663 hidden_states = outputs[0]
>     664 logits = self.lm_head(hidden_states)
> 
> File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1532, in Module._wrapped_call_impl(self, *args, **kwargs)
>    1530     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
>    1531 else:
> -> 1532     return self._call_impl(*args, **kwargs)
> 
> File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1541, in Module._call_impl(self, *args, **kwargs)
>    1536 # If we don't have any hooks, we want to skip the rest of the logic in
>    1537 # this function, and just call forward.
>    1538 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
>    1539         or _global_backward_pre_hooks or _global_backward_hooks
>    1540         or _global_forward_hooks or _global_forward_pre_hooks):
> -> 1541     return forward_call(*args, **kwargs)
>    1543 try:
>    1544     result = None
> 
> File ~/.cache/huggingface/modules/transformers_modules/THUDM/cogvlm2-llama3-chat-19B/2bf7de6892877eb50142395af14847519ba95998/modeling_cogvlm.py:390, in CogVLMModel.forward(self, input_ids, images, token_type_ids, attention_mask, position_ids, past_key_values, inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict)
>     388 assert len(input_ids) == len(images), f"{len(input_ids)} {len(images)}"
>     389 inputs_embeds = self.embed_tokens(input_ids)
> --> 390 images_features = self.encode_images(images)
>     391 images_features = rearrange(images_features, 'b n d -> (b n) d')
>     392 images_features = images_features.to(dtype=inputs_embeds.dtype, device=inputs_embeds.device)
> 
> File ~/.cache/huggingface/modules/transformers_modules/THUDM/cogvlm2-llama3-chat-19B/2bf7de6892877eb50142395af14847519ba95998/modeling_cogvlm.py:362, in CogVLMModel.encode_images(self, images)
>     359         images.append(image)
>     361 images = torch.stack(images)
> --> 362 images_features = self.vision(images)
>     363 return images_features
> 
> File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1532, in Module._wrapped_call_impl(self, *args, **kwargs)
>    1530     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
>    1531 else:
> -> 1532     return self._call_impl(*args, **kwargs)
> 
> File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1541, in Module._call_impl(self, *args, **kwargs)
>    1536 # If we don't have any hooks, we want to skip the rest of the logic in
>    1537 # this function, and just call forward.
>    1538 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
>    1539         or _global_backward_pre_hooks or _global_backward_hooks
>    1540         or _global_forward_hooks or _global_forward_pre_hooks):
> -> 1541     return forward_call(*args, **kwargs)
>    1543 try:
>    1544     result = None
> 
> File ~/.cache/huggingface/modules/transformers_modules/THUDM/cogvlm2-llama3-chat-19B/2bf7de6892877eb50142395af14847519ba95998/visual.py:130, in EVA2CLIPModel.forward(self, images)
>     128 def forward(self, images: "tensor(B, C, H, W)") -> "tensor(B, L, D)":
>     129     x = self.patch_embedding(images)
> --> 130     x = self.transformer(x)
>     131     x = x[:, 1:]
>     133     b, s, h = x.shape
> 
> File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1532, in Module._wrapped_call_impl(self, *args, **kwargs)
>    1530     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
>    1531 else:
> -> 1532     return self._call_impl(*args, **kwargs)
> 
> File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1541, in Module._call_impl(self, *args, **kwargs)
>    1536 # If we don't have any hooks, we want to skip the rest of the logic in
>    1537 # this function, and just call forward.
>    1538 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
>    1539         or _global_backward_pre_hooks or _global_backward_hooks
>    1540         or _global_forward_hooks or _global_forward_pre_hooks):
> -> 1541     return forward_call(*args, **kwargs)
>    1543 try:
>    1544     result = None
> 
> File ~/.cache/huggingface/modules/transformers_modules/THUDM/cogvlm2-llama3-chat-19B/2bf7de6892877eb50142395af14847519ba95998/visual.py:94, in Transformer.forward(self, hidden_states)
>      92 def forward(self, hidden_states):
>      93     for layer_module in self.layers:
> ---> 94         hidden_states = layer_module(hidden_states)
>      95     return hidden_states
> 
> File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1532, in Module._wrapped_call_impl(self, *args, **kwargs)
>    1530     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
>    1531 else:
> -> 1532     return self._call_impl(*args, **kwargs)
> 
> File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1541, in Module._call_impl(self, *args, **kwargs)
>    1536 # If we don't have any hooks, we want to skip the rest of the logic in
>    1537 # this function, and just call forward.
>    1538 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
>    1539         or _global_backward_pre_hooks or _global_backward_hooks
>    1540         or _global_forward_hooks or _global_forward_pre_hooks):
> -> 1541     return forward_call(*args, **kwargs)
>    1543 try:
>    1544     result = None
> 
> File ~/.cache/huggingface/modules/transformers_modules/THUDM/cogvlm2-llama3-chat-19B/2bf7de6892877eb50142395af14847519ba95998/visual.py:83, in TransformerLayer.forward(self, hidden_states)
>      81 mlp_input = hidden_states
>      82 mlp_output = self.post_attention_layernorm(self.mlp(mlp_input))
> ---> 83 output = mlp_input + mlp_output
>      84 return output
> 
> RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:2 and cuda:3!
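
Since the failing line is the residual add in `TransformerLayer.forward`, one thing that might be worth trying is a device map that never splits a layer across GPUs, using accelerate's `no_split_module_classes`. The sketch below has not been run on this setup and rests on assumptions: `build_max_memory` and `load_with_unsplit_layers` are hypothetical helpers, the 14GiB per-GPU budget is a guess for the 16 GB cards on ml.g4dn.12xlarge, and `CogVLMDecoderLayer` is an assumed class name from the model's remote code (`TransformerLayer` is the class named in the traceback).

```python
def build_max_memory(num_gpus, per_gpu="14GiB"):
    """Return a max_memory mapping (GPU index -> budget) in the form accelerate accepts."""
    return {gpu: per_gpu for gpu in range(num_gpus)}


def load_with_unsplit_layers(model_path, dtype, num_gpus, per_gpu="14GiB"):
    """Load the model with a device map that keeps each listed layer class on one GPU."""
    # Imported here so build_max_memory stays usable without the heavy dependencies.
    from accelerate import infer_auto_device_map, init_empty_weights
    from transformers import AutoConfig, AutoModelForCausalLM

    config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)

    # Instantiate the architecture on the meta device: no weights are allocated.
    with init_empty_weights():
        empty_model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)

    device_map = infer_auto_device_map(
        empty_model,
        max_memory=build_max_memory(num_gpus, per_gpu),
        # "TransformerLayer" is the vision layer class from the traceback;
        # "CogVLMDecoderLayer" is an assumed name for the language-model layer.
        no_split_module_classes=["CogVLMDecoderLayer", "TransformerLayer"],
        dtype=dtype,
    )

    # Load the real weights, placing each module according to the computed map.
    return AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=dtype,
        trust_remote_code=True,
        device_map=device_map,
    ).eval()
```

In the reproduction script this would replace the `from_pretrained(..., device_map='auto')` call, for example `model = load_with_unsplit_layers(MODEL_PATH, TORCH_TYPE, torch.cuda.device_count())`. If the error persists with this map, the mismatch probably comes from somewhere other than a layer being split.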

huggingface / transformers

Error running inference on CogVLM2 when distributing it on multiple GPUs: Expected all tensors to be on the same device, but found at least two devices #31676