huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

NaN when training "t5-small" with parallelize() on multiple GPUs #21093

Closed MrZhangKY closed 1 year ago

MrZhangKY commented 1 year ago

Who can help?

@ArthurZucker @younesbelkada

Reproduction

When I use "t5-small" to generate target text from source text, the loss becomes NaN if I call model.parallelize() before computing it. If I just call model.cuda() instead, the loss is normal. Is there anything wrong with the parallelize() function? As far as I know, PyTorch does not require any special handling of backward() or parameter updates under model parallelism. Here is a toy example:

## Data
source = ['[ equal statistic job customer ostrich orange badger blue bull daisy giraffe hamster ivy rabbit possum whale cashew rectangle oval square pecan ] [ you orchid orange frog grey ivy racoon potato whale flax cylinder fennel pumpkin leopard ] [ desire otter mole albatross bat buffalo cat daisy grey hedgehog holly racoon squirrel potato whale apricot cylinder heart fennel raisin lilac ] [ action watch speak otter orange alligator bat bull clover daisy fish green horse racoon potato whale apricot rectangle circle fennel pecan lavender ] [ emotion ostrich orange alligator badger black cobra daisy fish green hazel horse squirrel potato wolf rectangle oval fennel pecan leopard ]', '[ problem ostrich mule badger black bull fox green hazel ivy rabbit potato whale sunflower sphere circle triangle pecan lemur ] [ company frequency time visual laws owl mule alligator badger black clover deer donkey emu grey horse racoon possum whale apricot rectangle oval fennel pecan lavender ] [ lion otter orange baboon bull cat daisy flamingo green iris racoon possum vulture apricot rectangle fennel raisin lemur ] [ equal orchid orange blossom camel chameleon deer fish green jackal possum whale rectangle oval triangle pecan ] [ equal conversation speak orchid orange bat camel deer fish grey iris racoon possum whale flax rectangle triangle pumpkin lemur ]']
target = ['Later on in a career, 300 people are clients. You have their attention. The only attention needed is to make up for attrition or to continue growth at the desired rate. With that being said, be on the lookout for me in some crazy Hawaiian shirts. Maybe a fun tie when I have to dress up.', 'To his detriment, I don’t remember anything else. I thought of the rule again a few days ago because of a Hawaiian style, Sandlot movie shirt. I was the person wearing it, and yes, I did receive all kinds of attention. Or maybe, it was the shirt. Either way people were talking to me.']

modelName = "t5-small"

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(modelName, model_max_length=512)
source_tokens = [tokenizer(i) for i in source]
target_tokens = [tokenizer(i) for i in target]

# Model & Optimizer
from transformers import T5ForConditionalGeneration
model = T5ForConditionalGeneration.from_pretrained(modelName)
model.parallelize()  #Model Parallelism
# model.to('cuda:0')

import torch
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

## Train
if __name__ == '__main__':
    for epoch in range(10):
        for i in range(2):
            loss = model(input_ids=torch.tensor(source_tokens[i]['input_ids']).unsqueeze(0).to('cuda:0'),
                        attention_mask=torch.tensor(source_tokens[i]['attention_mask']).unsqueeze(0).to('cuda:0'),
                        labels=torch.tensor(target_tokens[i]['input_ids']).unsqueeze(0).to('cuda:0')).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            print(loss)

Expected behavior

The loss shouldn't be NaN when model.parallelize() is used; it should be the same as when model.to('cuda:0') is used.

ArthurZucker commented 1 year ago

parallelize() is a deprecated function and should not be used. You should use the accelerate library instead, see here.
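
For reference, a minimal sketch of that approach (an illustration on my part, not from this comment; it assumes accelerate is installed and uses device_map="auto", while the script later in this thread uses device_map="balanced"):

from transformers import T5ForConditionalGeneration

# With accelerate installed, device_map lets from_pretrained split the
# model's submodules across the available GPUs instead of replicating it.
model = T5ForConditionalGeneration.from_pretrained("t5-small", device_map="auto")
print(model.hf_device_map)  # shows which device each submodule was placed on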

MrZhangKY commented 1 year ago

@ArthurZucker Thank you for your help. However, I use parallelize() because it can distribute the layers of a model across different GPUs, as shown in the attached screenshot.

Some of the models usable with T5ForConditionalGeneration are so big (for example, t5-11b is about 45 GB) that they cannot fit on a single GPU.

But when I use accelerate to distribute the model, it seems to use data parallelism and still puts all layers of the model on a single GPU, as shown in the attached screenshot.

Could you please tell me how I can write code that trains the model with model parallelism? Thank you!

younesbelkada commented 1 year ago

Hi @MrZhangKY, in this case you can use device_map='balanced'. The script below worked for me (no NaN loss) on 2x NVIDIA T4 GPUs:

## Data
source = ['[ equal statistic job customer ostrich orange badger blue bull daisy giraffe hamster ivy rabbit possum whale cashew rectangle oval square pecan ] [ you orchid orange frog grey ivy racoon potato whale flax cylinder fennel pumpkin leopard ] [ desire otter mole albatross bat buffalo cat daisy grey hedgehog holly racoon squirrel potato whale apricot cylinder heart fennel raisin lilac ] [ action watch speak otter orange alligator bat bull clover daisy fish green horse racoon potato whale apricot rectangle circle fennel pecan lavender ] [ emotion ostrich orange alligator badger black cobra daisy fish green hazel horse squirrel potato wolf rectangle oval fennel pecan leopard ]', '[ problem ostrich mule badger black bull fox green hazel ivy rabbit potato whale sunflower sphere circle triangle pecan lemur ] [ company frequency time visual laws owl mule alligator badger black clover deer donkey emu grey horse racoon possum whale apricot rectangle oval fennel pecan lavender ] [ lion otter orange baboon bull cat daisy flamingo green iris racoon possum vulture apricot rectangle fennel raisin lemur ] [ equal orchid orange blossom camel chameleon deer fish green jackal possum whale rectangle oval triangle pecan ] [ equal conversation speak orchid orange bat camel deer fish grey iris racoon possum whale flax rectangle triangle pumpkin lemur ]']
target = ['Later on in a career, 300 people are clients. You have their attention. The only attention needed is to make up for attrition or to continue growth at the desired rate. With that being said, be on the lookout for me in some crazy Hawaiian shirts. Maybe a fun tie when I have to dress up.', 'To his detriment, I don’t remember anything else. I thought of the rule again a few days ago because of a Hawaiian style, Sandlot movie shirt. I was the person wearing it, and yes, I did receive all kinds of attention. Or maybe, it was the shirt. Either way people were talking to me.']

modelName = "t5-small"

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(modelName, model_max_length=512)
source_tokens = [tokenizer(i) for i in source]
target_tokens = [tokenizer(i) for i in target]

# Model & Optimizer
from transformers import T5ForConditionalGeneration
model = T5ForConditionalGeneration.from_pretrained(modelName, device_map="balanced")
print(set(model.hf_device_map.values()))
# model.parallelize()  #Model Parallelism
# model.to('cuda:0')

import torch
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

## Train
if __name__ == '__main__':
    for epoch in range(10):
        for i in range(2):
            loss = model(input_ids=torch.tensor(source_tokens[i]['input_ids']).unsqueeze(0),
                        attention_mask=torch.tensor(source_tokens[i]['attention_mask']).unsqueeze(0),
                        labels=torch.tensor(target_tokens[i]['input_ids']).unsqueeze(0).to(0)).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            print(loss)

Make sure your model has been dispatched across devices by printing set(model.hf_device_map.values()), or you can manually inspect set(model.hf_device_map). If you want to set a custom device map, you can pass a dictionary such as:

custom_device_map = {
    "shared": 0,
    "encoder": 0,
    "decoder": 1,
    "decoder.embed_tokens":0,
    "lm_head": 0,
}

and pass it at initialization: model = T5ForConditionalGeneration.from_pretrained(modelName, device_map=custom_device_map). Note that you need to manually set "decoder.embed_tokens": 0, since the embed_tokens are shared between the encoder and decoder, so you need to make sure they are on the same device (maybe this can be addressed in the future, but I think this is intended; otherwise you would need two copies of the embedding layer even though they are identical).
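
As a minimal sketch of that check (my own illustration, assuming the custom_device_map dictionary above and at least two visible GPUs):

from transformers import T5ForConditionalGeneration

# Load with the custom map and verify where each submodule ended up; the
# tied encoder/decoder embeddings must land on the same device.
model = T5ForConditionalGeneration.from_pretrained("t5-small", device_map=custom_device_map)
print(model.hf_device_map)
assert model.encoder.embed_tokens.weight.device == model.decoder.embed_tokens.weight.device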

MrZhangKY commented 1 year ago

@younesbelkada Thank you very much for your help! However, when I run the code on 4x A6000 GPUs, I get an error:

{0, 1, 2}
tensor(10.3775, grad_fn=<ToCopyBackward0>)
/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [67,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
[... the same assertion is repeated for threads [97,0,0] through [127,0,0] ...]
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[1], line 31
     29 for epoch in range(10):
     30     for i in range(2):
---> 31         loss = model(input_ids=torch.tensor(source_tokens[i]['input_ids']).unsqueeze(0),
     32                     attention_mask=torch.tensor(source_tokens[i]['attention_mask']).unsqueeze(0),
     33                     labels=torch.tensor(target_tokens[i]['input_ids']).unsqueeze(0).to(0)).loss
     34         loss.backward()
     35         optimizer.step()

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
   1190 # If we don't have any hooks, we want to skip the rest of the logic in
   1191 # this function, and just call forward.
   1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194     return forward_call(*input, **kwargs)
   1195 # Do not call functions when jit is used
   1196 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/lib/python3.10/site-packages/accelerate/hooks.py:156, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
    154         output = old_forward(*args, **kwargs)
    155 else:
--> 156     output = old_forward(*args, **kwargs)
    157 return module._hf_hook.post_forward(module, output)

File /opt/conda/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py:1648, in T5ForConditionalGeneration.forward(self, input_ids, attention_mask, decoder_input_ids, decoder_attention_mask, head_mask, decoder_head_mask, cross_attn_head_mask, encoder_outputs, past_key_values, inputs_embeds, decoder_inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
   1645         decoder_attention_mask = decoder_attention_mask.to(self.decoder.first_device)
   1647 # Decode
-> 1648 decoder_outputs = self.decoder(
   1649     input_ids=decoder_input_ids,
   1650     attention_mask=decoder_attention_mask,
   1651     inputs_embeds=decoder_inputs_embeds,
   1652     past_key_values=past_key_values,
   1653     encoder_hidden_states=hidden_states,
   1654     encoder_attention_mask=attention_mask,
   1655     head_mask=decoder_head_mask,
   1656     cross_attn_head_mask=cross_attn_head_mask,
   1657     use_cache=use_cache,
   1658     output_attentions=output_attentions,
   1659     output_hidden_states=output_hidden_states,
   1660     return_dict=return_dict,
   1661 )
   1663 sequence_output = decoder_outputs[0]
   1665 # Set device for model parallelism

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
   1190 # If we don't have any hooks, we want to skip the rest of the logic in
   1191 # this function, and just call forward.
   1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194     return forward_call(*input, **kwargs)
   1195 # Do not call functions when jit is used
   1196 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py:988, in T5Stack.forward(self, input_ids, attention_mask, encoder_hidden_states, encoder_attention_mask, inputs_embeds, head_mask, cross_attn_head_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
    985 position_bias = None
    986 encoder_decoder_position_bias = None
--> 988 hidden_states = self.dropout(inputs_embeds)
    990 for i, (layer_module, past_key_value) in enumerate(zip(self.block, past_key_values)):
    991     layer_head_mask = head_mask[i]

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
   1190 # If we don't have any hooks, we want to skip the rest of the logic in
   1191 # this function, and just call forward.
   1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194     return forward_call(*input, **kwargs)
   1195 # Do not call functions when jit is used
   1196 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/lib/python3.10/site-packages/accelerate/hooks.py:151, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
    149 @functools.wraps(old_forward)
    150 def new_forward(*args, **kwargs):
--> 151     args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
    152     if module._hf_hook.no_grad:
    153         with torch.no_grad():

File /opt/conda/lib/python3.10/site-packages/accelerate/hooks.py:266, in AlignDevicesHook.pre_forward(self, module, *args, **kwargs)
    261     for name, _ in named_module_tensors(
    262         module, include_buffers=self.offload_buffers, recurse=self.place_submodules
    263     ):
    264         set_module_tensor_to_device(module, name, self.execution_device, value=self.weights_map[name])
--> 266 return send_to_device(args, self.execution_device), send_to_device(kwargs, self.execution_device)

File /opt/conda/lib/python3.10/site-packages/accelerate/utils/operations.py:131, in send_to_device(tensor, device, non_blocking)
    128 def _has_to_method(t):
    129     return hasattr(t, "to")
--> 131 return recursively_apply(_send_to_device, tensor, device, non_blocking, test_type=_has_to_method)

File /opt/conda/lib/python3.10/site-packages/accelerate/utils/operations.py:80, in recursively_apply(func, data, test_type, error_on_other_type, *args, **kwargs)
     58 """
     59 Recursively apply a function on a data structure that is a nested list/tuple/dictionary of a given base type.
     60 
   (...)
     77     The same data structure as `data` with `func` applied to every object of type `main_type`.
     78 """
     79 if isinstance(data, (tuple, list)):
---> 80     return honor_type(
     81         data,
     82         (
     83             recursively_apply(
     84                 func, o, *args, test_type=test_type, error_on_other_type=error_on_other_type, **kwargs
     85             )
     86             for o in data
     87         ),
     88     )
     89 elif isinstance(data, Mapping):
     90     return type(data)(
     91         {
     92             k: recursively_apply(
   (...)
     96         }
     97     )

File /opt/conda/lib/python3.10/site-packages/accelerate/utils/operations.py:51, in honor_type(obj, generator)
     47 """
     48 Cast a generator to the same type as obj (list, tuple or namedtuple)
     49 """
     50 try:
---> 51     return type(obj)(generator)
     52 except TypeError:
     53     # Some objects may not be able to instantiate from a generator directly
     54     return type(obj)(*list(generator))

File /opt/conda/lib/python3.10/site-packages/accelerate/utils/operations.py:83, in <genexpr>(.0)
     58 """
     59 Recursively apply a function on a data structure that is a nested list/tuple/dictionary of a given base type.
     60 
   (...)
     77     The same data structure as `data` with `func` applied to every object of type `main_type`.
     78 """
     79 if isinstance(data, (tuple, list)):
     80     return honor_type(
     81         data,
     82         (
---> 83             recursively_apply(
     84                 func, o, *args, test_type=test_type, error_on_other_type=error_on_other_type, **kwargs
     85             )
     86             for o in data
     87         ),
     88     )
     89 elif isinstance(data, Mapping):
     90     return type(data)(
     91         {
     92             k: recursively_apply(
   (...)
     96         }
     97     )

File /opt/conda/lib/python3.10/site-packages/accelerate/utils/operations.py:99, in recursively_apply(func, data, test_type, error_on_other_type, *args, **kwargs)
     90     return type(data)(
     91         {
     92             k: recursively_apply(
   (...)
     96         }
     97     )
     98 elif test_type(data):
---> 99     return func(data, *args, **kwargs)
    100 elif error_on_other_type:
    101     raise TypeError(
    102         f"Can't apply {func.__name__} on object of type {type(data)}, only of nested list/tuple/dicts of objects "
    103         f"that satisfy {test_type.__name__}."
    104     )

File /opt/conda/lib/python3.10/site-packages/accelerate/utils/operations.py:124, in send_to_device.<locals>._send_to_device(t, device, non_blocking)
    122 def _send_to_device(t, device, non_blocking):
    123     try:
--> 124         return t.to(device, non_blocking=non_blocking)
    125     except TypeError:  # .to() doesn't accept non_blocking as kwarg
    126         return t.to(device)

RuntimeError: CUDA error: device-side assert triggered

The code I ran:

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

## Data
source = ['[ equal statistic job customer ostrich orange badger blue bull daisy giraffe hamster ivy rabbit possum whale cashew rectangle oval square pecan ] [ you orchid orange frog grey ivy racoon potato whale flax cylinder fennel pumpkin leopard ] [ desire otter mole albatross bat buffalo cat daisy grey hedgehog holly racoon squirrel potato whale apricot cylinder heart fennel raisin lilac ] [ action watch speak otter orange alligator bat bull clover daisy fish green horse racoon potato whale apricot rectangle circle fennel pecan lavender ] [ emotion ostrich orange alligator badger black cobra daisy fish green hazel horse squirrel potato wolf rectangle oval fennel pecan leopard ]', '[ problem ostrich mule badger black bull fox green hazel ivy rabbit potato whale sunflower sphere circle triangle pecan lemur ] [ company frequency time visual laws owl mule alligator badger black clover deer donkey emu grey horse racoon possum whale apricot rectangle oval fennel pecan lavender ] [ lion otter orange baboon bull cat daisy flamingo green iris racoon possum vulture apricot rectangle fennel raisin lemur ] [ equal orchid orange blossom camel chameleon deer fish green jackal possum whale rectangle oval triangle pecan ] [ equal conversation speak orchid orange bat camel deer fish grey iris racoon possum whale flax rectangle triangle pumpkin lemur ]']
target = ['Later on in a career, 300 people are clients. You have their attention. The only attention needed is to make up for attrition or to continue growth at the desired rate. With that being said, be on the lookout for me in some crazy Hawaiian shirts. Maybe a fun tie when I have to dress up.', 'To his detriment, I don’t remember anything else. I thought of the rule again a few days ago because of a Hawaiian style, Sandlot movie shirt. I was the person wearing it, and yes, I did receive all kinds of attention. Or maybe, it was the shirt. Either way people were talking to me.']

modelName = "t5-small"

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(modelName, model_max_length=512)
source_tokens = [tokenizer(i) for i in source]
target_tokens = [tokenizer(i) for i in target]

# Model & Optimizer
from transformers import T5ForConditionalGeneration
model = T5ForConditionalGeneration.from_pretrained(modelName, device_map="balanced")
print(set(model.hf_device_map.values()))
# model.parallelize()  #Model Parallelism
# model.to('cuda:0')

import torch
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

## Train
if __name__ == '__main__':
    for epoch in range(10):
        for i in range(2):
            loss = model(input_ids=torch.tensor(source_tokens[i]['input_ids']).unsqueeze(0),
                        attention_mask=torch.tensor(source_tokens[i]['attention_mask']).unsqueeze(0),
                        labels=torch.tensor(target_tokens[i]['input_ids']).unsqueeze(0).to(0)).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            print(loss)

Is there a problem when using more than 2 GPUs?

younesbelkada commented 1 year ago

Interesting, can you run the same script on CPU? Whenever you have a RuntimeError: CUDA error: device-side assert triggered, a good practice is to run the same script on CPU and check the error message.
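
As a rough sketch of that debugging step (my own adaptation, not a verified fix: load the model without a device_map so everything stays on CPU and an out-of-range index raises a readable Python exception instead of an opaque device-side assert):

from transformers import AutoTokenizer, T5ForConditionalGeneration

# CPU-only run: no device_map, no .to('cuda'); errors surface as normal
# Python exceptions with full tracebacks.
tokenizer = AutoTokenizer.from_pretrained("t5-small", model_max_length=512)
model = T5ForConditionalGeneration.from_pretrained("t5-small")

batch = tokenizer("some source text", return_tensors="pt")
labels = tokenizer("some target text", return_tensors="pt").input_ids
loss = model(input_ids=batch.input_ids, attention_mask=batch.attention_mask, labels=labels).loss
print(loss)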

MrZhangKY commented 1 year ago

@younesbelkada It's strange. When I change the environment to 2 GPUs, it works...

MrZhangKY commented 1 year ago

@younesbelkada I think there are some problems when using more than 2 GPUs (for example, 4 GPUs). Do you have plans to fix this problem?

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

rabdumalikov commented 1 year ago

This issue still occurs on the newest transformers version, 4.26.1. I also managed to train on two GPUs, but when I increase the number of GPUs, I get the error "RuntimeError: CUDA error: device-side assert triggered".