parallelize() is a deprecated function and should not be used. You should use the accelerate library instead; see here.
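For reference, a minimal sketch of the accelerate-style replacement (the checkpoint name below is just a placeholder; swap in the one you actually use):
from transformers import T5ForConditionalGeneration
# device_map="auto" lets accelerate spread the model's layers across the available GPUs,
# replacing the deprecated model.parallelize() call
model = T5ForConditionalGeneration.from_pretrained("t5-small", device_map="auto")
print(model.hf_device_map)  # shows which device each submodule was placed on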
@ArthurZucker Thank you for your help. However, I use parallelize() because it can distribute the layers of a model across different GPUs, as shown below:
Some models for T5ForConditionalGeneration are so big (for example, t5-11b is about 45 GB) that they cannot fit on a single GPU.
But when I use accelerate to distribute the model, it seems to use data parallelism and still puts all layers of the model on a single GPU, as shown below:
Could you please tell me how to write code that trains the model with model parallelism? Thank you!
Hi @MrZhangKY
In this case you can use device_map='balanced'; the script below worked for me (no NaN loss) on 2x NVIDIA T4 GPUs:
## Data
source = ['[ equal statistic job customer ostrich orange badger blue bull daisy giraffe hamster ivy rabbit possum whale cashew rectangle oval square pecan ] [ you orchid orange frog grey ivy racoon potato whale flax cylinder fennel pumpkin leopard ] [ desire otter mole albatross bat buffalo cat daisy grey hedgehog holly racoon squirrel potato whale apricot cylinder heart fennel raisin lilac ] [ action watch speak otter orange alligator bat bull clover daisy fish green horse racoon potato whale apricot rectangle circle fennel pecan lavender ] [ emotion ostrich orange alligator badger black cobra daisy fish green hazel horse squirrel potato wolf rectangle oval fennel pecan leopard ]', '[ problem ostrich mule badger black bull fox green hazel ivy rabbit potato whale sunflower sphere circle triangle pecan lemur ] [ company frequency time visual laws owl mule alligator badger black clover deer donkey emu grey horse racoon possum whale apricot rectangle oval fennel pecan lavender ] [ lion otter orange baboon bull cat daisy flamingo green iris racoon possum vulture apricot rectangle fennel raisin lemur ] [ equal orchid orange blossom camel chameleon deer fish green jackal possum whale rectangle oval triangle pecan ] [ equal conversation speak orchid orange bat camel deer fish grey iris racoon possum whale flax rectangle triangle pumpkin lemur ]']
target = ['Later on in a career, 300 people are clients. You have their attention. The only attention needed is to make up for attrition or to continue growth at the desired rate. With that being said, be on the lookout for me in some crazy Hawaiian shirts. Maybe a fun tie when I have to dress up.', 'To his detriment, I don’t remember anything else. I thought of the rule again a few days ago because of a Hawaiian style, Sandlot movie shirt. I was the person wearing it, and yes, I did receive all kinds of attention. Or maybe, it was the shirt. Either way people were talking to me.']
modelName = "t5-small"
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(modelName, model_max_length=512)
source_tokens = [tokenizer(i) for i in source]
target_tokens = [tokenizer(i) for i in target]
# Model & Optimizer
from transformers import T5ForConditionalGeneration
model = T5ForConditionalGeneration.from_pretrained(modelName, device_map="balanced")
print(set(model.hf_device_map.values()))
# model.parallelize() #Model Parallelism
# model.to('cuda:0')
import torch
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
## Train
if __name__ == '__main__':
    for epoch in range(10):
        for i in range(2):
            loss = model(input_ids=torch.tensor(source_tokens[i]['input_ids']).unsqueeze(0),
                         attention_mask=torch.tensor(source_tokens[i]['attention_mask']).unsqueeze(0),
                         labels=torch.tensor(target_tokens[i]['input_ids']).unsqueeze(0).to(0)).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            print(loss)
Make sure your model has been dispatched by printing set(model.hf_device_map.values()), or you can manually inspect set(model.hf_device_map).
If you want to set a custom device map you can pass a dictionary such as:
custom_device_map = {
    "shared": 0,
    "encoder": 0,
    "decoder": 1,
    "decoder.embed_tokens": 0,
    "lm_head": 0,
}
and pass it at initialization: model = T5ForConditionalGeneration.from_pretrained(modelName, device_map=custom_device_map). Note that you need to manually set "decoder.embed_tokens": 0, since the embed_tokens are shared between the encoder and decoder, so you need to make sure they are on the same device (maybe this can be addressed in the future, but I think this is intended - otherwise you would need two copies of the embedding layer even though they are the same).
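As a quick sanity check after loading with the custom map above (a small sketch, reusing the model variable from the snippet above), you can confirm that the shared embeddings and decoder.embed_tokens ended up on the same device:
# both entries should report device 0; a mismatch will cause device errors in the forward pass
print(model.hf_device_map["shared"], model.hf_device_map["decoder.embed_tokens"])
assert model.hf_device_map["shared"] == model.hf_device_map["decoder.embed_tokens"]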
@younesbelkada Thank you very much for your help! However, when I run the code on 4x A6000 GPUs, I get the following error:
{0, 1, 2}
tensor(10.3775, grad_fn=<ToCopyBackward0>)
/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [67,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [67,0,0], thread: [97,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [67,0,0], thread: [98,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [67,0,0], thread: [99,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [67,0,0], thread: [100,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [67,0,0], thread: [101,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [67,0,0], thread: [102,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [67,0,0], thread: [103,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [67,0,0], thread: [104,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [67,0,0], thread: [105,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [67,0,0], thread: [106,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [67,0,0], thread: [107,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [67,0,0], thread: [108,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [67,0,0], thread: [109,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [67,0,0], thread: [110,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [67,0,0], thread: [111,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [67,0,0], thread: [112,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [67,0,0], thread: [113,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [67,0,0], thread: [114,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [67,0,0], thread: [115,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [67,0,0], thread: [116,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [67,0,0], thread: [117,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [67,0,0], thread: [118,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [67,0,0], thread: [119,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [67,0,0], thread: [120,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [67,0,0], thread: [121,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [67,0,0], thread: [122,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [67,0,0], thread: [123,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [67,0,0], thread: [124,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [67,0,0], thread: [125,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [67,0,0], thread: [126,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [67,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[1], line 31
29 for epoch in range(10):
30 for i in range(2):
---> 31 loss = model(input_ids=torch.tensor(source_tokens[i]['input_ids']).unsqueeze(0),
32 attention_mask=torch.tensor(source_tokens[i]['attention_mask']).unsqueeze(0),
33 labels=torch.tensor(target_tokens[i]['input_ids']).unsqueeze(0).to(0)).loss
34 loss.backward()
35 optimizer.step()
File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
1190 # If we don't have any hooks, we want to skip the rest of the logic in
1191 # this function, and just call forward.
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
File /opt/conda/lib/python3.10/site-packages/accelerate/hooks.py:156, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
154 output = old_forward(*args, **kwargs)
155 else:
--> 156 output = old_forward(*args, **kwargs)
157 return module._hf_hook.post_forward(module, output)
File /opt/conda/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py:1648, in T5ForConditionalGeneration.forward(self, input_ids, attention_mask, decoder_input_ids, decoder_attention_mask, head_mask, decoder_head_mask, cross_attn_head_mask, encoder_outputs, past_key_values, inputs_embeds, decoder_inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
1645 decoder_attention_mask = decoder_attention_mask.to(self.decoder.first_device)
1647 # Decode
-> 1648 decoder_outputs = self.decoder(
1649 input_ids=decoder_input_ids,
1650 attention_mask=decoder_attention_mask,
1651 inputs_embeds=decoder_inputs_embeds,
1652 past_key_values=past_key_values,
1653 encoder_hidden_states=hidden_states,
1654 encoder_attention_mask=attention_mask,
1655 head_mask=decoder_head_mask,
1656 cross_attn_head_mask=cross_attn_head_mask,
1657 use_cache=use_cache,
1658 output_attentions=output_attentions,
1659 output_hidden_states=output_hidden_states,
1660 return_dict=return_dict,
1661 )
1663 sequence_output = decoder_outputs[0]
1665 # Set device for model parallelism
File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
1190 # If we don't have any hooks, we want to skip the rest of the logic in
1191 # this function, and just call forward.
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
File /opt/conda/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py:988, in T5Stack.forward(self, input_ids, attention_mask, encoder_hidden_states, encoder_attention_mask, inputs_embeds, head_mask, cross_attn_head_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
985 position_bias = None
986 encoder_decoder_position_bias = None
--> 988 hidden_states = self.dropout(inputs_embeds)
990 for i, (layer_module, past_key_value) in enumerate(zip(self.block, past_key_values)):
991 layer_head_mask = head_mask[i]
File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
1190 # If we don't have any hooks, we want to skip the rest of the logic in
1191 # this function, and just call forward.
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
File /opt/conda/lib/python3.10/site-packages/accelerate/hooks.py:151, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
149 @functools.wraps(old_forward)
150 def new_forward(*args, **kwargs):
--> 151 args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
152 if module._hf_hook.no_grad:
153 with torch.no_grad():
File /opt/conda/lib/python3.10/site-packages/accelerate/hooks.py:266, in AlignDevicesHook.pre_forward(self, module, *args, **kwargs)
261 for name, _ in named_module_tensors(
262 module, include_buffers=self.offload_buffers, recurse=self.place_submodules
263 ):
264 set_module_tensor_to_device(module, name, self.execution_device, value=self.weights_map[name])
--> 266 return send_to_device(args, self.execution_device), send_to_device(kwargs, self.execution_device)
File /opt/conda/lib/python3.10/site-packages/accelerate/utils/operations.py:131, in send_to_device(tensor, device, non_blocking)
128 def _has_to_method(t):
129 return hasattr(t, "to")
--> 131 return recursively_apply(_send_to_device, tensor, device, non_blocking, test_type=_has_to_method)
File /opt/conda/lib/python3.10/site-packages/accelerate/utils/operations.py:80, in recursively_apply(func, data, test_type, error_on_other_type, *args, **kwargs)
58 """
59 Recursively apply a function on a data structure that is a nested list/tuple/dictionary of a given base type.
60
(...)
77 The same data structure as `data` with `func` applied to every object of type `main_type`.
78 """
79 if isinstance(data, (tuple, list)):
---> 80 return honor_type(
81 data,
82 (
83 recursively_apply(
84 func, o, *args, test_type=test_type, error_on_other_type=error_on_other_type, **kwargs
85 )
86 for o in data
87 ),
88 )
89 elif isinstance(data, Mapping):
90 return type(data)(
91 {
92 k: recursively_apply(
(...)
96 }
97 )
File /opt/conda/lib/python3.10/site-packages/accelerate/utils/operations.py:51, in honor_type(obj, generator)
47 """
48 Cast a generator to the same type as obj (list, tuple or namedtuple)
49 """
50 try:
---> 51 return type(obj)(generator)
52 except TypeError:
53 # Some objects may not be able to instantiate from a generator directly
54 return type(obj)(*list(generator))
File /opt/conda/lib/python3.10/site-packages/accelerate/utils/operations.py:83, in <genexpr>(.0)
58 """
59 Recursively apply a function on a data structure that is a nested list/tuple/dictionary of a given base type.
60
(...)
77 The same data structure as `data` with `func` applied to every object of type `main_type`.
78 """
79 if isinstance(data, (tuple, list)):
80 return honor_type(
81 data,
82 (
---> 83 recursively_apply(
84 func, o, *args, test_type=test_type, error_on_other_type=error_on_other_type, **kwargs
85 )
86 for o in data
87 ),
88 )
89 elif isinstance(data, Mapping):
90 return type(data)(
91 {
92 k: recursively_apply(
(...)
96 }
97 )
File /opt/conda/lib/python3.10/site-packages/accelerate/utils/operations.py:99, in recursively_apply(func, data, test_type, error_on_other_type, *args, **kwargs)
90 return type(data)(
91 {
92 k: recursively_apply(
(...)
96 }
97 )
98 elif test_type(data):
---> 99 return func(data, *args, **kwargs)
100 elif error_on_other_type:
101 raise TypeError(
102 f"Can't apply {func.__name__} on object of type {type(data)}, only of nested list/tuple/dicts of objects "
103 f"that satisfy {test_type.__name__}."
104 )
File /opt/conda/lib/python3.10/site-packages/accelerate/utils/operations.py:124, in send_to_device.<locals>._send_to_device(t, device, non_blocking)
122 def _send_to_device(t, device, non_blocking):
123 try:
--> 124 return t.to(device, non_blocking=non_blocking)
125 except TypeError: # .to() doesn't accept non_blocking as kwarg
126 return t.to(device)
RuntimeError: CUDA error: device-side assert triggered
The code I ran:
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
## Data
source = ['[ equal statistic job customer ostrich orange badger blue bull daisy giraffe hamster ivy rabbit possum whale cashew rectangle oval square pecan ] [ you orchid orange frog grey ivy racoon potato whale flax cylinder fennel pumpkin leopard ] [ desire otter mole albatross bat buffalo cat daisy grey hedgehog holly racoon squirrel potato whale apricot cylinder heart fennel raisin lilac ] [ action watch speak otter orange alligator bat bull clover daisy fish green horse racoon potato whale apricot rectangle circle fennel pecan lavender ] [ emotion ostrich orange alligator badger black cobra daisy fish green hazel horse squirrel potato wolf rectangle oval fennel pecan leopard ]', '[ problem ostrich mule badger black bull fox green hazel ivy rabbit potato whale sunflower sphere circle triangle pecan lemur ] [ company frequency time visual laws owl mule alligator badger black clover deer donkey emu grey horse racoon possum whale apricot rectangle oval fennel pecan lavender ] [ lion otter orange baboon bull cat daisy flamingo green iris racoon possum vulture apricot rectangle fennel raisin lemur ] [ equal orchid orange blossom camel chameleon deer fish green jackal possum whale rectangle oval triangle pecan ] [ equal conversation speak orchid orange bat camel deer fish grey iris racoon possum whale flax rectangle triangle pumpkin lemur ]']
target = ['Later on in a career, 300 people are clients. You have their attention. The only attention needed is to make up for attrition or to continue growth at the desired rate. With that being said, be on the lookout for me in some crazy Hawaiian shirts. Maybe a fun tie when I have to dress up.', 'To his detriment, I don’t remember anything else. I thought of the rule again a few days ago because of a Hawaiian style, Sandlot movie shirt. I was the person wearing it, and yes, I did receive all kinds of attention. Or maybe, it was the shirt. Either way people were talking to me.']
modelName = "t5-small"
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(modelName, model_max_length=512)
source_tokens = [tokenizer(i) for i in source]
target_tokens = [tokenizer(i) for i in target]
# Model & Optimizer
from transformers import T5ForConditionalGeneration
model = T5ForConditionalGeneration.from_pretrained(modelName, device_map="balanced")
print(set(model.hf_device_map.values()))
# model.parallelize() #Model Parallelism
# model.to('cuda:0')
import torch
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
## Train
if __name__ == '__main__':
    for epoch in range(10):
        for i in range(2):
            loss = model(input_ids=torch.tensor(source_tokens[i]['input_ids']).unsqueeze(0),
                         attention_mask=torch.tensor(source_tokens[i]['attention_mask']).unsqueeze(0),
                         labels=torch.tensor(target_tokens[i]['input_ids']).unsqueeze(0).to(0)).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            print(loss)
Is there any problem when using more than 2 GPUs?
Interesting, can you run the same script on CPU?
Whenever you have a RuntimeError: CUDA error: device-side assert triggered, a good practice is to run the same script on CPU and check the error message.
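For example, a rough sketch of such a CPU-only check, reusing the variables from the script above (no device_map, so everything stays on CPU and indexing problems surface with a readable message):
cpu_model = T5ForConditionalGeneration.from_pretrained(modelName)  # no device_map: model stays on CPU
out = cpu_model(input_ids=torch.tensor(source_tokens[0]['input_ids']).unsqueeze(0),
                attention_mask=torch.tensor(source_tokens[0]['attention_mask']).unsqueeze(0),
                labels=torch.tensor(target_tokens[0]['input_ids']).unsqueeze(0))
print(out.loss)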
@younesbelkada It's strange. When I change the environment to 2 GPUs, it works...
@younesbelkada I think there are some problems when using more than 2 GPUs (for example, 4 GPUs). Do you have plans to fix this problem?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
This issue is still occurring on the newest transformers version, 4.26.1. I also managed to train on two GPUs, but when I increase the number of GPUs, I get the error "RuntimeError: CUDA error: device-side assert triggered".
System Info
Who can help?
@ArthurZucker @younesbelkada
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
When I use "t5-small" to generate target text from source text, if I set model.parallelize() before computing the loss, the loss will be nan. But if I just set model.cuda(), the loss is normal. Is there anything wrong with the parallelize() function? Because as far as I know, PyTorch does not need any special settings for backward() and parameter updates under model parallelism. There is a toy sample:
Expected behavior
The loss shouldn't be nan when model.parallelize() is set; it should be the same as when model.to('cuda:0') is set.