Using distributed or parallel set-up in script?: No
Who can help?
@amyeroberts
Information
[ ] The official example scripts
[X] My own modified scripts
Tasks
[ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[X] My own task or dataset (give details below)
Reproduction
It is not possible to train VideoMAEForPreTraining with bfloat16, because the labels are always stored as float32.
This code snippet triggers the error.
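(The original snippet did not survive extraction; the following is a minimal sketch reconstructed from the traceback below. The `MCG-NJU/videomae-base` checkpoint, the random video, and the random tube mask are assumptions for illustration, not the exact original code.)

```python
import torch
from transformers import VideoMAEForPreTraining

# Assumption: any VideoMAE checkpoint loaded in bfloat16 shows the same behaviour;
# "MCG-NJU/videomae-base" is used here only as an example.
model = VideoMAEForPreTraining.from_pretrained(
    "MCG-NJU/videomae-base", torch_dtype=torch.bfloat16
)

config = model.config
num_patches_per_frame = (config.image_size // config.patch_size) ** 2
seq_length = (config.num_frames // config.tubelet_size) * num_patches_per_frame

# Random video (batch, frames, channels, height, width) and random mask,
# only to drive the forward/backward pass.
pixel_values = torch.randn(
    1, config.num_frames, config.num_channels, config.image_size, config.image_size
)
bool_masked_pos = torch.randint(0, 2, (1, seq_length)).bool()

outputs = model(
    pixel_values.to(device=model.device, dtype=model.dtype),
    bool_masked_pos=bool_masked_pos,
)
loss = outputs.loss

loss.backward()  # RuntimeError: Found dtype Float but expected BFloat16
```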
RuntimeError Traceback (most recent call last)
Cell In[1], line 20
17 outputs = model(pixel_values.to(device=model.device,dtype=model.dtype), bool_masked_pos=bool_masked_pos)
18 loss = outputs.loss
---> 20 loss.backward()
File ~/miniconda3/envs/transformers/lib/python3.10/site-packages/torch/_tensor.py:492, in Tensor.backward(self, gradient, retain_graph, create_graph, inputs)
482 if has_torch_function_unary(self):
483 return handle_torch_function(
484 Tensor.backward,
485 (self,),
(...)
490 inputs=inputs,
491 )
--> 492 torch.autograd.backward(
493 self, gradient, retain_graph, create_graph, inputs=inputs
494 )
File ~/miniconda3/envs/transformers/lib/python3.10/site-packages/torch/autograd/__init__.py:251, in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
246 retain_graph = create_graph
248 # The reason we repeat the same comment below is that
249 # some Python versions print out the first line of a multi-line function
250 # calls in the traceback and some print out the last line
--> 251 Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
252 tensors,
253 grad_tensors_,
254 retain_graph,
255 create_graph,
256 inputs,
257 allow_unreachable=True,
258 accumulate_grad=True,
259 )
RuntimeError: Found dtype Float but expected BFloat16
The problem is that when the loss is computed, the labels are in float32; therefore, the returned loss is also in float32.
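The same failure can be shown in isolation: an MSE loss between bfloat16 logits and float32 labels is promoted to a float32 loss, and the backward pass then rejects the mismatched gradient dtype. This is only a minimal illustration of the mechanism, not VideoMAE code:

```python
import torch

logits = torch.randn(2, 8, dtype=torch.bfloat16, requires_grad=True)
labels = torch.randn(2, 8, dtype=torch.float32)  # mimics the float32 labels built in forward()

loss = torch.nn.functional.mse_loss(logits, labels)
print(loss.dtype)  # float32: the loss is promoted by the float32 labels

loss.backward()  # RuntimeError: Found dtype Float but expected BFloat16
```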
Expected behavior
Labels should be converted to the same dtype as the logits.
PR #27296 fixes the error, although I am not 100% sure that it is the best way to handle the problem.
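The general idea is to cast the labels to the logits' dtype before the loss is computed in `VideoMAEForPreTraining.forward`. A minimal sketch of that idea (the helper name is hypothetical and this is not necessarily the exact change in the PR):

```python
import torch
from torch.nn import MSELoss


def masked_reconstruction_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper mirroring the loss computation in VideoMAEForPreTraining.forward.

    Casting the float32 reconstruction targets to the logits' dtype keeps the loss,
    and hence the backward pass, in the model's dtype (e.g. bfloat16).
    """
    loss_fct = MSELoss()
    return loss_fct(logits, labels.to(logits.dtype))
```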