Open oroojlooy opened 1 year ago
I cannot transfer the issue to the trl repo, but it should be opened there since the bug is in their example.
@sgugger I already have posted it there, and it seems that the issue is not on TRL side.
torch.autograd.set_detect_anomaly(True)
reports that the root of the issue might be line 201 in site-packages/transformers/models/gpt2/modeling_gpt2.py.
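For context, a minimal self-contained sketch of enabling that flag (the gpt2 model and dummy input here are placeholders, not the actual reward_summarization setup):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Enable anomaly detection so the backward pass reports which forward op
# produced the tensor that was later modified in place.
torch.autograd.set_detect_anomaly(True)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("a short test sentence", return_tensors="pt")
loss = model(**inputs, labels=inputs["input_ids"]).loss
loss.backward()  # if a saved tensor was mutated, the traceback now names the offending forward op
```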
It turned out that modifying line 201 as below solves the issue.
attn_weights = torch.where(causal_mask.clone(), attn_weights.to(attn_weights.dtype).clone(), mask_value)
Remember that it was:
attn_weights = torch.where(causal_mask, attn_weights.to(attn_weights.dtype), mask_value)
@sgugger Do you know if it is a safe modification?
This will break the flow of the gradients from the attention weights, so no, it's not a good fix.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Any update on this? I am having the same issue
I'm experiencing same issue with WhisperModel
Actually, according to torch, the clone() operation does not break the flow of the gradient. See here:
This function is differentiable, so gradients will flow back from the result of this operation to input. To create a tensor without an autograd relationship to input see detach().
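A quick check of that statement:

```python
import torch

# clone() keeps the autograd connection: gradients flow back through the clone to x.
x = torch.randn(3, requires_grad=True)
y = x.clone() * 2.0
y.sum().backward()
print(x.grad)  # tensor([2., 2., 2.]); clone() did not detach x from the graph
```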
Apparently, previous torch versions did not check for these, but the gradients were wrong (the source is a lost Stack Overflow thread). There are at least 5 more issues linked to this one: #25130, #22225, #15677, #14179, #24996, #23087. Now whether this was fixed in the latest versions of torch is also a question, but all these issues use FSDP.
Every inplace operation seems to be causing this, but we have a lot of these 😓 cc @pacman100, wondering what you would recommend? Should we make everything compatible by removing inplace operations? That seems kind of impractical.
This wrapper: https://github.com/pytorch/pytorch/blob/main/torch/autograd/graph.py#L508 seems to add clone() wherever it's needed. Might be something to do there?
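Assuming the wrapper behind that link is torch.autograd.graph.allow_mutation_on_saved_tensors (an assumption based on the URL; available in recent PyTorch versions), a minimal usage sketch adapted from its docs:

```python
import torch

# Inside the context manager, tensors saved for backward are copied before an
# in-place op mutates them, so backward still sees the original values.
a = torch.ones(2, 3, requires_grad=True)

with torch.autograd.graph.allow_mutation_on_saved_tensors():
    b = a.clone()
    out = (b ** 2).sum()  # pow() saves b for the backward pass
    b.sin_()              # in-place mutation that would normally raise during backward
    out.backward()        # works: the saved value of b was preserved
```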
We should also pin the issue to redirect everyone who hits the FSDP + inplace operation problem.
Also, removing all inplace operations might increase memory usage a bit, so I would love an alternative solution for FSDP.
I'm hitting the same issue, while trying to get the gpt2 embeddings of target via the following call:
self.gpt2.transformer.wte(target)
Error message:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation:
However, I tried the trick below and it succeeded.
self.gpt2.transformer.wte(target.clone())
BTW, the gpt2 model is set to evaluation mode: self.gpt2.eval()
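A self-contained sketch of that workaround (the surrounding wrapper class is omitted; the model name and shapes are arbitrary placeholders):

```python
import torch
from transformers import GPT2LMHeadModel

gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
gpt2.eval()

# Arbitrary token indices standing in for `target`.
target = torch.randint(0, gpt2.config.vocab_size, (2, 16))

# gpt2.transformer.wte(target) hit the in-place error in the setup described above;
# cloning the indices before the embedding lookup avoided it there.
target_embeds = gpt2.transformer.wte(target.clone())
```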
Hello,
cc @pacman100, wondering what you would recommend? Should we make everything compatible by removing inplace operations? That seems kind of impractical.
I don't have any recommendations at present other than replacing in-place operations. Let me try this example to see if it persists with the latest PyTorch version.
Will mark as WIP, as this is not something we are actively working on.
The error is triggered by the DDP buffer broadcasting mechanism. We need to set broadcast_buffers=False to avoid it:
model = torch.nn.parallel.DistributedDataParallel(model, broadcast_buffers=False, ...)
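A fuller sketch of that wrap, assuming the script is launched with torchrun and build_model() is a hypothetical stand-in for the actual model construction:

```python
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = build_model().cuda(local_rank)  # build_model() is a placeholder, not a real API
model = torch.nn.parallel.DistributedDataParallel(
    model,
    device_ids=[local_rank],
    broadcast_buffers=False,  # skip the buffer sync that triggers the in-place error
)
```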
System Info
transformers 4.28.1
torch 2.0.0
torchaudio 2.0.0
torchvision 0.15.0
huggingface-hub 0.13.4
trl 0.4.2.dev0
Who can help?
Probably people from accelerate, trainer, and text: @pacman100, @sgugger, @ArthurZucker
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
1. Use the script in trl/examples/summarization/scripts
2. Set the accelerate config like this
3. accelerate launch reward_summarization.py
This results in the following error:
Expected behavior
I expect it to run fine, but it ends in that error. Although it is not native Hugging Face code, it seems that the issue is from the gpt2 trainer code.