Open oroojlooy opened 1 year ago
I cannot transfer the issue to the trl repo, but it should be opened there since the bug is in their example.
@sgugger I already have posted it there, and it seems that the issue is not on TRL side.
torch.autograd.set_detect_anomaly(True)
reports that the root of the issue might be line 201 in site-packages/transformers/models/gpt2/modeling_gpt2.py.
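For context, a minimal self-contained sketch of enabling that flag (the gpt2 model and dummy input here are placeholders, not the actual reward_summarization setup):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Enable anomaly detection so the backward pass reports which forward op
# produced the tensor that was later modified in place.
torch.autograd.set_detect_anomaly(True)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("a short test sentence", return_tensors="pt")
loss = model(**inputs, labels=inputs["input_ids"]).loss
loss.backward()  # if a saved tensor was mutated, the traceback now names the offending forward op
```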
It turned out that modifying line 201 as below solves the issue.
attn_weights = torch.where(causal_mask.clone(), attn_weights.to(attn_weights.dtype).clone(), mask_value)
Remember that it was:
attn_weights = torch.where(causal_mask, attn_weights.to(attn_weights.dtype), mask_value)
@sgugger Do you know if it is a safe modification?
This will break the flow of the gradients from the attention weights, so no, it's not a good fix.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Any update on this? I am having the same issue
I'm experiencing same issue with WhisperModel
Actually, according to torch, the clone() operation does not break the flow of the gradient. See here:
This function is differentiable, so gradients will flow back from the result of this operation to input. To create a tensor without an autograd relationship to input see detach().
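A quick check of that statement:

```python
import torch

# clone() keeps the autograd connection: gradients flow back through the clone to x.
x = torch.randn(3, requires_grad=True)
y = x.clone() * 2.0
y.sum().backward()
print(x.grad)  # tensor([2., 2., 2.]); clone() did not detach x from the graph
```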
Apparently, previous torch versions did not check for these, but the gradients were wrong (the source is a lost Stack Overflow thread). There are at least 5 more issues linked to this one: #25130, #22225, #15677, #14179, #24996, #23087. Now whether this was fixed in the latest versions of torch is also a question, but all these issues use FSDP.
Every inplace operation seems to be causing this, but we have a lot of these 😓 cc @pacman100, wondering what you would recommend? Should we make everything compatible by removing inplace operations? That seems kind of impractical.
This wrapper: https://github.com/pytorch/pytorch/blob/main/torch/autograd/graph.py#L508 seems to add clone() wherever it's needed. Might be something to do there?
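Assuming the wrapper behind that link is torch.autograd.graph.allow_mutation_on_saved_tensors (an assumption based on the URL; available in recent PyTorch versions), a minimal usage sketch adapted from its docs:

```python
import torch

# Inside the context manager, tensors saved for backward are copied before an
# in-place op mutates them, so backward still sees the original values.
a = torch.ones(2, 3, requires_grad=True)

with torch.autograd.graph.allow_mutation_on_saved_tensors():
    b = a.clone()
    out = (b ** 2).sum()  # pow() saves b for the backward pass
    b.sin_()              # in-place mutation that would normally raise during backward
    out.backward()        # works: the saved value of b was preserved
```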
We should also pin the issue to redirect everyone who hits the FSDP + inplace operation problem.
Also, removing all inplace operations might increase memory usage a bit, so I would love an alternative solution for FSDP.
I'm hitting the same issue, while trying to get the gpt2 embeddings of target via the following call:
self.gpt2.transformer.wte(target)
Error message:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation:
However, I tried the trick below and it succeeded.
self.gpt2.transformer.wte(target.clone())
BTW, the gpt2 model is set to evaluation mode: self.gpt2.eval()
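A self-contained sketch of that workaround (the surrounding wrapper class is omitted; the model name and shapes are arbitrary placeholders):

```python
import torch
from transformers import GPT2LMHeadModel

gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
gpt2.eval()

# Arbitrary token indices standing in for `target`.
target = torch.randint(0, gpt2.config.vocab_size, (2, 16))

# gpt2.transformer.wte(target) hit the in-place error in the setup described above;
# cloning the indices before the embedding lookup avoided it there.
target_embeds = gpt2.transformer.wte(target.clone())
```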
Hello,
cc @pacman100, wondering what you would recommend? Should we make everything compatible by removing inplace operations? That seems kind of impractical.
I don't have any recommendations at present other than replacing in-place operations. Let me try this example to see if it persists with the latest PyTorch version.
Will mark as WIP, as this is not something we are actively working on.
The error is triggered by the DDP buffer broadcasting mechanism. We need to set broadcast_buffers=False to avoid it:
model = torch.nn.parallel.DistributedDataParallel(model, broadcast_buffers=False, ...)
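A fuller sketch of that wrap, assuming the script is launched with torchrun and build_model() is a hypothetical stand-in for the actual model construction:

```python
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = build_model().cuda(local_rank)  # build_model() is a placeholder, not a real API
model = torch.nn.parallel.DistributedDataParallel(
    model,
    device_ids=[local_rank],
    broadcast_buffers=False,  # skip the buffer sync that triggers the in-place error
)
```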
System Info
transformers 4.28.1
torch 2.0.0
torchaudio 2.0.0
torchvision 0.15.0
huggingface-hub 0.13.4
trl 0.4.2.dev0
Who can help?
Probably people from accelerate, trainer, and text: @pacman100, @sgugger, @ArthurZucker
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
1. Use the script in trl/examples/summarization/scripts
2. Set the accelerate config like this
3. accelerate launch reward_summarization.py
This results in the following error:
Expected behavior
I expect it to run fine, but it ends in that error. Although it is not native Hugging Face code, it seems that the issue is from the gpt2 trainer code.