huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl
Apache License 2.0

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation #274

Closed: oroojlooy closed this 11 months ago

oroojlooy commented 1 year ago

I am getting the following error traceback when I run python -m torch.distributed.launch --nproc_per_node=1 reward_summarization.py --bf16 on a machine with two A10 GPUs (24 GB each). I have torch==2.0.0 installed.

I would appreciate any comments or ideas on how to fix this.

Traceback (most recent call last):
  File "/home/opc/trl/examples/summarization/scripts/reward_summarization.py", line 202, in <module>
    trainer.train(script_args.resume_from_checkpoint)
  File "/home/opc/miniconda3/lib/python3.10/site-packages/transformers/trainer.py", line 1633, in train
    return inner_training_loop(
  File "/home/opc/miniconda3/lib/python3.10/site-packages/transformers/trainer.py", line 1902, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/opc/miniconda3/lib/python3.10/site-packages/transformers/trainer.py", line 2663, in training_step
    loss.backward()
  File "/home/opc/miniconda3/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/home/opc/miniconda3/lib/python3.10/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [CUDABoolType [1, 1, 377, 377]] is at version 3; expected version 2 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
wandb: Waiting for W&B process to finish... (failed 1).
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/opc/trl/examples/summarization/scripts/wandb/offline-run-20230404_175237-0r3498mc
wandb: Find logs at: ./wandb/offline-run-20230404_175237-0r3498mc/logs
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1902146) of binary: /home/opc/miniconda3/bin/python
Traceback (most recent call last):
  File "/home/opc/miniconda3/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/opc/miniconda3/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/opc/miniconda3/lib/python3.10/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/home/opc/miniconda3/lib/python3.10/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/home/opc/miniconda3/lib/python3.10/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/home/opc/miniconda3/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/opc/miniconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/opc/miniconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
reward_summarization.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-04-04_17:52:47
  host      : instance-20230329-1307.subnet03291319.vcn03291319.oraclevcn.com
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1902146)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
younesbelkada commented 1 year ago

Hi @oroojlooy! Thanks for the issue. I think you should run the script with accelerate launch instead; first run:

accelerate config

And make sure to select the multi-node setup! cc @lvwerra, who has some experience with multi-node training using trl.
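(After answering the config questions, launching would then look something like the line below; the exact flags depend on the chosen setup.)

accelerate launch reward_summarization.py --bf16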

oroojlooy commented 1 year ago

Hi @younesbelkada! I am not using both GPUs, so I was not sure whether I needed to use accelerate launch. I am getting the error with --nproc_per_node=1. Also, I took the run command from the README of the corresponding example in the TRL package.

Do you think the issue is because the machine has two GPUs available? If so, would setting CUDA_VISIBLE_DEVICES help?
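(For what it's worth, a minimal sketch of hiding the second GPU from the script; the variable has to be set before CUDA is initialized, and setting it in the shell before launching works just as well.)

import os

# Expose only the first GPU to this process; must run before any CUDA call.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
print(torch.cuda.device_count())  # should now report 1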

bingjie3216 commented 1 year ago

I don't think it is related to accelerate launch; I ran into the same issue while using the GPT2 and GPT2-medium models.

oroojlooy commented 1 year ago

@bingjie3216 @lvwerra @younesbelkada Would you mind sharing the versions of the Python packages (torch, accelerate, deepspeed, transformers, etc.) with which the TRL examples work for you?

seirasto commented 1 year ago

I am also running into this error with reward_summarization.py using the following command:

python -m torch.distributed.run --nproc_per_node=1 /dccstor/srosent2/trl/trl/examples/summarization/scripts/reward_summarization.py --bf16

python=3.10.0, torch=2.0.0, transformers=4.28.1, cuda 12

I enabled anomaly detection and it complained about this line in modeling_gpt2.py

line 201: attn_weights = torch.where(causal_mask, attn_weights.to(attn_weights.dtype), mask_value)

Any suggestions?

oroojlooy commented 1 year ago

@seirasto torch.autograd.set_detect_anomaly(True) gives the same line to me as well.
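(For anyone else chasing this, a minimal sketch of how to enable it around the trainer; trainer and script_args here are the names used in reward_summarization.py. Anomaly detection slows training considerably, so it is for debugging only.)

import torch

# Make the backward error point at the forward-pass op that produced it.
torch.autograd.set_detect_anomaly(True)

trainer.train(script_args.resume_from_checkpoint)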

seirasto commented 1 year ago

It looks like we are facing the exact same issue. Are you using the same package versions? It would be great if someone could share a set of versions that works.

oroojlooy commented 1 year ago

I am running it with Python 3.8.16 and CUDA 11.7. My package versions are:

seirasto commented 1 year ago

I was able to get around the bug by modifying the problematic line in modeling_gpt2.py to use clone(), so that no in-place operations occur there:

attn_weights = torch.where(causal_mask.clone(), attn_weights.to(attn_weights.dtype).clone(), mask_value)
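(For context, the surrounding code in GPT2Attention._attn looks roughly like the sketch below in transformers 4.28; treat the exact lines as approximate. The key point is that causal_mask is a slice of the module's persistent self.bias buffer rather than a tensor created per call.)

if not self.is_cross_attention:
    query_length, key_length = query.size(-2), key.size(-2)
    # causal_mask is a view into the shared self.bias buffer of the attention module
    causal_mask = self.bias[:, :, key_length - query_length : key_length, :key_length]
    mask_value = torch.finfo(attn_weights.dtype).min
    mask_value = torch.full([], mask_value, dtype=attn_weights.dtype, device=attn_weights.device)
    # patched line: clone() so torch.where does not save views of shared state
    attn_weights = torch.where(causal_mask.clone(), attn_weights.to(attn_weights.dtype).clone(), mask_value)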

oroojlooy commented 1 year ago

@seirasto Thanks for letting me know! Do you have any intuition for why the clone() on causal_mask is required? It does not have any direct relationship with attn_weights, so it seems it should not affect the gradient there.

seirasto commented 1 year ago

No, but I tried clone() on just attn_weights and it didn't work. I haven't tried clone() on just causal_mask.

oroojlooy commented 1 year ago

I tried it on everything except causal_mask and it did not work; that is why I asked about the intuition there.
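One way to see why the mask matters at all: torch.where saves its condition tensor for the backward pass, and autograd version-checks every saved tensor, including boolean ones that carry no gradient themselves. A minimal, self-contained sketch of that mechanism (not the actual GPT-2 code):

import torch

mask = torch.tensor([True, True, False, False])
x = torch.randn(4, requires_grad=True)

# torch.where saves `mask` for backward so it can route the gradients.
y = torch.where(mask, x, torch.zeros(()))

# An in-place write to the saved mask bumps its version counter...
mask[2:] = True

# ...so backward() raises the same "modified by an inplace operation" error.
y.sum().backward()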

dayL-W commented 1 year ago

same error

oliu-io commented 1 year ago

I don't have a clear understanding of the cause of this issue per se, but the problem stems from the fact that we run two forward passes (for rewards_j and rewards_k, respectively) to compute the loss function, and somehow GPT-2 doesn't like that. Here's a minimal workaround that doesn't involve making changes to transformers.models:
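A sketch of one workaround in that spirit (not the exact snippet from this comment), assuming the paired input_ids_j / input_ids_k batch keys from reward_summarization.py and that both halves are padded to the same sequence length: run a single forward pass on the concatenated batch and split the rewards afterwards.

import torch
import torch.nn as nn

def compute_loss(model, inputs):
    # One forward pass over the chosen (j) and rejected (k) examples together,
    # instead of two separate passes through the reward model.
    batch_size = inputs["input_ids_j"].shape[0]
    input_ids = torch.cat([inputs["input_ids_j"], inputs["input_ids_k"]], dim=0)
    attention_mask = torch.cat([inputs["attention_mask_j"], inputs["attention_mask_k"]], dim=0)

    rewards = model(input_ids=input_ids, attention_mask=attention_mask)[0]
    rewards_j, rewards_k = rewards[:batch_size], rewards[batch_size:]

    # Pairwise ranking loss: the chosen summary should receive the higher reward.
    return -nn.functional.logsigmoid(rewards_j - rewards_k).mean()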

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

garrett361 commented 1 year ago

Just noting that I am also hitting the same in-place issues with the same models, and (very oddly) it only happens when using DDP. Single-GPU, single-node runs raise no error.

younesbelkada commented 1 year ago

Planning to do a deep dive into distributed-training issues over the next few weeks; assigning this to myself.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

younesbelkada commented 1 year ago

Hi there, I believe this is now fixed on transformers, trl, and peft main. Please have a look at this comment for how to fix the issue: https://github.com/huggingface/trl/issues/835#issuecomment-1787704901