huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl
Apache License 2.0

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation #285

Closed bingjie3216 closed 1 year ago

bingjie3216 commented 1 year ago

To reproduce: torchrun reward_summarization.py

Details:

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.
Could not estimate the number of tokens of the input, floating-point operations will not be computed
[W reducer.cpp:1300] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
Traceback (most recent call last):
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/jbing-gpu4/code/Users/jbing/code/trl/examples/summarization/scripts/reward_summarization.py", line 202, in <module>
    trainer.train(script_args.resume_from_checkpoint)
  File "/anaconda/envs/rlhf/lib/python3.10/site-packages/transformers/trainer.py", line 1633, in train
    return inner_training_loop(
  File "/anaconda/envs/rlhf/lib/python3.10/site-packages/transformers/trainer.py", line 1902, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/anaconda/envs/rlhf/lib/python3.10/site-packages/transformers/trainer.py", line 2663, in training_step
    loss.backward()
  File "/anaconda/envs/rlhf/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/anaconda/envs/rlhf/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [CUDABoolType [1, 1, 448, 448]] is at version 3; expected version 2 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
  0%|          | 0/7255 [00:04<?, ?it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 20477) of binary: /anaconda/envs/rlhf/bin/python
Traceback (most recent call last):
  File "/anaconda/envs/rlhf/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/anaconda/envs/rlhf/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/anaconda/envs/rlhf/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/anaconda/envs/rlhf/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/anaconda/envs/rlhf/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/anaconda/envs/rlhf/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

reward_summarization.py FAILED

Failures:
  <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
  time      : 2023-04-09_04:00:55
  host      : localhost
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 20477)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
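
For reference, the hint at the end of the RuntimeError can be acted on directly: wrapping the training call in anomaly detection makes the failing backward also print a second traceback that points at the forward-pass operation whose output was later modified in place. A minimal debugging sketch, assuming the same trainer and script_args objects already built in reward_summarization.py (anomaly detection is noticeably slower, so enable it only while debugging):

import torch

# Debugging aid, not a fix: anomaly detection records the forward-pass stack
# for every op, so the in-place-modification error above also reports where
# the offending tensor was created.
with torch.autograd.set_detect_anomaly(True):
    trainer.train(script_args.resume_from_checkpoint)

The separate DDP warning about find_unused_parameters=True concerns performance, not the crash; if I recall correctly, it can be turned off by passing ddp_find_unused_parameters=False in the TrainingArguments.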
bingjie3216 commented 1 year ago

The same issue happens if I try to run the stack_llama reward modeling script:

To reproduce: torchrun --nnodes 1 --nproc_per_node 8 examples/stack_llama/scripts/reward_modeling.py --model_name=gpt2-medium

Details:

Traceback (most recent call last):
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/jbing-gpu5/code/Users/jbing/code/trl/examples/stack_llama/scripts/reward_modeling.py", line 283, in <module>
    trainer.train(script_args.resume_from_checkpoint)
  File "/anaconda/envs/rlhf/lib/python3.10/site-packages/transformers/trainer.py", line 1633, in train
    return inner_training_loop(
  File "/anaconda/envs/rlhf/lib/python3.10/site-packages/transformers/trainer.py", line 1902, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/anaconda/envs/rlhf/lib/python3.10/site-packages/transformers/trainer.py", line 2655, in training_step
    self.scaler.scale(loss).backward()
  File "/anaconda/envs/rlhf/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/anaconda/envs/rlhf/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [CUDABoolType [1, 1, 397, 397]] is at version 3; expected version 2 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
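
Not a fix, but to clarify what the message means: autograd saved a tensor during the forward pass (here a boolean [1, 1, 397, 397] tensor, presumably an attention mask) and something mutated it in place before backward ran, so its version counter no longer matches what autograd recorded. A self-contained illustration of the same failure mode, unrelated to TRL:

import torch

x = torch.randn(4, requires_grad=True)
y = torch.sigmoid(x)  # sigmoid saves its output y for the backward pass
y.add_(1)             # in-place op bumps y's version counter, invalidating the saved tensor
y.sum().backward()    # raises: one of the variables needed for gradient computation
                      # has been modified by an inplace operation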

rz-zhang commented 1 year ago

I got the same problem here. Have you solved it?

rumsyx commented 1 year ago

same here.

younesbelkada commented 1 year ago

Hi everyone, do you face the same issue when running the script with accelerate launch ...? First run:

accelerate config

follow the instructions, then:

accelerate launch reward_summarization.py

dayL-W commented 1 year ago

same here.

oliu-io commented 1 year ago

You can check out my workaround here: https://github.com/lvwerra/trl/issues/274#issuecomment-1562135869

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.