Closed harrison4ride closed 10 months ago
Hi @harrison4ride, what TRL version are you using, and can you share the full traceback of the issue? 🙏
Thank you for your reply. My TRL version is 0.7.1, and the full traceback is:
Traceback (most recent call last):
  File "/home/hxxzhang/structureLLM/ppo_single.py", line 232, in <module>
    main()
  File "/home/hxxzhang/structureLLM/ppo_single.py", line 225, in main
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
  File "/home/hxxzhang/miniconda3/lib/python3.11/contextlib.py", line 81, in inner
    return func(*args, **kwds)
  File "/home/hxxzhang/miniconda3/lib/python3.11/site-packages/trl/trainer/ppo_trainer.py", line 802, in step
    stats = self.gather_stats(stats)
  File "/home/hxxzhang/miniconda3/lib/python3.11/site-packages/trl/trainer/ppo_trainer.py", line 879, in gather_stats
    dist.all_reduce(v, dist.ReduceOp.SUM)
  File "/home/hxxzhang/miniconda3/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1451, in wrapper
    return func(*args, **kwargs)
  File "/home/hxxzhang/miniconda3/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1700, in all_reduce
    work = default_pg.allreduce([tensor], opts)
RuntimeError: Tensors must be CUDA and dense
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 70253 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 70272 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 70281 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 70305 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 70200) of binary: /home/hxxzhang/miniconda3/bin/python
Traceback (most recent call last):
  File "/home/hxxzhang/miniconda3/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/hxxzhang/miniconda3/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/home/hxxzhang/miniconda3/lib/python3.11/site-packages/accelerate/commands/launch.py", line 985, in launch_command
    multi_gpu_launcher(args)
  File "/home/hxxzhang/miniconda3/lib/python3.11/site-packages/accelerate/commands/launch.py", line 654, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/hxxzhang/miniconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/hxxzhang/miniconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/hxxzhang/miniconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
ppo_single.py FAILED
I fixed this error by upgrading TRL to version 0.7.2. But I ran into another problem: GPU memory usage grows after each step and eventually causes an OOM error. Is it normal for GPU memory usage to increase at every step, and are there ways to reduce memory usage besides lowering the batch_size?
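(For context, the original error most likely happens because `gather_stats` calls `dist.all_reduce` on CPU tensors while the default multi-GPU backend, NCCL, only accepts CUDA tensors. Below is a minimal sketch of the kind of fix later TRL releases apply, not the actual library code; `gather_stats_on_gpu` is a hypothetical helper name.)

```python
import torch
import torch.distributed as dist


def gather_stats_on_gpu(stats: dict, device: torch.device) -> dict:
    """Hypothetical helper: average PPO stats across processes.

    NCCL only supports CUDA tensors, so any CPU tensor must be moved to the
    current device before dist.all_reduce, otherwise it raises
    "RuntimeError: Tensors must be CUDA and dense".
    """
    reduced = {}
    for name, value in stats.items():
        if isinstance(value, torch.Tensor):
            value = value.to(device)                   # ensure the tensor lives on the GPU
            dist.all_reduce(value, dist.ReduceOp.SUM)  # sum across all ranks
            value /= dist.get_world_size()             # turn the sum into a mean
        reduced[name] = value
    return reduced
```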
I'm also having OOM issues with trl 0.7.2 (and with 0.7.3 and 0.7.4). Only versions up to 0.7.1 are free of this memory issue.
Good to know. I fixed the OOM by emptying the CUDA cache after each step with torch.cuda.empty_cache().
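For anyone hitting the same issue, here is a minimal sketch of where that call can sit in a PPO training loop. It assumes a configured `ppo_trainer`, `dataloader`, and `generation_kwargs` as in the TRL PPO examples; `compute_rewards` is a placeholder for your own reward computation.

```python
import gc
import torch

for batch in dataloader:
    query_tensors = batch["input_ids"]  # list of 1-D query tensors

    # Generate responses and score them (compute_rewards stands in for your reward model).
    response_tensors = ppo_trainer.generate(
        query_tensors, return_prompt=False, **generation_kwargs
    )
    rewards = compute_rewards(batch, response_tensors)

    # One PPO optimisation step.
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)

    # Workaround for the growing memory usage: free cached blocks after every step.
    gc.collect()
    torch.cuda.empty_cache()
```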
I'm encountering the following error when running my code (see below) with multiple GPUs (a single GPU works fine).
My code:
`