Describe the issue

Issue: pretrain.sh trains successfully, but finetune_full_schedule.sh exceeds the process memory limit on V100 GPUs. Is there any way to solve this problem?

Command:

Log:

Traceback (most recent call last):
    trainer.train()
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/transformers/trainer.py", line 1656, in _inner_training_loop
    model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/accelerate/accelerator.py", line 1198, in prepare
    result = self._prepare_deepspeed(*args)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/accelerate/accelerator.py", line 1537, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/__init__.py", line 165, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 309, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1184, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1419, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer(
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 372, in __init__
    dist.barrier()
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 116, in log_wrapper
    return func(*args, **kwargs)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 394, in barrier
    return cdb.barrier(group=group, async_op=async_op)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 225, in barrier
    return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3330, in barrier
    work = group.barrier(opts=opts)
RuntimeError: [7] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Broken pipe. This may indicate a possible application crash on rank 0 or a network set up issue.
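For context: the NCCL broken-pipe error on rank [7] looks like a symptom rather than the cause. It shows up because rank 0 has already gone away, most likely killed when memory was exceeded while DeepSpeed was building the ZeRO optimizer state. One thing I am considering is offloading optimizer and parameter state to CPU. Below is a minimal sketch of a ZeRO-3 CPU-offload DeepSpeed config; the file name (zero3_offload.json) and the specific values are my assumptions, not taken from this repo's scripts:

    {
      "fp16": { "enabled": "auto" },
      "bf16": { "enabled": "auto" },
      "train_batch_size": "auto",
      "train_micro_batch_size_per_gpu": "auto",
      "gradient_accumulation_steps": "auto",
      "zero_optimization": {
        "stage": 3,
        "offload_optimizer": { "device": "cpu", "pin_memory": true },
        "offload_param": { "device": "cpu", "pin_memory": true },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "stage3_gather_16bit_weights_on_model_save": true
      }
    }

The fine-tune script would then be pointed at it via --deepspeed ./scripts/zero3_offload.json (path assumed). Since V100 has no bf16 support, fp16 is what would actually be enabled; lowering --per_device_train_batch_size and keeping --gradient_checkpointing on are the other standard levers. Is something like this the recommended fix here?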