shibing624 / MedicalGPT

MedicalGPT: Training Your Own Medical GPT Model with ChatGPT Training Pipeline. Trains medical large language models, implementing incremental pretraining (PT), supervised fine-tuning (SFT), RLHF, DPO, and ORPO.
Apache License 2.0

Incremental pretraining in DeepSpeed mode: a small model runs, but a larger model fails with an error #317

Closed: listwebit closed this issue 5 months ago

listwebit commented 5 months ago

Describe the bug

Machine configuration: 8 × 80 GB A800 GPUs. Following the author's hints in the issues, I modified deepspeed_zero_stage2_config.json as follows:

 "fp16": {
        "enabled": false,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1

Using a smaller model such as chatglm-6B, full-parameter training through DeepSpeed runs fine with either ZeRO-1 or ZeRO-2. Switching to Yi-34B-Chat, full-parameter training reports insufficient GPU memory under both ZeRO-1 and ZeRO-2. Is a single 8-GPU server really not enough for full-parameter incremental pretraining of a 34B model? run_pt.sh configuration:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node 8  pretraining.py \
    --model_type auto \
    --model_name_or_path ../Yi-34B-Chat \
    --train_file_dir ./data/pretrain \
    --validation_file_dir ./data/pretrain \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --do_train \
    --do_eval \
    --use_peft False \
    --seed 42 \
    --max_train_samples 10000 \
    --max_eval_samples 10 \
    --num_train_epochs 0.5 \
    --learning_rate 2e-4 \
    --warmup_ratio 0.05 \
    --weight_decay 0.01 \
    --logging_strategy steps \
    --logging_steps 10 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --save_steps 500 \
    --save_strategy steps \
    --save_total_limit 13 \
    --gradient_accumulation_steps 1 \
    --preprocessing_num_workers 10 \
    --block_size 512 \
    --group_by_length True \
    --output_dir outputs-pt-Yi-v4 \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --target_modules all \
    --torch_dtype bfloat16 \
    --bf16 \
    --device_map auto \
    --report_to tensorboard \
    --ddp_find_unused_parameters False \
    --gradient_checkpointing True \
    --deepspeed deepspeed_zero_stage2_config.json \
    --cache_dir ./cache

The error output is as follows:

2024-01-24 10:35:03.180 | INFO     | __main__:main:723 - *** Train ***
2024-01-24 10:35:05.503 | DEBUG    | __main__:main:724 - Train dataloader example: {'input_ids': tensor([[59652,   101, 59678,  ..., 60485, 60688, 59828],
        [  106, 59810, 40183,  ..., 59652,  7770, 61186]], device='cuda:2'), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1]], device='cuda:2'), 'labels': tensor([[59652,   101, 59678,  ..., 60485, 60688, 59828],
        [  106, 59810, 40183,  ..., 59652,  7770, 61186]], device='cuda:2')}
2024-01-24 10:35:05.857 | DEBUG    | __main__:main:724 - Train dataloader example: {'input_ids': tensor([[59778,   102,  3215,  ..., 37469, 59599,  5845],
        [59722,   101, 60079,  ...,   101, 59784, 23307]], device='cuda:3'), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1]], device='cuda:3'), 'labels': tensor([[59778,   102,  3215,  ..., 37469, 59599,  5845],
        [59722,   101, 60079,  ...,   101, 59784, 23307]], device='cuda:3')}
2024-01-24 10:35:06.311 | DEBUG    | __main__:main:724 - Train dataloader example: {'input_ids': tensor([[59693, 59780,    81,  ...,   101, 28770, 60158],
        [20438, 59748,  5487,  ..., 59773,   101, 59818]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1]], device='cuda:0'), 'labels': tensor([[59693, 59780,    81,  ...,   101, 28770, 60158],
        [20438, 59748,  5487,  ..., 59773,   101, 59818]], device='cuda:0')}
2024-01-24 10:35:06.671 | DEBUG    | __main__:main:724 - Train dataloader example: {'input_ids': tensor([[59740, 59795,  9843,  ..., 59626, 59763, 60025],
        [28540, 60660, 60054,  ...,    77,    77,    79]], device='cuda:4'), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1]], device='cuda:4'), 'labels': tensor([[59740, 59795,  9843,  ..., 59626, 59763, 60025],
        [28540, 60660, 60054,  ...,    77,    77,    79]], device='cuda:4')}
2024-01-24 10:35:06.687 | DEBUG    | __main__:main:724 - Train dataloader example: {'input_ids': tensor([[61196, 60113,   105,  ...,   536,   457,   457],
        [21211,   101,  6427,  ..., 59728, 10856, 59676]], device='cuda:7'), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1]], device='cuda:7'), 'labels': tensor([[61196, 60113,   105,  ...,   536,   457,   457],
        [21211,   101,  6427,  ..., 59728, 10856, 59676]], device='cuda:7')}
2024-01-24 10:35:06.842 | DEBUG    | __main__:main:724 - Train dataloader example: {'input_ids': tensor([[59652,  4387, 12486,  ..., 60158,  4898, 21567],
        [59604,   106,    78,  ..., 60545, 60164, 61068]], device='cuda:6'), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1]], device='cuda:6'), 'labels': tensor([[59652,  4387, 12486,  ..., 60158,  4898, 21567],
        [59604,   106,    78,  ..., 60545, 60164, 61068]], device='cuda:6')}
2024-01-24 10:35:06.904 | DEBUG    | __main__:main:724 - Train dataloader example: {'input_ids': tensor([[ 7019,  8635,  6571,  ..., 24141, 21391,   102],
        [  100, 59568, 60509,  ..., 60781,   102,    80]], device='cuda:1'), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1]], device='cuda:1'), 'labels': tensor([[ 7019,  8635,  6571,  ..., 24141, 21391,   102],
        [  100, 59568, 60509,  ..., 60781,   102,    80]], device='cuda:1')}
2024-01-24 10:35:06.912 | DEBUG    | __main__:main:724 - Train dataloader example: {'input_ids': tensor([[ 4202,   101,  2363,  ..., 59932, 41236, 59599],
        [59594,    85, 23340,  ...,  2363,  2480,  2598]], device='cuda:5'), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1]], device='cuda:5'), 'labels': tensor([[ 4202,   101,  2363,  ..., 59932, 41236, 59599],
        [59594,    85, 23340,  ...,  2363,  2480,  2598]], device='cuda:5')}
Traceback (most recent call last):
  File "pretraining.py", line 767, in <module>
    main()
  File "pretraining.py", line 728, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/transformers/trainer.py", line 1672, in _inner_training_loop
    model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1227, in prepare
    result = tuple(
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1228, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1104, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1355, in prepare_model
    model = torch.nn.parallel.DistributedDataParallel(
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 809, in __init__
    self._ddp_init_helper(
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1098, in _ddp_init_helper
    self.reducer = dist.Reducer(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 280.00 MiB. GPU 7 has a total capacty of 79.33 GiB of which 187.06 MiB is free. Process 3001206 has 6.43 GiB memory in use. Process 3007944 has 3.10 GiB memory in use. Including non-PyTorch memory, this process has 69.60 GiB memory in use. Of the allocated memory 68.69 GiB is allocated by PyTorch, and 6.23 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "pretraining.py", line 767, in <module>
    main()
  File "pretraining.py", line 728, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/transformers/trainer.py", line 1672, in _inner_training_loop
    model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1227, in prepare
    result = tuple(
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1228, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1104, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1355, in prepare_model
    model = torch.nn.parallel.DistributedDataParallel(
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 809, in __init__
    self._ddp_init_helper(
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1098, in _ddp_init_helper
    self.reducer = dist.Reducer(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 280.00 MiB. GPU 6 has a total capacty of 79.33 GiB of which 229.81 MiB is free. Including non-PyTorch memory, this process has 79.09 GiB memory in use. Of the allocated memory 78.04 GiB is allocated by PyTorch, and 53.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "pretraining.py", line 767, in <module>
    main()
  File "pretraining.py", line 728, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/transformers/trainer.py", line 1672, in _inner_training_loop
    model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1227, in prepare
    result = tuple(
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1228, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1104, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1355, in prepare_model
    model = torch.nn.parallel.DistributedDataParallel(
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 809, in __init__
    self._ddp_init_helper(
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1098, in _ddp_init_helper
    self.reducer = dist.Reducer(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 280.00 MiB. GPU 1 has a total capacty of 79.33 GiB of which 190.44 MiB is free. Process 2993049 has 3.16 GiB memory in use. Including non-PyTorch memory, this process has 75.97 GiB memory in use. Of the allocated memory 74.93 GiB is allocated by PyTorch, and 48.07 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "pretraining.py", line 767, in <module>
    main()
  File "pretraining.py", line 728, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/transformers/trainer.py", line 1672, in _inner_training_loop
    model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1227, in prepare
    result = tuple(
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1228, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1104, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1355, in prepare_model
    model = torch.nn.parallel.DistributedDataParallel(
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 809, in __init__
    self._ddp_init_helper(
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1098, in _ddp_init_helper
    self.reducer = dist.Reducer(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 280.00 MiB. GPU 5 has a total capacty of 79.33 GiB of which 259.81 MiB is free. Including non-PyTorch memory, this process has 79.06 GiB memory in use. Of the allocated memory 78.04 GiB is allocated by PyTorch, and 23.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "pretraining.py", line 767, in <module>
    main()
  File "pretraining.py", line 728, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/transformers/trainer.py", line 1672, in _inner_training_loop
    model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1227, in prepare
    result = tuple(
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1228, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1104, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1355, in prepare_model
    model = torch.nn.parallel.DistributedDataParallel(
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 809, in __init__
    self._ddp_init_helper(
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1098, in _ddp_init_helper
    self.reducer = dist.Reducer(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 280.00 MiB. GPU 3 has a total capacty of 79.33 GiB of which 259.81 MiB is free. Including non-PyTorch memory, this process has 79.06 GiB memory in use. Of the allocated memory 78.04 GiB is allocated by PyTorch, and 23.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "pretraining.py", line 767, in <module>
    main()
  File "pretraining.py", line 728, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/transformers/trainer.py", line 1672, in _inner_training_loop
    model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1227, in prepare
    result = tuple(
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1228, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1104, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1355, in prepare_model
    model = torch.nn.parallel.DistributedDataParallel(
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 809, in __init__
    self._ddp_init_helper(
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1098, in _ddp_init_helper
    self.reducer = dist.Reducer(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 98.00 MiB. GPU 0 has a total capacty of 79.33 GiB of which 45.81 MiB is free. Including non-PyTorch memory, this process has 79.27 GiB memory in use. Of the allocated memory 78.32 GiB is allocated by PyTorch, and 53.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "pretraining.py", line 767, in <module>
    main()
  File "pretraining.py", line 728, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/transformers/trainer.py", line 1672, in _inner_training_loop
    model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1227, in prepare
    result = tuple(
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1228, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1104, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1355, in prepare_model
    model = torch.nn.parallel.DistributedDataParallel(
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 809, in __init__
    self._ddp_init_helper(
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1098, in _ddp_init_helper
    self.reducer = dist.Reducer(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 280.00 MiB. GPU 4 has a total capacty of 79.33 GiB of which 229.81 MiB is free. Including non-PyTorch memory, this process has 79.09 GiB memory in use. Of the allocated memory 78.04 GiB is allocated by PyTorch, and 53.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "pretraining.py", line 767, in <module>
    main()
  File "pretraining.py", line 728, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/transformers/trainer.py", line 1672, in _inner_training_loop
    model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1227, in prepare
    result = tuple(
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1228, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1104, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1355, in prepare_model
    model = torch.nn.parallel.DistributedDataParallel(
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 809, in __init__
    self._ddp_init_helper(
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1098, in _ddp_init_helper
    self.reducer = dist.Reducer(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 280.00 MiB. GPU 2 has a total capacty of 79.33 GiB of which 259.81 MiB is free. Including non-PyTorch memory, this process has 79.06 GiB memory in use. Of the allocated memory 78.04 GiB is allocated by PyTorch, and 23.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2024-01-24 10:35:16,640] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 3467576) of binary: /home/centos/anaconda3/envs/cpt/bin/python
Traceback (most recent call last):
  File "/home/centos/anaconda3/envs/cpt1/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
pretraining.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-01-24_10:35:16
  host      : host188
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 3467577)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-01-24_10:35:16
  host      : host188
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 3467578)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-01-24_10:35:16
  host      : host188
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 3467579)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2024-01-24_10:35:16
  host      : host188
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 3467580)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2024-01-24_10:35:16
  host      : host188
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 3467581)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2024-01-24_10:35:16
  host      : host188
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 3467582)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
  time      : 2024-01-24_10:35:16
  host      : host188
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 3467583)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-01-24_10:35:16
  host      : host188
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3467576)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

If this is not a code issue, how many machines would it take to get this running?

shibing624 commented 5 months ago

Isn't this written in the README?

(image: screenshot from the README)

600 GB of GPU memory.
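
For a rough sense of where a number like that comes from, here is a back-of-the-envelope estimate (an assumption, not the README's exact accounting) for full-parameter training of a ~34B model with AdamW in mixed precision, which needs roughly 16 bytes of model state per parameter:

    34B params × 2 bytes  (bf16 weights)                        ≈  68 GB
    34B params × 2 bytes  (bf16 gradients)                      ≈  68 GB
    34B params × 12 bytes (fp32 master weights + Adam moments)  ≈ 408 GB
    ------------------------------------------------------------------
    model states, before activations and buffers                ≈ 544 GB

Activations, buffers, and fragmentation push that toward the ~600 GB figure. ZeRO-1 and ZeRO-2 also keep a full bf16 parameter copy (~68 GB) on every GPU, so once gradient and optimizer shards plus activations are added, a single 80 GB card overflows, which is consistent with the OOM traces above.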