CUDA_VISIBLE_DEVICES=0,1,2,3 python pretraining.py
I have already set this, but I still get the same error:
Traceback (most recent call last):
File "pretraining.py", line 767, in <module>
main()
File "pretraining.py", line 728, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/transformers/trainer.py", line 1537, in train
return inner_training_loop(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/transformers/trainer.py", line 1672, in _inner_training_loop
model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1227, in prepare
result = tuple(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1228, in <genexpr>
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1104, in _prepare_one
return self.prepare_model(obj, device_placement=device_placement)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1355, in prepare_model
model = torch.nn.parallel.DistributedDataParallel(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 809, in __init__
self._ddp_init_helper(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1098, in _ddp_init_helper
self.reducer = dist.Reducer(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.05 GiB. GPU 1 has a total capacty of 79.33 GiB of which 11.05 GiB is free. Process 2993049 has 3.10 GiB memory in use. Including non-PyTorch memory, this process has 65.17 GiB memory in use. Of the allocated memory 64.17 GiB is allocated by PyTorch, and 323.50 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
File "pretraining.py", line 767, in <module>
main()
File "pretraining.py", line 728, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/transformers/trainer.py", line 1537, in train
return inner_training_loop(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/transformers/trainer.py", line 1672, in _inner_training_loop
model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1227, in prepare
result = tuple(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1228, in <genexpr>
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1104, in _prepare_one
return self.prepare_model(obj, device_placement=device_placement)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1355, in prepare_model
model = torch.nn.parallel.DistributedDataParallel(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 809, in __init__
self._ddp_init_helper(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1098, in _ddp_init_helper
self.reducer = dist.Reducer(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.05 GiB. GPU 2 has a total capacty of 79.33 GiB of which 14.15 GiB is free. Including non-PyTorch memory, this process has 65.17 GiB memory in use. Of the allocated memory 64.17 GiB is allocated by PyTorch, and 323.50 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
File "pretraining.py", line 767, in <module>
main()
File "pretraining.py", line 728, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/transformers/trainer.py", line 1537, in train
return inner_training_loop(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/transformers/trainer.py", line 1672, in _inner_training_loop
model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1227, in prepare
result = tuple(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1228, in <genexpr>
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1104, in _prepare_one
return self.prepare_model(obj, device_placement=device_placement)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1355, in prepare_model
model = torch.nn.parallel.DistributedDataParallel(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 809, in __init__
self._ddp_init_helper(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1098, in _ddp_init_helper
self.reducer = dist.Reducer(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.05 GiB. GPU 3 has a total capacty of 79.33 GiB of which 14.24 GiB is free. Including non-PyTorch memory, this process has 65.07 GiB memory in use. Of the allocated memory 64.17 GiB is allocated by PyTorch, and 323.50 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
File "pretraining.py", line 767, in <module>
main()
File "pretraining.py", line 728, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/transformers/trainer.py", line 1537, in train
return inner_training_loop(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/transformers/trainer.py", line 1672, in _inner_training_loop
model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1227, in prepare
result = tuple(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1228, in <genexpr>
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1104, in _prepare_one
return self.prepare_model(obj, device_placement=device_placement)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1355, in prepare_model
model = torch.nn.parallel.DistributedDataParallel(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 809, in __init__
self._ddp_init_helper(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1098, in _ddp_init_helper
self.reducer = dist.Reducer(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.05 GiB. GPU 0 has a total capacty of 79.33 GiB of which 14.24 GiB is free. Including non-PyTorch memory, this process has 65.07 GiB memory in use. Of the allocated memory 64.17 GiB is allocated by PyTorch, and 323.50 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2024-01-23 15:22:18,696] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 3272628) of binary: /home/centos/anaconda3/envs/cpt/bin/python
Traceback (most recent call last):
File "/home/centos/anaconda3/envs/cpt/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
pretraining.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-01-23_15:22:18
host : host188
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 3272629)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-01-23_15:22:18
host : host188
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 3272630)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-01-23_15:22:18
host : host188
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 3272631)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-01-23_15:22:18
host : host188
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 3272628)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
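For reference, each OOM message above ends by suggesting the allocator setting via PYTORCH_CUDA_ALLOC_CONF. A minimal way to try it is to export the variable before the launch command; the 128 MB value below is only an illustrative starting point, and fragmentation tuning alone cannot satisfy a single 64.05 GiB allocation:

# Illustrative: pass the allocator hint from the OOM message to every rank
PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 \
CUDA_VISIBLE_DEVICES=2,3 torchrun --nproc_per_node 2 pretraining.py ...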
Describe the bug
The run_pt script is as follows:
CUDA_VISIBLE_DEVICES=2,3 torchrun --nproc_per_node 2 pretraining.py \
    --model_type auto \
    --model_name_or_path ../Yi-34B-Chat \
    --train_file_dir ./data/pretrain \
    --validation_file_dir ./data/pretrain \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --do_train \
    --do_eval \
    --use_peft False \
    --seed 42 \
    --max_train_samples 10000 \
    --max_eval_samples 10 \
    --num_train_epochs 0.5 \
    --learning_rate 2e-4 \
    --warmup_ratio 0.05 \
    --weight_decay 0.01 \
    --logging_strategy steps \
    --logging_steps 10 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --save_steps 500 \
    --save_strategy steps \
    --save_total_limit 13 \
    --gradient_accumulation_steps 1 \
    --preprocessing_num_workers 10 \
    --block_size 16 \
    --group_by_length True \
    --output_dir outputs-pt-Yi-v3 \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --target_modules all \
    --torch_dtype bfloat16 \
    --bf16 \
    --device_map auto \
    --report_to tensorboard \
    --ddp_find_unused_parameters False \
    --gradient_checkpointing True \
    --cache_dir ./cache
The error reported is shown above.
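As a quick sanity check that the CUDA_VISIBLE_DEVICES mask is actually taking effect (the visible GPUs are renumbered 0..N-1 inside the process), the device count can be checked from Python; this one-liner is purely illustrative:

CUDA_VISIBLE_DEVICES=2,3 python -c "import torch; print(torch.cuda.device_count())"  # prints 2 if the mask is applied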
One more question: what does --block_size actually do? GPU memory usage can already be adjusted via batch_size, so why is this parameter needed as well?
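For context on --block_size: in Hugging Face-style pretraining scripts like this one, block_size is typically the token sequence length that the tokenized corpus is concatenated and chunked into, so it controls per-sample activation memory independently of batch_size. Below is a minimal sketch of that grouping step in the style of the run_clm example; the names are illustrative and not necessarily identical to what pretraining.py does internally:

# Illustrative sketch of the usual group_texts step in HF pretraining scripts.
# block_size is the fixed sequence length each training example is cut to.
from itertools import chain

block_size = 16  # value from the run_pt script above; typical values are 512-4096

def group_texts(examples):
    # Concatenate all tokenized texts, then split them into block_size chunks.
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    total_length = (total_length // block_size) * block_size  # drop the remainder
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()  # causal LM: labels = inputs
    return result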