shibing624 / MedicalGPT

MedicalGPT: Training Your Own Medical GPT Model with ChatGPT Training Pipeline. Trains medical large language models, implementing incremental pretraining (PT), supervised fine-tuning (SFT), RLHF, DPO, and ORPO.
Apache License 2.0

Help: bug when pretraining on 2 machines with 16 GPUs, please take a look #318

Closed listwebit closed 4 months ago

listwebit commented 5 months ago

Describe the bug

Machines: 2 nodes, each with 8 × 80 GB A800 GPUs, 1280 GB of GPU memory in total. Using DeepSpeed, both stage 2 and stage 3 report insufficient GPU memory. Pretraining mode: full-parameter update. Model: Yi-34B-Chat. Below are the run_pt.sh scripts, the ds_report output, and the error messages. The deepspeed_zero_stage2_config.json has also been modified as follows (not sure whether the change is correct):


"fp16": {
        "enabled": false,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
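
(Note: the launch command below passes --bf16, so the fp16 block above stays disabled and the config also needs a matching bf16 section. A minimal sketch of that section for the Hugging Face Trainer integration, where "auto" picks up the value from the command-line flags; this is an illustration, not the repo's shipped config:)

"bf16": {
    "enabled": "auto"
},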

run_pt.sh script, master node:

node_rank=0
echo ${node_rank}
master_addr="112.168.17.188"

torchrun --nproc_per_node 8  --nnodes 2 --master_addr ${master_addr} --master_port 14545 --node_rank ${node_rank} pretraining.py \
    --model_type auto \
    --model_name_or_path ../Yi-34B-Chat \
    --train_file_dir ./data/pretrain \
    --validation_file_dir ./data/pretrain \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --do_train \
    --do_eval \
    --use_peft False \
    --seed 42 \
    --max_train_samples 10000 \
    --max_eval_samples 10 \
    --num_train_epochs 0.5 \
    --learning_rate 2e-4 \
    --warmup_ratio 0.05 \
    --weight_decay 0.01 \
    --logging_strategy steps \
    --logging_steps 10 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --save_steps 500 \
    --save_strategy steps \
    --save_total_limit 13 \
    --gradient_accumulation_steps 1 \
    --preprocessing_num_workers 10 \
    --block_size 512 \
    --group_by_length True \
    --output_dir outputs-pt-Yi-v4 \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --target_modules all \
    --torch_dtype bfloat16 \
    --bf16 \
    --device_map auto \
    --report_to tensorboard \
    --ddp_find_unused_parameters False \
    --gradient_checkpointing True \
    --deepspeed deepspeed_zero_stage2_config.json \
    --cache_dir ./cache

run_pt.sh script, worker node:

node_rank=1
echo ${node_rank}
master_addr="112.168.17.188"

torchrun --nproc_per_node 8  --nnodes 2 --master_addr ${master_addr} --master_port 14545 --node_rank ${node_rank} pretraining.py \
    --model_type auto \
    --model_name_or_path ../Yi-34B-Chat \
    --train_file_dir ./data/pretrain \
    --validation_file_dir ./data/pretrain \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --do_train \
    --do_eval \
    --use_peft False \
    --seed 42 \
    --max_train_samples 10000 \
    --max_eval_samples 10 \
    --num_train_epochs 0.5 \
    --learning_rate 2e-4 \
    --warmup_ratio 0.05 \
    --weight_decay 0.01 \
    --logging_strategy steps \
    --logging_steps 10 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --save_steps 500 \
    --save_strategy steps \
    --save_total_limit 13 \
    --gradient_accumulation_steps 1 \
    --preprocessing_num_workers 10 \
    --block_size 512 \
    --group_by_length True \
    --output_dir outputs-pt-Yi-v4 \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --target_modules all \
    --torch_dtype bfloat16 \
    --bf16 \
    --device_map auto \
    --report_to tensorboard \
    --ddp_find_unused_parameters False \
    --gradient_checkpointing True \
    --deepspeed deepspeed_zero_stage2_config.json \
    --cache_dir ./cache

Both machines have the same ds_report output:

(cpt) [centos@host188 MedicalGPT]$ ds_report 
[2024-01-24 19:40:17,225] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
 [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch']
torch version .................... 2.1.2+cu121
deepspeed install path ........... ['/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.13.1, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.2
deepspeed wheel compiled w. ...... torch 2.1, cuda 12.1
shared memory (/dev/shm) size .... 1007.51 GB

In the end, both machines report insufficient GPU memory:

2024-01-24 19:37:51.644 | DEBUG    | __main__:main:724 - Train dataloader example: {'input_ids': tensor([[61196, 60113,   105,  ...,   536,   457,   457],
        [21211,   101,  6427,  ..., 59728, 10856, 59676]], device='cuda:7'), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1]], device='cuda:7'), 'labels': tensor([[61196, 60113,   105,  ...,   536,   457,   457],
        [21211,   101,  6427,  ..., 59728, 10856, 59676]], device='cuda:7')}
Traceback (most recent call last):
  File "pretraining.py", line 767, in <module>
    main()
  File "pretraining.py", line 728, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/transformers/trainer.py", line 1675, in _inner_training_loop
    model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1219, in prepare
    result = self._prepare_deepspeed(*args)
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1604, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/deepspeed/__init__.py", line 171, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 308, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1247, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1503, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer(
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 529, in __init__
    self.initialize_optimizer_states()
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 656, in initialize_optimizer_states
    single_grad_partition = torch.zeros(int(self.partition_size[i]),
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 8.01 GiB. GPU 6 has a total capacty of 79.33 GiB of which 6.37 GiB is free. Including non-PyTorch memory, this process has 72.95 GiB memory in use. Of the allocated memory 72.18 GiB is allocated by PyTorch, and 3.01 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 8.01 GiB. GPU 1 has a total capacty of 79.33 GiB of which 6.32 GiB is free. Including non-PyTorch memory, this process has 72.99 GiB memory in use. Of the allocated memory 72.18 GiB is allocated by PyTorch, and 3.01 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 8.01 GiB. GPU 4 has a total capacty of 79.33 GiB of which 6.32 GiB is free. Including non-PyTorch memory, this process has 72.99 GiB memory in use. Of the allocated memory 72.18 GiB is allocated by PyTorch, and 3.01 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "pretraining.py", line 767, in <module>
    main()
  File "pretraining.py", line 728, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/transformers/trainer.py", line 1675, in _inner_training_loop
    model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1219, in prepare
    result = self._prepare_deepspeed(*args)
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/accelerate/accelerator.py", line 1604, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/deepspeed/__init__.py", line 171, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 308, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1247, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1503, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer(
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 529, in __init__
    self.initialize_optimizer_states()
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 656, in initialize_optimizer_states
    single_grad_partition = torch.zeros(int(self.partition_size[i]),
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 8.01 GiB. GPU 5 has a total capacty of 79.33 GiB of which 6.35 GiB is free. Including non-PyTorch memory, this process has 72.97 GiB memory in use. Of the allocated memory 72.18 GiB is allocated by PyTorch, and 3.01 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2024-01-24 19:38:56,227] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 138543 closing signal SIGTERM
[2024-01-24 19:38:56,227] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 138545 closing signal SIGTERM
[2024-01-24 19:38:56,227] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 138546 closing signal SIGTERM
[2024-01-24 19:38:56,227] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 138548 closing signal SIGTERM
[2024-01-24 19:38:56,227] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 138550 closing signal SIGTERM
[2024-01-24 19:39:07,644] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 138544) of binary: /home/centos/anaconda3/envs/cpt/bin/python
Traceback (most recent call last):
  File "/home/centos/anaconda3/envs/cpt/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/centos/anaconda3/envs/cpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
pretraining.py FAILED

Could someone please point me in the right direction? Thanks.

shibing624 commented 5 months ago

Change --nproc_per_node 8 to --nproc_per_node 1 and don't use data parallelism, otherwise there won't be enough GPU memory.

Your machines are A800s, so you can train with QLoRA; a single machine with 8 GPUs is enough. See the wiki: https://github.com/shibing624/MedicalGPT/wiki/%E8%AE%AD%E7%BB%83%E5%8F%82%E6%95%B0%E8%AF%B4%E6%98%8E

shibing624 commented 5 months ago

Running it directly on 8 GPUs (single machine) will work:

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

python pretraining.py \
    --model_type auto \
    --model_name_or_path ../Yi-34B-Chat \
    --train_file_dir ./data/pretrain \
    --validation_file_dir ./data/pretrain \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --do_train \
    --do_eval \
    --use_peft False \
    --seed 42 \
    --max_train_samples 10000 \
    --max_eval_samples 10 \
    --num_train_epochs 0.5 \
    --learning_rate 2e-4 \
    --warmup_ratio 0.05 \
    --weight_decay 0.01 \
    --logging_strategy steps \
    --logging_steps 10 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --save_steps 500 \
    --save_strategy steps \
    --save_total_limit 13 \
    --gradient_accumulation_steps 1 \
    --preprocessing_num_workers 10 \
    --block_size 512 \
    --group_by_length True \
    --output_dir outputs-pt-Yi-v4 \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --target_modules all \
    --torch_dtype bfloat16 \
    --bf16 \
    --device_map auto \
    --report_to tensorboard \
    --ddp_find_unused_parameters False \
    --gradient_checkpointing True \
    --cache_dir ./cache

1. With the command above, the model is automatically run with pipeline parallelism.
2. You can also add QLoRA to save GPU memory (see the sketch below).
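
(A sketch of the QLoRA variant of the command above; only the PEFT-related flags change. The names --load_in_4bit, --lora_rank, --lora_alpha and --lora_dropout are assumptions based on the repo's other example scripts, so verify them against pretraining.py --help before running.)

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

python pretraining.py \
    --model_type auto \
    --model_name_or_path ../Yi-34B-Chat \
    --train_file_dir ./data/pretrain \
    --validation_file_dir ./data/pretrain \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --do_train \
    --do_eval \
    --use_peft True \
    --load_in_4bit True \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --target_modules all \
    --num_train_epochs 0.5 \
    --learning_rate 2e-4 \
    --block_size 512 \
    --torch_dtype bfloat16 \
    --bf16 \
    --device_map auto \
    --gradient_checkpointing True \
    --output_dir outputs-pt-Yi-qlora \
    --overwrite_output_dir \
    --cache_dir ./cache

With 4-bit loading plus LoRA adapters only the adapter weights are trained, which is what brings a 34B model within reach of a single 8-GPU node.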

listwebit commented 5 months ago

Teacher Xu, this approach does run on a single machine, but GPU utilization is far too low that way: only one card is working at any given moment, which is not acceptable for real training. I also have multiple machines and want to use them together, which is why I want to use DeepSpeed. Could you help with the questions below? Or does the current code simply not support this?

1. According to your documentation, a full-parameter update of a 30B model needs about 600 GB of GPU memory. My two machines have 1280 GB in total, so in theory it should fit. Why does it fail to run?
2. Following your advice I also tried model parallelism with --deepspeed deepspeed_zero_stage3_config.json, but both machines still report insufficient GPU memory.
3. I want full-parameter training, not QLoRA, LoRA or any other PEFT method. Can two machines handle that? Is this a code problem or a resource problem? If it is resources, how many machines would I need, and would adding machines still end in the same out-of-GPU-memory error?
4. You replied "change --nproc_per_node 8 to --nproc_per_node 1, don't use data parallelism, otherwise memory won't be enough". But changing nproc_per_node from 8 to 1 means only one GPU gets used, right? I tried it and it is even less able to run. If by model parallelism you mean --deepspeed deepspeed_zero_stage3_config.json, I tried that too and it still reports insufficient GPU memory.
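
(For context on question 1: ZeRO stage 2 partitions only the gradients and optimizer states, not the parameters, so every GPU still holds a full bf16 copy of Yi-34B, roughly 34.4B params × 2 bytes ≈ 64 GiB; add the ~8 GiB fp32 optimizer partition that has already been allocated and you are at about the 72.18 GiB the error reports, leaving no room for the next 8.01 GiB allocation. Only ZeRO stage 3 also partitions the parameters. Below is a minimal ZeRO-3 config sketch with CPU offload of parameters and optimizer states, the usual way to fit full-parameter training of a model this size; the "auto" values are filled in from the Trainer arguments, and this is an illustration, not a config shipped with the repo.)

{
    "bf16": {
        "enabled": "auto"
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}

Passing this file via --deepspeed (and dropping --device_map auto, which is normally not combined with DeepSpeed-managed sharding) trades GPU memory for host RAM and PCIe traffic, so steps get slower but the optimizer-state initialization no longer has to fit entirely on the GPUs.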

shibing624 commented 5 months ago

Pipeline parallelism does not suffer from low GPU utilization, and it is not the case that only one card is working; it maximizes GPU utilization by having multiple GPUs work together. Pipeline parallelism is what transformers officially supports.

You have 2 machines, so just split the data into 2 parts and run one part on each machine.

If you want multi-node multi-GPU training, go study DeepSpeed. Please don't ask environment questions here; they are out of scope for this project.
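
(For anyone who needs true multi-node training later: a minimal sketch using the deepspeed launcher with a hostfile, assuming passwordless SSH between the nodes and identical environments on both; the hostnames are placeholders and the pretraining.py arguments are trimmed to the essentials.)

# hostfile: one line per node, 8 GPU slots each (hostnames are placeholders)
cat > hostfile <<'EOF'
node-0 slots=8
node-1 slots=8
EOF

deepspeed --hostfile hostfile --master_port 14545 pretraining.py \
    --model_type auto \
    --model_name_or_path ../Yi-34B-Chat \
    --train_file_dir ./data/pretrain \
    --validation_file_dir ./data/pretrain \
    --per_device_train_batch_size 1 \
    --do_train \
    --use_peft False \
    --block_size 512 \
    --torch_dtype bfloat16 \
    --bf16 \
    --gradient_checkpointing True \
    --deepspeed deepspeed_zero_stage3_config.json \
    --output_dir outputs-pt-Yi-multinode \
    --overwrite_output_dir \
    --cache_dir ./cache

Unlike torchrun, the deepspeed launcher is started on the master node only; it reads the hostfile and spawns the worker processes on the other node over SSH.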

wangrx33 commented 4 months ago

Teacher Xu, this approach does run on a single machine, but GPU utilization is far too low that way: only one card is working at any given moment, which is not acceptable for real training. I also have multiple machines and want to use them together, which is why I want to use DeepSpeed. Could you help with the questions below? Or does the current code simply not support this?

1. According to your documentation, a full-parameter update of a 30B model needs about 600 GB of GPU memory. My two machines have 1280 GB in total, so in theory it should fit. Why does it fail to run? 2. Following your advice I also tried model parallelism with --deepspeed deepspeed_zero_stage3_config.json, but both machines still report insufficient GPU memory. 3. I want full-parameter training, not QLoRA, LoRA or any other PEFT method. Can two machines handle that? Is this a code problem or a resource problem? If it is resources, how many machines would I need, and would adding machines still end in the same out-of-GPU-memory error? 4. You replied "change --nproc_per_node 8 to --nproc_per_node 1, don't use data parallelism, otherwise memory won't be enough". But changing nproc_per_node from 8 to 1 means only one GPU gets used, right? I tried it and it is even less able to run. If by model parallelism you mean --deepspeed deepspeed_zero_stage3_config.json, I tried that too and it still reports insufficient GPU memory.

Did you ever solve this? I am also seeing OOM errors in cases where GPU memory should clearly be sufficient. Likewise, if I run without DeepSpeed, training is very slow: at any moment only one GPU is at 100% utilization while the others sit at 0.

shibing624 commented 4 months ago

You can try parallel training tools such as Megatron or DeepSpeed to speed it up yourself; I'm not familiar with that area. On my side it runs quite fast on A100s.

onex7777 commented 4 months ago

Hi, a single 16 GB V100 can't load Qwen-7B for DPO. How should I change the code to use DeepSpeed ZeRO-3?

shibing624 commented 4 months ago

Just add a ZeRO-3 config file and run with it.
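
(A sketch of what that can look like for DPO on a single 16 GB V100, assuming the repo's dpo_training.py accepts the standard Hugging Face --deepspeed argument and flags similar to pretraining.py above, and that a ZeRO-3 + CPU-offload config like the one sketched earlier is saved as deepspeed_zero_stage3_config.json. The script name, data path and flag names are assumptions to be checked against the repo; also note that the V100 has no bf16 support, so fp16 is used instead.)

# Single V100: ZeRO-3 with CPU offload, fp16 instead of bf16 (no bf16 on V100).
# dpo_training.py and its flags are assumptions -- check the repo's DPO example script for the exact names.
deepspeed --include localhost:0 dpo_training.py \
    --model_type auto \
    --model_name_or_path Qwen/Qwen-7B-Chat \
    --train_file_dir ./data/reward \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --use_peft False \
    --fp16 \
    --gradient_checkpointing True \
    --deepspeed deepspeed_zero_stage3_config.json \
    --output_dir outputs-dpo-qwen \
    --overwrite_output_dir \
    --cache_dir ./cache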