shibing624 / MedicalGPT

MedicalGPT: Training Your Own Medical GPT Model with ChatGPT Training Pipeline. Trains medical large language models, implementing incremental pretraining (PT), supervised fine-tuning (SFT), RLHF, DPO, and ORPO.
Apache License 2.0

Help wanted: SFT training hangs with 2 nodes x 8 GPUs #250

Closed: mymong closed this issue 10 months ago

mymong commented 11 months ago

SFT training hangs with 2 nodes x 8 GPUs

PT and SFT training both run without any problems on a single node with 4 GPUs, but when I test distributed training across 2 nodes with 8 GPUs, the job hangs during SFT.

Help, please! Could someone take a look at what might be causing this?

The scripts and logs are below:

Script (master node):

CUDA_VISIBLE_DEVICES=0,1,2,3 \
    torchrun \
    --nproc_per_node 4 \
    --nnodes 2 \
    --master_addr 10.130.1.109 \
    --master_port 7860 \
    --node_rank 0 \
    supervised_finetuning.py \
    --model_type bloom \
    --model_name_or_path merged-pt \
    --train_file_dir ./data/finetune \
    --validation_file_dir ./data/finetune \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --do_train \
    --do_eval \
    --use_peft True \
    --fp16 \
    --max_train_samples 1000 \
    --max_eval_samples 10 \
    --num_train_epochs 1 \
    --learning_rate 2e-5 \
    --warmup_ratio 0.05 \
    --weight_decay 0.05 \
    --logging_strategy steps \
    --logging_steps 10 \
    --eval_steps 50 \
    --evaluation_strategy steps \
    --save_steps 500 \
    --save_strategy steps \
    --save_total_limit 3 \
    --gradient_accumulation_steps 1 \
    --preprocessing_num_workers 1 \
    --output_dir outputs-sft-v1 \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --target_modules all \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --torch_dtype float16 \
    --device_map auto \
    --report_to tensorboard \
    --ddp_find_unused_parameters False \
    --gradient_checkpointing True 

Script (worker node):

CUDA_VISIBLE_DEVICES=0,1,2,3 \
    torchrun \
    --nproc_per_node 4 \
    --nnodes 2 \
    --master_addr 10.130.1.109 \
    --master_port 7860 \
    --node_rank 1 \
    supervised_finetuning.py \
    ...<same as above>...
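
A GPU pinned at 100% utilization while drawing only around 55 W (see the nvidia-smi output below) often means the ranks are busy-waiting inside a collective that never completes, which points at inter-node NCCL communication rather than the training code itself. A minimal debugging sketch, assuming the two nodes reach each other over an Ethernet interface such as eth0 (the interface name is an assumption; check it with ip addr):

# Export these on both nodes before launching torchrun, then relaunch with the same arguments as above.
export NCCL_DEBUG=INFO                  # print NCCL transport/topology selection per rank
export NCCL_SOCKET_IFNAME=eth0          # assumption: the NIC the two nodes use to reach each other
export NCCL_IB_DISABLE=1                # assumption: fall back to TCP if InfiniBand is not configured
export TORCH_DISTRIBUTED_DEBUG=DETAIL   # log which rank and which collective is being waited on

If it is a connectivity problem, the NCCL INFO log on the stuck ranks typically stops right after the transport setup lines, which narrows the search to firewall rules or the chosen interface.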

Log (master node):

...<omitted>...
labels: [tensor([ -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100, 99464,     2],
       device='cuda:0'), tensor([  -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,
          -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,
          -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,
          -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,
          -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,
          -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,
          -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,
          -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,
          -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,
          -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,
          -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,
          -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,
          -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,
          -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,
          -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,
          -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,
          -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,
          -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,
          -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,
          -100,   -100,   -100,   -100,   -100,   6768,   7786, 130015,      2],
       device='cuda:0'), tensor([  -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,
          -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,
          -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,
          -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,
          -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,
          -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,
          -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,
          -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,
          -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,
          -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,
          -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,
          -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,
          -100,  41381,    355,   9759,   8589,  12402,   1533,   1616,   2386,
         15388,   1570,  76353,  11111,    706,   9602,   7786,   2269,   4587,
         32622,    355,  21920,  67956,   1262,  21397,  74869,   9110,  19471,
         25011,    355,    706,   4198,   2405,   8107,  11812, 175337,    420,
          2293,   9759,   8589,  12402,   1570,  29434,   8967,    355, 100006,
         17549,   9759,   8589,  11575,   9759,   8589,   6167,    355,  90899,
         44498,   7640,  77684,    355,  67517,   5731,  17846,   5197,  11111,
        246141,    355,  11225, 168726,   6167,   4007, 117443,    420,      2],
       device='cuda:0')]
2023-11-01 10:40:24.574 | DEBUG    | __main__:main:1279 - Decode input_ids[0]: <pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.</s>USER: 原发性巨球蛋白血症的治愈率是多少? ASSISTANT:40%</s>
2023-11-01 10:40:24.579 | DEBUG    | __main__:main:1282 - Decode labels[0]: <pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>40%</s>
trainable params: 3,145,728 || all params: 562,360,320 || trainable%: 0.5593794384354857
trainable params: 3,145,728 || all params: 562,360,320 || trainable%: 0.5593794384354857

2023-11-01 10:40:24.722 | INFO     | __main__:main:1274 - *** Train ***
2023-11-01 10:40:24.722 | INFO     | __main__:main:1274 - *** Train ***

Log (worker node):

...<omitted>...
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
2023-11-01 10:40:18.920 | DEBUG    | __main__:main:1097 - A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.</s>USER: 治疗阳痿吃什么药呢?,性生活一直很正常的,但是这段时间感觉性欲变低了,有时勃起都感觉很困难,试过许多的方法都没效果,听朋友说我这种情况可能会是早泄,想知道治疗早泄的药物? ASSISTANT:男子早泄、早泄病症的再次发生,多由恣情纵欲,或青年误犯性交,至命门火衰,精气虚寒;或思量忧郁,伤损心脾;或因恐惧伤肾,也有因湿热下注,宗筋弛而痿的。但主要是肾阳虚衰而痿。肾阳为那身阳气之根本,有温煦形体,蒸化水液,增进围产生长发育等功能。肾阳虚衰则温煦失责,气化无权。因而再次发生畏寒肢冷,性机能减退。故见男子早泄不举或不坚,且伴发头晕目眩。</s>
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
2023-11-01 10:40:20.649 | INFO     | __main__:main:1213 - Fine-tuning method: LoRA(PEFT)
2023-11-01 10:40:20.649 | INFO     | __main__:main:1218 - Init new peft model
2023-11-01 10:40:20.650 | INFO     | __main__:main:1227 - Peft target_modules: ['dense', 'dense_4h_to_h', 'dense_h_to_4h', 'query_key_value']
2023-11-01 10:40:20.650 | INFO     | __main__:main:1228 - Peft lora_rank: 8
2023-11-01 10:40:20.947 | INFO     | __main__:main:1213 - Fine-tuning method: LoRA(PEFT)
2023-11-01 10:40:20.947 | INFO     | __main__:main:1218 - Init new peft model
2023-11-01 10:40:20.947 | INFO     | __main__:main:1213 - Fine-tuning method: LoRA(PEFT)
2023-11-01 10:40:20.947 | INFO     | __main__:main:1218 - Init new peft model
2023-11-01 10:40:20.947 | INFO     | __main__:main:1227 - Peft target_modules: ['dense', 'dense_4h_to_h', 'dense_h_to_4h', 'query_key_value']
2023-11-01 10:40:20.947 | INFO     | __main__:main:1228 - Peft lora_rank: 8
2023-11-01 10:40:20.947 | INFO     | __main__:main:1227 - Peft target_modules: ['dense', 'dense_4h_to_h', 'dense_h_to_4h', 'query_key_value']
2023-11-01 10:40:20.947 | INFO     | __main__:main:1228 - Peft lora_rank: 8
2023-11-01 10:40:20.948 | INFO     | __main__:main:1213 - Fine-tuning method: LoRA(PEFT)
2023-11-01 10:40:20.948 | INFO     | __main__:main:1218 - Init new peft model
2023-11-01 10:40:20.949 | INFO     | __main__:main:1227 - Peft target_modules: ['dense', 'dense_4h_to_h', 'dense_h_to_4h', 'query_key_value']
2023-11-01 10:40:20.949 | INFO     | __main__:main:1228 - Peft lora_rank: 8
trainable params: 3,145,728 || all params: 562,360,320 || trainable%: 0.5593794384354857
2023-11-01 10:40:24.389 | INFO     | __main__:main:1274 - *** Train ***
trainable params: 3,145,728 || all params: 562,360,320 || trainable%: 0.5593794384354857
trainable params: 3,145,728 || all params: 562,360,320 || trainable%: 0.5593794384354857
trainable params: 3,145,728 || all params: 562,360,320 || trainable%: 0.5593794384354857
2023-11-01 10:40:24.690 | INFO     | __main__:main:1274 - *** Train ***
2023-11-01 10:40:24.695 | INFO     | __main__:main:1274 - *** Train ***
2023-11-01 10:40:24.709 | INFO     | __main__:main:1274 - *** Train ***

GPU status reported by nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100S-PCI...  On   | 00000000:00:0D.0 Off |                    0 |
| N/A   41C    P0    56W / 250W |   1917MiB / 32768MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100S-PCI...  On   | 00000000:00:0E.0 Off |                    0 |
| N/A   39C    P0    51W / 250W |   1917MiB / 32768MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100S-PCI...  On   | 00000000:00:0F.0 Off |                    0 |
| N/A   39C    P0    53W / 250W |   1917MiB / 32768MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100S-PCI...  On   | 00000000:00:10.0 Off |                    0 |
| N/A   40C    P0    53W / 250W |   1917MiB / 32768MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   4049068      C   ...iant/anaconda3/bin/python     1913MiB |
|    1   N/A  N/A   4049069      C   ...iant/anaconda3/bin/python     1913MiB |
|    2   N/A  N/A   4049070      C   ...iant/anaconda3/bin/python     1913MiB |
|    3   N/A  N/A   4049071      C   ...iant/anaconda3/bin/python     1913MiB |
+-----------------------------------------------------------------------------+
mymong commented 11 months ago

It hangs at line 1286:

        train_result = trainer.train(resume_from_checkpoint=checkpoint)
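
To see exactly which call each rank is blocked in, one option is to attach py-spy to the hung processes (the PIDs are visible in the nvidia-smi process table above); a minimal sketch using one PID from that table as an example:

pip install py-spy
py-spy dump --pid 4049068   # print the Python stack of one hung training rank
# Repeat for the other PIDs; if every rank is sitting inside a torch.distributed collective, it is a communication hang.
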
shibing624 commented 11 months ago

A communication problem? I'm not sure about your environment. For multi-node, multi-GPU training you can use DeepSpeed. Please post the output of ds_report.
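
For reference, a minimal multi-node DeepSpeed launch sketch, assuming passwordless SSH between the two nodes, that supervised_finetuning.py accepts the standard transformers --deepspeed argument, and a ZeRO config file named ds_config.json (the second node's address, the hostfile contents, and the config file name are assumptions):

# hostfile: one line per node with its GPU slot count (the worker address is a placeholder)
cat > hostfile <<EOF
10.130.1.109 slots=4
<node2-ip> slots=4
EOF

# Launch from the master node; the DeepSpeed launcher starts the worker ranks over SSH.
deepspeed --hostfile=hostfile supervised_finetuning.py \
    --deepspeed ds_config.json \
    ...<same training arguments as above>...

# ds_report prints the DeepSpeed / CUDA / compiled-ops environment requested above.
ds_report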

nuoma commented 11 months ago

In my case it is full-parameter SFT on a single node with 6 GPUs, and it hangs after saving the checkpoint.