...(omitted)...
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
2023-11-01 10:40:18.920 | DEBUG | __main__:main:1097 - A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.</s>USER: What medication should I take for impotence? My sex life used to be normal, but lately my libido has dropped and sometimes even getting an erection is difficult. I have tried many remedies with no effect. A friend said my condition might be premature ejaculation; what medications treat premature ejaculation? ASSISTANT: Impotence and premature ejaculation in men often arise from sexual overindulgence, or from sexual activity too early in youth, leading to decline of the mingmen fire and deficiency-cold of essence and qi; or from worry and melancholy injuring the heart and spleen; or from fear damaging the kidneys; in some cases damp-heat pouring downward slackens the ancestral sinew and causes impotence. The principal cause, however, is decline of kidney yang. Kidney yang is the root of the body's yang qi: it warms the body, vaporizes fluids, and promotes growth and development. When kidney yang declines, warming fails and qi transformation loses its power, producing aversion to cold, cold limbs, and diminished sexual function; hence the weak or incomplete erections, often accompanied by dizziness.</s>
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
2023-11-01 10:40:20.649 | INFO | __main__:main:1213 - Fine-tuning method: LoRA(PEFT)
2023-11-01 10:40:20.649 | INFO | __main__:main:1218 - Init new peft model
2023-11-01 10:40:20.650 | INFO | __main__:main:1227 - Peft target_modules: ['dense', 'dense_4h_to_h', 'dense_h_to_4h', 'query_key_value']
2023-11-01 10:40:20.650 | INFO | __main__:main:1228 - Peft lora_rank: 8
2023-11-01 10:40:20.947 | INFO | __main__:main:1213 - Fine-tuning method: LoRA(PEFT)
2023-11-01 10:40:20.947 | INFO | __main__:main:1218 - Init new peft model
2023-11-01 10:40:20.947 | INFO | __main__:main:1213 - Fine-tuning method: LoRA(PEFT)
2023-11-01 10:40:20.947 | INFO | __main__:main:1218 - Init new peft model
2023-11-01 10:40:20.947 | INFO | __main__:main:1227 - Peft target_modules: ['dense', 'dense_4h_to_h', 'dense_h_to_4h', 'query_key_value']
2023-11-01 10:40:20.947 | INFO | __main__:main:1228 - Peft lora_rank: 8
2023-11-01 10:40:20.947 | INFO | __main__:main:1227 - Peft target_modules: ['dense', 'dense_4h_to_h', 'dense_h_to_4h', 'query_key_value']
2023-11-01 10:40:20.947 | INFO | __main__:main:1228 - Peft lora_rank: 8
2023-11-01 10:40:20.948 | INFO | __main__:main:1213 - Fine-tuning method: LoRA(PEFT)
2023-11-01 10:40:20.948 | INFO | __main__:main:1218 - Init new peft model
2023-11-01 10:40:20.949 | INFO | __main__:main:1227 - Peft target_modules: ['dense', 'dense_4h_to_h', 'dense_h_to_4h', 'query_key_value']
2023-11-01 10:40:20.949 | INFO | __main__:main:1228 - Peft lora_rank: 8
trainable params: 3,145,728 || all params: 562,360,320 || trainable%: 0.5593794384354857
2023-11-01 10:40:24.389 | INFO | __main__:main:1274 - *** Train ***
trainable params: 3,145,728 || all params: 562,360,320 || trainable%: 0.5593794384354857
trainable params: 3,145,728 || all params: 562,360,320 || trainable%: 0.5593794384354857
trainable params: 3,145,728 || all params: 562,360,320 || trainable%: 0.5593794384354857
2023-11-01 10:40:24.690 | INFO | __main__:main:1274 - *** Train ***
2023-11-01 10:40:24.695 | INFO | __main__:main:1274 - *** Train ***
2023-11-01 10:40:24.709 | INFO | __main__:main:1274 - *** Train ***
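As a cross-check, the trainable-parameter count reported above is exactly what LoRA rank 8 on those four target modules yields for a bloom-560m-style architecture. The layer count and hidden sizes below are assumptions inferred from the totals, not stated in the log:

```python
# LoRA adds two low-rank matrices A (r x in) and B (out x r) per target
# linear layer, i.e. r * (in_features + out_features) extra parameters.
r = 8
hidden, ffn = 1024, 4096  # assumed bloom-560m dimensions
layers = 24               # assumed number of transformer blocks

# (in_features, out_features) of each target module per block
target_modules = {
    "query_key_value": (hidden, 3 * hidden),  # fused QKV projection
    "dense":           (hidden, hidden),
    "dense_h_to_4h":   (hidden, ffn),
    "dense_4h_to_h":   (ffn, hidden),
}

per_layer = sum(r * (fin + fout) for fin, fout in target_modules.values())
trainable = per_layer * layers
print(trainable)  # 3145728 -- matches "trainable params: 3,145,728" in the log
```

The identical count on all four ranks is expected: each process builds its own full copy of the PEFT model before DDP synchronizes them.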
GPU status reported by nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100S-PCI... On | 00000000:00:0D.0 Off | 0 |
| N/A 41C P0 56W / 250W | 1917MiB / 32768MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100S-PCI... On | 00000000:00:0E.0 Off | 0 |
| N/A 39C P0 51W / 250W | 1917MiB / 32768MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100S-PCI... On | 00000000:00:0F.0 Off | 0 |
| N/A 39C P0 53W / 250W | 1917MiB / 32768MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100S-PCI... On | 00000000:00:10.0 Off | 0 |
| N/A 40C P0 53W / 250W | 1917MiB / 32768MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 4049068 C ...iant/anaconda3/bin/python 1913MiB |
| 1 N/A N/A 4049069 C ...iant/anaconda3/bin/python 1913MiB |
| 2 N/A N/A 4049070 C ...iant/anaconda3/bin/python 1913MiB |
| 3 N/A N/A 4049071 C ...iant/anaconda3/bin/python 1913MiB |
+-----------------------------------------------------------------------------+
SFT training hangs on 2 machines × 8 GPUs
Single-machine 4-GPU training works fine for both PT and SFT, but with 2-machine 8-GPU distributed training, the run hangs during SFT.
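Not a diagnosis, but 100% GPU utilization with under 2 GiB of memory in use (as in the nvidia-smi output above) is a typical signature of ranks spinning inside a NCCL collective that never completes. A common first step is to turn on NCCL's own logging and rule out inter-node transport problems. The settings below are standard NCCL environment variables; `eth0` is a placeholder for whichever interface the two machines actually reach each other on:

```python
import os

# NCCL reads these at initialization time, so set them in the launch
# environment (or at the very top of the training script, before any
# torch.distributed / NCCL initialization).
os.environ["NCCL_DEBUG"] = "INFO"             # log init, ring setup, collectives
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"  # focus on startup and transport
os.environ["NCCL_IB_DISABLE"] = "1"           # rule out InfiniBand as the cause
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"     # placeholder: the NIC both nodes share
```

If the resulting INFO logs show both nodes stuck during ring/channel setup, firewall rules between the nodes or a NIC name that differs across machines are common culprits.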
Help, please! Could someone take a look at what might be causing this?
The exact scripts and logs are as follows:
Script (master node):
Script (worker node):
Log (master node):
Log (worker node):
GPU status reported by nvidia-smi: