ssbuild / chatglm_finetuning

chatglm 6b finetuning and alpaca finetuning
How to set up single-machine multi-GPU training #190

Closed Frankey419 closed 1 year ago

Frankey419 commented 1 year ago

Multi-machine, multi-GPU training example: 3 machines with 4 GPUs each. In train.py, set the Trainer's num_nodes = 3, then run one command per machine:
MASTER_ADDR=10.0.0.1 MASTER_PORT=6667 WORLD_SIZE=12 NODE_RANK=0 python train.py
MASTER_ADDR=10.0.0.1 MASTER_PORT=6667 WORLD_SIZE=12 NODE_RANK=1 python train.py
MASTER_ADDR=10.0.0.1 MASTER_PORT=6667 WORLD_SIZE=12 NODE_RANK=2 python train.py

Could you give a similar example, for instance a single machine with 2 GPUs?
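For reference, the multi-machine example quoted above follows a simple pattern: WORLD_SIZE is the number of machines times the GPUs per machine (3 × 4 = 12), and NODE_RANK counts the machines from 0. A minimal sketch that reproduces those three command lines is shown here; the helper name `launch_commands` is hypothetical and not part of the repository.

```python
def launch_commands(num_nodes, gpus_per_node, master_addr, master_port,
                    script="train.py"):
    """Build one launch command per machine (hypothetical helper).

    WORLD_SIZE is the total number of processes across all machines;
    NODE_RANK identifies each machine, starting at 0.
    """
    world_size = num_nodes * gpus_per_node
    return [
        f"MASTER_ADDR={master_addr} MASTER_PORT={master_port} "
        f"WORLD_SIZE={world_size} NODE_RANK={node_rank} python {script}"
        for node_rank in range(num_nodes)
    ]


# Values taken from the example above: 3 machines, 4 GPUs each.
commands = launch_commands(3, 4, "10.0.0.1", 6667)
for command in commands:
    print(command)
```

Every machine uses the same MASTER_ADDR and MASTER_PORT; only NODE_RANK differs between them.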

Frankey419 commented 1 year ago

Running it directly with deepspeed keeps failing. I'm not sure whether this is related to my settings.

WARNING: No preset parameters were found for the device that Open MPI detected:

Local host: ***
Device name: mlx5_4
Device vendor ID: 0x02c9
Device vendor part ID: 4123

Default device parameters will be used, which may result in lower performance. You can edit any of the files specified by the btl_openib_device_param_files MCA parameter to set values for your device.

NOTE: You can turn off this warning by setting the MCA parameter btl_openib_warn_no_device_params_found to 0.

Segmentation fault (core dumped)

ssbuild commented 1 year ago

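Just change devices in data_utils.py. Also, for single-machine multi-GPU training, install NCCL and then reinstall DeepSpeed. If you have questions about using DeepSpeed, refer to the DeepSpeed troubleshooting documentation.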
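As a rough illustration of the single-machine, 2-GPU case asked about above, the sketch below sets devices to 2 and keeps the number of machines at 1. Only the name devices and the file data_utils.py come from the reply; the dict name `train_info_args`, the helper `trainer_kwargs`, and the assumption that the Trainer takes PyTorch Lightning style arguments (`accelerator`, `devices`, `num_nodes`) are illustrative guesses, so check them against the actual code.

```python
# Hypothetical stand-in for the settings in data_utils.py. Only the key
# `devices` is named in the reply; the dict name is an assumption.
train_info_args = {
    "devices": 2,  # number of GPUs to use on this machine
}


def trainer_kwargs(config, num_nodes=1):
    """Translate the config into Lightning-style Trainer arguments (assumed)."""
    devices = int(config["devices"])
    if devices < 1:
        raise ValueError("devices must be at least 1")
    return {"accelerator": "gpu", "devices": devices, "num_nodes": num_nodes}


kwargs = trainer_kwargs(train_info_args)
# Total number of training processes: GPUs per machine times machines.
world_size = kwargs["devices"] * kwargs["num_nodes"]
print(kwargs, world_size)
```

With one machine and two GPUs the total process count is 2, compared with 12 in the three-machine example.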