Closed Frankey419 closed 1 year ago
Running it directly with deepspeed keeps failing. I'm not sure whether this is related to my settings.
WARNING: No preset parameters were found for the device that Open MPI detected:
Local host: ***
Device name: mlx5_4
Device vendor ID: 0x02c9
Device vendor part ID: 4123
Default device parameters will be used, which may result in lower performance. You can edit any of the files specified by the btl_openib_device_param_files MCA parameter to set values for your device.
NOTE: You can turn off this warning by setting the MCA parameter btl_openib_warn_no_device_params_found to 0.
Segmentation fault (core dumped)
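The NOTE in the log refers to an MCA parameter. Open MPI also reads MCA parameters from environment variables prefixed with `OMPI_MCA_`, so a minimal sketch of silencing the warning from Python could look like the following. The helper name `env_without_openib_warning` and the launch command in the comment are hypothetical, and this only hides the warning; it is not expected to fix the segmentation fault.

```python
import os


def env_without_openib_warning(base_env=None):
    """Return a copy of the environment with the Open MPI device-params warning disabled.

    Hypothetical helper: Open MPI picks up MCA parameters from variables
    named OMPI_MCA_<parameter name>.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OMPI_MCA_btl_openib_warn_no_device_params_found"] = "0"
    return env


# Usage sketch (the command line is an assumption, not taken from the thread):
#   import subprocess
#   subprocess.run(["deepspeed", "train.py"], env=env_without_openib_warning())
```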
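Just change devices in data_utils.py. Also, for single-machine multi-GPU training, install NCCL and reinstall DeepSpeed. If you have questions about using DeepSpeed, refer to the DeepSpeed troubleshooting guide.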
Multi-node, multi-GPU training example with 3 machines and 4 GPUs per machine: in train.py, set num_nodes = 3 on the Trainer, then run one command on each machine:
MASTER_ADDR=10.0.0.1 MASTER_PORT=6667 WORLD_SIZE=12 NODE_RANK=0 python train.py
MASTER_ADDR=10.0.0.1 MASTER_PORT=6667 WORLD_SIZE=12 NODE_RANK=1 python train.py
MASTER_ADDR=10.0.0.1 MASTER_PORT=6667 WORLD_SIZE=12 NODE_RANK=2 python train.py
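The pattern in the commands above is that WORLD_SIZE is the number of machines times the GPUs per machine (3 × 4 = 12), and NODE_RANK counts the machines from 0. A minimal sketch that generates the per-machine commands is shown below; `launch_commands` is a hypothetical helper for illustration, not part of the repository.

```python
def launch_commands(num_nodes, gpus_per_node, master_addr, master_port, script="train.py"):
    """Build one launch command per machine, following the pattern in the example above."""
    # Total number of training processes across all machines.
    world_size = num_nodes * gpus_per_node
    commands = []
    for node_rank in range(num_nodes):
        env = {
            "MASTER_ADDR": master_addr,
            "MASTER_PORT": str(master_port),
            "WORLD_SIZE": str(world_size),
            "NODE_RANK": str(node_rank),
        }
        # Dicts preserve insertion order, so the variables appear in the order above.
        prefix = " ".join(f"{key}={value}" for key, value in env.items())
        commands.append(f"{prefix} python {script}")
    return commands


if __name__ == "__main__":
    # Reproduces the three commands from the example: 3 machines, 4 GPUs each.
    for command in launch_commands(3, 4, "10.0.0.1", 6667):
        print(command)
```

Other machine and GPU counts can be passed the same way; whether the resulting command works also depends on the Trainer settings in train.py matching those counts, which is an assumption here.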
Could you give a similar example? For instance, a single machine with 2 GPUs.