open-mmlab / mmsegmentation

OpenMMLab Semantic Segmentation Toolbox and Benchmark.
https://mmsegmentation.readthedocs.io/en/main/
Apache License 2.0

slurm problem #3591

Open scar-on opened 3 months ago

scar-on commented 3 months ago

I ran the code on a server managed by Slurm. My environment and data did not change between runs; I only ran the code at different times. Judging from the error output, this seems to be a communication problem with multi-threaded running under Slurm. The error is as follows:

[Image: 微信图片_20240312115551 (WeChat screenshot of the error output)]

My running command is: CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 GPUS=8 MASTER_PORT=29500 sh tools/slurm_train.sh gpu5 mae
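For anyone trying to reproduce this: the command pins MASTER_PORT=29500 and the failure is described as communication-related, so one thing that might be worth ruling out is a port clash with another job on the same node. This is only a guess, since the error screenshot is not reproduced in text. The sketch below uses only the Python standard library; the helper names `port_is_free` and `pick_free_port` are hypothetical and not part of mmsegmentation.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # Typically "address already in use": something else holds the port.
            return False
        return True


def pick_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a currently unused TCP port (port 0 means 'any')."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))
        return s.getsockname()[1]


if __name__ == "__main__":
    # 29500 is the value used in the command above.
    port = 29500
    if not port_is_free(port):
        port = pick_free_port()
    print(port)
```

The printed value could then be passed as MASTER_PORT. Note that the check only says something about the machine it runs on, so it would have to run on the node that hosts rank 0, and another process may still grab the port between the check and the job start.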

My environment is: [Image: 微信图片_20240312115906 (WeChat screenshot of the environment)]

Has anyone else experienced the same problem?