I ran the code on a server managed by Slurm. My environment and data did not change; I only ran the same code at different times. The error output seems to point to a communication problem when running multi-threaded under Slurm. The error is as follows:
My running command is:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 GPUS=8 MASTER_PORT=29500 sh tools/slurm_train.sh gpu5 mae
My environment is:

![WeChat image_20240312115906](https://github.com/open-mmlab/mmsegmentation/assets/57258378/63ebe871-832e-4c5e-8f22-8c8b6b09110f)
I don't know if anyone has experienced the same problem as me.