vdabravolski closed this issue 4 years ago.
For the default setting, we train with multiple nodes on shared storage. If there is no shared storage for the nodes, you can specify gpu_collect = True in the evaluation field of the config file.
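For example, in an mmdetection config that change could look like the sketch below; the interval and metric values are illustrative, not taken from this issue.

```python
# Illustrative mmdetection config override: gather evaluation results over
# GPUs (gpu_collect=True) instead of writing partial results to a tmpdir
# that would have to live on shared storage.
evaluation = dict(
    interval=1,                # evaluate every epoch (illustrative)
    metric=['bbox', 'segm'],   # Mask R-CNN style metrics (illustrative)
    gpu_collect=True,          # collect results via GPU communication
)
```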
Thanks, that addressed my issue. Closing ticket now.
For reference, this setting needs to be updated in the base model config file: evaluation.gpu_collect=True
@vdabravolski Hello. I noticed you launched your training job separately on each node instead of using the slurm_train.sh script provided by mmdetection. I tried to use that script in mmdetection 1.0 to launch training over multiple nodes of a cluster managed with Slurm, but my training just won't start; I suspect Slurm failed to create the training processes. Have you experienced similar issues, given that you did not use the slurm_train.sh script either?
@HawkRong, I didn't use Slurm at all. I used Amazon SageMaker as the training cluster, and SageMaker doesn't support Slurm. Therefore, on each compute node I launched a number of separate training processes equal to the number of GPU devices, using Python's subprocess module to manage them. You can find my training script here: https://github.com/vdabravolski/mmdetection-sagemaker/blob/master/container_training/mmdetection_train.py#L213-L234
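For illustration only, a minimal sketch of that idea (this is not the linked SageMaker script, and the function name and arguments are hypothetical): each node runs torch.distributed.launch via subprocess, which in turn spawns one worker per GPU.

```python
# Hypothetical sketch, not the linked script: start mmdetection training on
# one node via torch.distributed.launch and wait for it to finish.
import os
import subprocess
import sys


def launch_node_training(num_gpus, num_nodes, node_rank,
                         master_addr, master_port,
                         config_path, work_dir):
    cmd = [
        sys.executable, "-m", "torch.distributed.launch",
        f"--nproc_per_node={num_gpus}",
        f"--nnodes={num_nodes}",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        "tools/train.py", config_path,
        "--launcher", "pytorch",
        "--work-dir", work_dir,
    ]
    # Inherit the current environment and block until all per-GPU workers
    # started by torch.distributed.launch on this node have exited.
    proc = subprocess.Popen(cmd, env=os.environ.copy())
    proc.wait()
    return proc.returncode
```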
@vdabravolski Since the cluster available to me only supports Slurm, and the launch script in mmdetection 1.0 does not seem to work in my case, I plan to mimic your approach. I'm wondering whether it will work if I use the 'sbatch' command in Slurm to launch several 1-node jobs separately, each executing 'python -m torch.distributed.launch --nproc_per_node=x --nnodes=y --node_rank=z ...'. Another related question: will 'torch.distributed.init_process_group' take care of waiting for all the separate processes I launch on different nodes?
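As a hedged sketch of what each launched worker ends up doing (illustrative code, not from this thread): with the env:// initialization that torch.distributed.launch sets up, init_process_group blocks until all world_size ranks have joined the rendezvous or the timeout expires.

```python
# Sketch of the rendezvous performed by every worker process (illustrative).
import datetime

import torch.distributed as dist

dist.init_process_group(
    backend="nccl",        # typical backend for multi-GPU training
    init_method="env://",  # MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE come from env vars
    timeout=datetime.timedelta(minutes=30),
)
# Reaching this point means all ranks have connected.
print(f"rank {dist.get_rank()} of {dist.get_world_size()} is ready")
```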
@vdabravolski The validation went well using mmdetection 1.0.
I trained Mask R-CNN with mmdet 2.5.0 on 2 GPUs and hit the error below during testing. After following your suggestion to specify gpu_collect = True in the evaluation field of the config file, the training process went well.
Error Message:
FileNotFoundError: [Errno 2] No such file or directory: 'xxxxxxxxxxxxx.eval_hook/part_1.pkl'
@FantasyJXF
I have also met this problem. Have you solved it?
Thanks for your error report and we appreciate it a lot.
Checklist
Describe the bug I'm running distributed training on Amazon SageMaker (an AWS ML service). Ideally, I'd like to train on COCO2017 from scratch on 4 p3.16xlarge nodes (each with 8 GPUs).
After starting the training (each training node invokes the command below), the training process goes as expected and I can see that the model is training successfully. However, after training completes and the script tries to run validation, it fails with the stack trace shown in the error traceback section below.
I suspect the error is caused by the validation hook not being adapted to a multi-node environment, so it cannot find the validation outputs of training processes outside of the first node.
If that's the case, I'd like to know what approach I can take to run validation in a multi-node environment. Do I need to create a custom validation hook?
Reproduction
What command or script did you run? See the training container below. Each container kicks off training with the command:
python -m torch.distributed.launch --nnodes 4 --node_rank 0 --nproc_per_node 8 --master_addr algo-1 --master_port 55555 /opt/ml/code/mmdetection/tools/train.py /opt/ml/code/updated_config.py --launcher pytorch --work-dir /opt/ml/output
Did you make any modifications on the code or config? Did you understand what you have modified? I used
configs/mask_rcnn/mask_rcnn_r50_fpn_1x_coco.py
with the only modification being to decrease the number of epochs to 1 (in order to speed up testing cycles). I used "--options" in tools/train.py to override the default number of training epochs.
What dataset did you use? COCO2017 training and validation only.
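For illustration, such an override would look roughly like the command below; the key name total_epochs is an assumption based on the 1x schedule configs of that era, not quoted from this issue.
python tools/train.py configs/mask_rcnn/mask_rcnn_r50_fpn_1x_coco.py --options total_epochs=1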
Environment
Please run
python mmdet/utils/collect_env.py
to collect the necessary environment information and paste it here, along with any related environment variables ($PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.).
See the dockerfile used for training:
Error traceback If applicable, paste the error traceback here.
Bug fix If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!