tolsicsse opened this issue 1 year ago
Yes, it is possible. I intend to update the README with a lot of things including this after I merge your PR. For the time being, please refer to this. For example, if you intend to train on 2 GPUs, the command should be like this:
```
python -m torch.distributed.launch --nproc_per_node=2 --use_env train.py --model fasterrcnn_resnet18 --config data_configs/voc.yaml --world-size 2 --batch-size 32 --workers 2 --epochs 135 --use-train-aug --project-name fasterrcnn_resnet18_voc_aug_135e
```
After `python -m torch.distributed.launch --nproc_per_node=2 --use_env`, the usual training command follows.
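With `--use_env`, the launcher does not pass a `--local_rank` argument; instead it exports `RANK`, `WORLD_SIZE`, and `LOCAL_RANK` as environment variables for each spawned process. As a rough sketch of what the training script is then expected to do (the helper name and exact structure in this repository may differ):

```python
# Hedged sketch of the distributed setup a script launched with --use_env is
# expected to perform; the actual helper in this repository may be structured
# differently.
import os
import torch
import torch.distributed as dist

def init_distributed_mode():
    # torch.distributed.launch --use_env exports these variables for every
    # spawned process instead of passing a --local_rank argument.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])

    torch.cuda.set_device(local_rank)
    dist.init_process_group(
        backend="nccl", init_method="env://",
        world_size=world_size, rank=rank,
    )
    dist.barrier()
    return local_rank
```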
I got the distributed execution to work. However, I noticed two things:
I will check the logging issue that you mention above.
Regarding the best model issues, can you elaborate on whether you are trying to resume training or using the best model weights on some other dataset without resuming?
In any case, if you want to resume training, I recommend using last_model.pth, as it also loads the optimizer state dictionary. The mAP value when resuming from this model will be close to the one you stopped training at, but most probably not identical. It will take 2-3 epochs to orient itself back to the same mAP.
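For reference, a minimal sketch of the save/resume pattern described above; the checkpoint key names are assumptions and may not match the ones this repository actually writes:

```python
# Hedged sketch of saving and resuming with both model and optimizer state.
# The key names ("epoch", "model_state_dict", "optimizer_state_dict") are
# assumptions about this repository's checkpoint format.
import torch

def save_last(model, optimizer, epoch, path="last_model.pth"):
    torch.save({
        "epoch": epoch,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    }, path)

def resume_from_last(model, optimizer, path="last_model.pth", device="cpu"):
    ckpt = torch.load(path, map_location=device)
    model.load_state_dict(ckpt["model_state_dict"])
    optimizer.load_state_dict(ckpt["optimizer_state_dict"])
    return ckpt["epoch"] + 1  # epoch to continue training from
```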
I use last_model.pth, and as you can see above, when I resumed training at epoch 200 the performance dropped, and new "best" mAP values started being saved even though the model had been better before. Also, at epoch 200 I switched from training on 1 GPU to 4 GPUs, which might be why the variance decreased.
Also, when I resume training with last_model.pth after training on 4 GPUs, I get the error below. It seems that the class head is not saved correctly.
```
Traceback (most recent call last):
  File "fasterrcnn-pytorch-training-pipeline-new/train.py", line 505, in
```
Thanks for reporting this. I am not sure how the head would differ after training on 4 GPUs compared to a single GPU, but I will surely check this out.
I also think that I need to implement SyncBN for distributed training, which I have not done yet.
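SyncBN conversion itself is a one-liner in PyTorch. A minimal sketch of where it would go, assuming the model is wrapped in DistributedDataParallel afterwards (the helper name is illustrative, and where it slots into this repository's train.py is an assumption):

```python
# Hedged sketch: convert BatchNorm layers to SyncBatchNorm before wrapping the
# model in DistributedDataParallel.
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_model_for_ddp(model: torch.nn.Module, local_rank: int) -> torch.nn.Module:
    # Replace every BatchNorm layer so batch statistics are synchronized
    # across all processes instead of being computed per GPU.
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
    model = model.cuda(local_rank)
    # device_ids pins each replica to its own GPU.
    return DDP(model, device_ids=[local_rank])
```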
Is there any instruction to follow for setting up a distributed run, preferably on a single machine with multiple GPUs? I can see that it should be possible, but I don't understand how to do it.