tolsicsse opened this issue 1 year ago
Yes, it is possible. I intend to update the README with a lot of things including this after I merge your PR. For the time being, please refer to this. For example, if you intend to train on 2 GPUs, the command should be like this:
```
python -m torch.distributed.launch --nproc_per_node=2 --use_env train.py --model fasterrcnn_resnet18 --config data_configs/voc.yaml --world-size 2 --batch-size 32 --workers 2 --epochs 135 --use-train-aug --project-name fasterrcnn_resnet18_voc_aug_135e
```
After `python -m torch.distributed.launch --nproc_per_node=2 --use_env`, the usual training command follows.
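With `--use_env`, the launcher does not pass a `--local_rank` argument; instead it exports `RANK`, `WORLD_SIZE`, and `LOCAL_RANK` as environment variables for each spawned process. As a rough sketch of what the training script is then expected to do (the helper name and exact structure in this repository may differ):

```python
# Hedged sketch of the distributed setup a script launched with --use_env is
# expected to perform; the actual helper in this repository may be structured
# differently.
import os
import torch
import torch.distributed as dist

def init_distributed_mode():
    # torch.distributed.launch --use_env exports these variables for every
    # spawned process instead of passing a --local_rank argument.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])

    torch.cuda.set_device(local_rank)
    dist.init_process_group(
        backend="nccl", init_method="env://",
        world_size=world_size, rank=rank,
    )
    dist.barrier()
    return local_rank
```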
I got the distributed execution to work. However, I noticed two things:
I will check the logging issue that you mention above.
Regarding the best model issues, can you elaborate on whether you are trying to resume training or using the best model weights on some other dataset without resuming?
In any case, if you want to resume training, I recommend using last_model.pth, as it also loads the optimizer state dictionary. The mAP value when resuming from this model will be close to the one you stopped training at, but most probably not identical. It will take 2-3 epochs to orient itself back to the same mAP.
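For reference, a minimal sketch of the save/resume pattern described above; the checkpoint key names are assumptions and may not match the ones this repository actually writes:

```python
# Hedged sketch of saving and resuming with both model and optimizer state.
# The key names ("epoch", "model_state_dict", "optimizer_state_dict") are
# assumptions about this repository's checkpoint format.
import torch

def save_last(model, optimizer, epoch, path="last_model.pth"):
    torch.save({
        "epoch": epoch,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    }, path)

def resume_from_last(model, optimizer, path="last_model.pth", device="cpu"):
    ckpt = torch.load(path, map_location=device)
    model.load_state_dict(ckpt["model_state_dict"])
    optimizer.load_state_dict(ckpt["optimizer_state_dict"])
    return ckpt["epoch"] + 1  # epoch to continue training from
```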
I use last_model.pth, and as you can see above, when I resumed training at epoch 200 the performance dropped, and new "best" mAP values started being saved even though the model had been better before. Also, at epoch 200 I switched from training on 1 GPU to 4 GPUs, which might be why the variance decreased.
Also, when I resume training with last_model.pth after training on 4 GPUs, I get the error below. It seems that the class head is not saved correctly.
```
Traceback (most recent call last):
  File "fasterrcnn-pytorch-training-pipeline-new/train.py", line 505, in
```
Thanks for reporting this. I am not sure how the head would differ after training on 4 GPUs compared to a single GPU, but I will surely check this out.
I also think that I need to implement SyncBN for distributed training, which I have not done yet.
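SyncBN conversion itself is a one-liner in PyTorch. A minimal sketch of where it would go, assuming the model is wrapped in DistributedDataParallel afterwards (the helper name is illustrative, and where it slots into this repository's train.py is an assumption):

```python
# Hedged sketch: convert BatchNorm layers to SyncBatchNorm before wrapping the
# model in DistributedDataParallel.
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_model_for_ddp(model: torch.nn.Module, local_rank: int) -> torch.nn.Module:
    # Replace every BatchNorm layer so batch statistics are synchronized
    # across all processes instead of being computed per GPU.
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
    model = model.cuda(local_rank)
    # device_ids pins each replica to its own GPU.
    return DDP(model, device_ids=[local_rank])
```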
Is there any instruction to follow for setting up a distributed run, preferably on a single machine with multiple GPUs? I can see that it should be possible, but I don't understand how to do it.