Closed — sangminwoo closed this issue 4 years ago
It seems that torch > 1.0.x produces unexpected results. For example, in my case training Faster R-CNN with torch==1.1.0 on multiple GPUs yields lower performance than on a single GPU, as discussed in #36, and both torch==1.2.0 and torch==1.3.0 also show lower performance with multiple GPUs.
Maybe sticking to torch==1.0.x is the best choice when training with this repo.
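If it helps, a small guard at startup can catch an incompatible torch build before a long training run. This is just a sketch: the `parse_torch_version` helper and the `(1, 0)` bound are my own, not part of this repo.

```python
def parse_torch_version(v):
    """Return (major, minor) from a torch version string.

    Handles local build suffixes like "1.3.0+cu101" and
    post-releases like "1.0.1.post2".
    """
    core = v.split("+")[0]            # drop "+cu101"-style local suffixes
    parts = core.split(".")
    return tuple(int(p) for p in parts[:2])


# Hypothetical startup check (uncomment inside the training script):
# import torch
# if parse_torch_version(torch.__version__) > (1, 0):
#     print("Warning: torch > 1.0.x has shown degraded multi-GPU results "
#           "with this repo; consider pinning torch==1.0.x.")
```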
Hi @jwyang, I'm using two separate machines with different environments, but they are almost the same:
1) RTX 2080 Ti (x 2) python=3.6.9 gcc=7.3.0 torch=1.3.0 cuda=10.1
2) RTX 2080 Ti (x 4) python=3.6.10 gcc=7.3.0 torch=1.3.0 cuda=10.2
While training the model on both machines, the latter one seems to converge too quickly. I've double-checked that the two codebases are identical.
Below are the captures. The first one converges normally; the second converges too early. I also don't know why "loss_pred_classifier" is always 0 in the latter case.
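One thing worth ruling out (just a guess, not a confirmed diagnosis): `nn.DataParallel` gathers each replica's scalar loss into a tensor with one entry per GPU, so the training loop has to average that tensor before calling `.backward()`; if a single entry is indexed instead of reduced, a loss term can silently read as 0. A minimal plain-Python stand-in for the reduction step, with a function name of my own choosing:

```python
def reduce_dataparallel_losses(per_gpu_losses):
    """Average the per-replica losses gathered by nn.DataParallel.

    DataParallel returns one scalar per GPU replica; averaging them
    recovers the mini-batch loss to backpropagate.
    """
    if not per_gpu_losses:
        raise ValueError("expected at least one per-replica loss")
    return sum(per_gpu_losses) / len(per_gpu_losses)
```

With, say, per-GPU losses `[0.7, 0.5]`, this yields a single scalar `0.6` rather than one replica's value.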
Any clue how to solve this problem?
Thanks in advance.
P.S. Both were trained step-wise and used the same save file for object detection. The capture for the latter one uses 4 GPUs, but it still shows the same results with 2 GPUs.