jwyang / graph-rcnn.pytorch

[ECCV 2018] Official code for "Graph R-CNN for Scene Graph Generation"

Same code, Different Machine, Different Loss #88

Closed sangminwoo closed 4 years ago

sangminwoo commented 4 years ago

Hi @jwyang, I'm using two separate machines with slightly different, but almost identical, environments (a quick script to dump and compare the versions is sketched after the specs below).

1) RTX 2080 Ti (x 2) python=3.6.9 gcc=7.3.0 torch=1.3.0 cuda=10.1

2) RTX 2080 Ti (x 4) python=3.6.10 gcc=7.3.0 torch=1.3.0 cuda=10.2
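
For reference, this is roughly how I dump the versions on each machine to compare them side by side; these are all standard PyTorch/stdlib calls, nothing repo-specific:

```python
# Quick environment dump to compare the two machines side by side.
import sys
import torch

print("python:", sys.version.split()[0])
print("torch :", torch.__version__)
print("cuda  :", torch.version.cuda)
print("cudnn :", torch.backends.cudnn.version())
print("gpus  :", torch.cuda.device_count())
```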

While training the model on both machines, the latter one seems to converge too fast. I've double-checked that the two copies of the code are identical.

Below are the captures. The first one is converging ordinarily, the second one is converging too early, and I don't know why `loss_pred_classifier` is always 0 in the latter case (a zero-loss sanity check I'm adding is sketched after the captures).

[Capture 20200306_173754: converging well]

[Capture 20200306_173329: converging too fast]
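
As a sanity check, this is roughly what I'm adding to the training loop to log whenever a loss term is exactly zero; `loss_dict` is a placeholder for whatever dict of loss tensors the trainer logs each step, not the repo's actual variable names:

```python
# Minimal sketch of a zero-loss check for the training loop.
# `loss_dict` and `step` are placeholder names, not the repo's own.
def warn_on_zero_losses(loss_dict, step):
    for name, value in loss_dict.items():
        if float(value) == 0.0:
            print(f"[step {step}] WARNING: {name} is exactly 0 "
                  f"- does its branch receive proposals/targets?")
```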

Any clue on how to solve this problem?

Thanks in advance

P.S. Both are trained "step-wise" and use the same saved object-detection checkpoint. The capture for the latter machine uses 4 GPUs, but it shows the same results with 2 GPUs.

sangminwoo commented 4 years ago

It seems like torch > 1.0.x produces unexpected results. For example, training Faster R-CNN with torch==1.1.0 on multiple GPUs gives lower performance than on a single GPU, as discussed in #36. In my case, both torch==1.2.0 and torch==1.3.0 also show lower performance when using multiple GPUs.

Maybe sticking to torch==1.0.x is the best choice when training this repo.
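
If it helps anyone, a simple guard at the top of the training script can catch an unpinned environment early; this is just a sketch based on my own runs, not something in the repo:

```python
import torch

# Known-good major/minor version for this repo in my runs; 1.1-1.3 gave
# degraded multi-GPU results for me, so fail fast if the env drifts.
EXPECTED_PREFIX = "1.0."

if not torch.__version__.startswith(EXPECTED_PREFIX):
    raise RuntimeError(
        "torch=={} detected; multi-GPU training behaved unexpectedly for me "
        "on versions above 1.0.x, consider pinning torch to 1.0.x".format(
            torch.__version__
        )
    )
```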