**matteopresutto** opened this issue 4 years ago.
@YknZhu @huihui-personal @gpapan I am adding new evidence for this issue in the hope of shedding more light on it. This is how the runs I described above converge and how they compare on my dataset:

- Per-class IoUs: [plot]
- Precisions: [plot]
- Recalls: [plot]
- Learning rate comparison: [plot]
- Logits losses: [plot]
- Mean IoU: [plot]
- Regularization loss: [plot]
- Total loss: [plot]
There are visible differences in the logits loss, the regularization loss, and the total loss.
Thanks for the detailed analysis! Note that using multiple clones doesn't guarantee exactly the same behavior as a single clone with a larger batch size (e.g., batch normalization).
We also updated the clone implementation (reverting to slim) in https://github.com/tensorflow/models/pull/7788. Would you like to give it a try?
Thanks for the reply @YknZhu, and apologies for the late response. I dug deeper into the issue following your advice, and indeed the batch-norm update ops in TensorFlow's `UPDATE_OPS` collection are separate for each clone. It seems that the moving averages and moving variances are updated locally for each clone, while the gammas and betas are updated synchronously (they are updated with gradients that are collected and averaged before each descent step). Does the new #7788 commit address the asynchronous updating of the moving averages and variances across clones?
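To illustrate the pattern I am describing, here is a minimal TF 1.x sketch (not the actual `deeplab/train.py` code; the layer names and `clone_%d` scopes are assumptions mimicking slim's `model_deploy`):

```python
import tensorflow as tf  # TF 1.x

def clone_fn(images):
    # Tiny model with batch norm. gamma/beta are ordinary trainable variables,
    # while the moving mean/variance are updated by ops that the layer adds to
    # the tf.GraphKeys.UPDATE_OPS collection.
    net = tf.layers.conv2d(images, 8, 3, padding='same', name='conv')
    net = tf.layers.batch_normalization(net, training=True, name='bn')
    return tf.reduce_mean(tf.square(net))

num_clones = 3
losses = []
with tf.variable_scope(tf.get_variable_scope()):
    for i in range(num_clones):
        with tf.name_scope('clone_%d' % i):
            images = tf.random_normal([2, 33, 33, 3])  # per-clone batch of 2
            losses.append(clone_fn(images))
            # Share conv/BN variables across clones; each clone still creates
            # its OWN moving-average update ops under its name scope.
            tf.get_variable_scope().reuse_variables()

all_updates = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
first_clone_updates = tf.get_collection(tf.GraphKeys.UPDATE_OPS, 'clone_0')
print(len(all_updates), len(first_clone_updates))  # 6 vs 2: one pair per clone

# gamma/beta are updated through the averaged gradients, so they stay in
# sync; a slim-style deployment typically applies only the first clone's
# BN update ops alongside the train op:
total_loss = tf.add_n(losses) / num_clones
train_op = tf.train.MomentumOptimizer(0.007, 0.9).minimize(total_loss)
train_tensor = tf.group(train_op, *first_clone_updates)
```

So the gamma/beta updates flow through the averaged gradients, while each clone's batch statistics generate separate update ops, and which of those ops actually get applied depends on how the deployment gathers them.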
P.S.: my clone of the repository is from late August of this year.
**System information**
- **What is the top-level directory of the model you are using**: tensorflow-models/research/deeplab
- **Have I written custom code (as opposed to using a stock example script provided in TensorFlow)**: Yes
- **OS Platform and Distribution (e.g., Linux Ubuntu 16.04)**: Ubuntu 16.04.6 LTS
- **TensorFlow installed from (source or binary)**: binary
- **TensorFlow version (use command below)**: TensorFlow 1.13.1
- **Bazel version (if compiling from source)**: N/A (installed from binary)
- **CUDA/cuDNN version**: cuDNN 7.4.2
- **GPU model and memory**: NVIDIA Tesla P100, 16280 MiB
- **Exact command to reproduce**: see the training script below
**Describe the problem**
There are differences in convergence between runs with `num_clones > 1` and runs on a single GPU, even when the minibatch size is the same. I isolated the problem by training the same model on the same dataset with exactly the same configuration and minibatch size 6 on Tesla P100s, once without clones and once with 3 clones. These are the results for, as an example, the regularization loss:

[two plots]

The first image is the regularization loss with 3 clones, the second one without clones. Their convergence patterns do not match. For comparison, take another experiment: in the following two runs the training configurations are exactly the same (only 1 clone), and the only difference between the two is that in one the dataset is preprocessed by a simple script I made:

[two plots]

As is evident, they are very close to each other even though the data was preprocessed in one run and not in the other. Has anyone else experienced this problem?
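To make the batch-norm angle concrete: with 3 clones and a global minibatch of 6, each clone normalizes with statistics computed over only 2 samples. Below is a small numerical sketch (made-up data, not from my dataset) of why those per-clone statistics cannot reproduce the single-GPU statistics even at the same total batch size:

```python
import numpy as np

np.random.seed(0)
batch = np.random.randn(6, 4)  # global batch of 6 samples, 4 features

# Single clone: BN moments computed over all 6 samples.
var_single = batch.var(axis=0)

# Three clones: each computes moments over its local batch of 2 and issues
# its own moving-average updates from them.
clone_vars = [b.var(axis=0) for b in np.split(batch, 3)]

# Averaging the per-clone variances does NOT recover the global variance:
# it misses the spread between the clone means (law of total variance).
print(var_single)
print(np.mean(clone_vars, axis=0))
```

The per-clone variance estimates are biased low relative to the full-batch estimate, so the moving statistics accumulated during a multi-clone run drift away from those of a single-GPU run.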
@aquariusjay
This is the script I use to train and monitor without clones (with clones I just add `--num_clones=3`):
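A representative `deeplab/train.py` invocation looks roughly like the following (flag values are illustrative, taken from the DeepLab documentation, not my exact configuration):

```bash
python deeplab/train.py \
  --logtostderr \
  --model_variant="xception_65" \
  --atrous_rates=6 --atrous_rates=12 --atrous_rates=18 \
  --output_stride=16 \
  --train_crop_size=513 --train_crop_size=513 \
  --train_batch_size=6 \
  --fine_tune_batch_norm=true \
  --training_number_of_steps=90000 \
  --dataset="pascal_voc_seg" \
  --tf_initial_checkpoint=/path/to/initial_checkpoint \
  --train_logdir=/path/to/train_logdir \
  --dataset_dir=/path/to/tfrecord_dataset
```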
With `--num_clones=3` it is indeed using 3 clones, and `nvidia-smi` looks like this:

```
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     46390      C   python                                      8679MiB |
|    1     46390      C   python                                      9191MiB |
|    2     46390      C   python                                      9191MiB |
|    3     46390      C   python                                       281MiB |
+-----------------------------------------------------------------------------+
```