tensorflow / models

Models and examples built with TensorFlow

Deeplab - num_clones unexpected convergence behaviour #7815

Open matteopresutto opened 4 years ago

matteopresutto commented 4 years ago

System information

Describe the problem

There are differences in convergence between runs with num_clones > 1 and runs on a single GPU, even when the minibatch size is the same. I isolated the problem by training the same model on the same dataset with exactly the same configuration and a minibatch size of 6 on Tesla P100s, once without clones and once with 3 clones. These are, for example, the regularization losses: [images: regularization_loss_3_clones, regularization_loss_no_clones]

The first image is the regularization loss with 3 clones, the second one without clones; their convergence patterns don't match. For comparison, take another experiment: in the following two runs the training configuration is exactly the same (only 1 clone), and the only difference is that in one of them the dataset is preprocessed by a simple script I wrote: [image: regularization_loss] As the plot shows, the two curves are very close to each other even though the data was preprocessed in one run and not in the other. Did anyone else experience this problem?

@aquariusjay

This is the script I use to train and monitor without clones (with clones I just add --num_clones=3); a compact looped version of the same sequence is sketched after these commands:

python deeplab/train.py \
    --logtostderr \
    --training_number_of_steps=1000 \
    --train_split="train" \
    --model_variant="xception_65" \
    --base_learning_rate=0.0001 \
    --atrous_rates=6 \
    --atrous_rates=12 \
    --atrous_rates=18 \
    --output_stride=16 \
    --decoder_output_stride=4 \
    --train_crop_size='513,513' \
    --train_batch_size=6 \
    --dataset=<my_data_name> \
    --tf_initial_checkpoint="$(echo ~)/semantic_segmentation/tensorflow-models/research/deeplab/pretrained/deeplabv3_pascal_trainval/model.ckpt" \
    --train_logdir="$(echo ~)/semantic_segmentation/tensorflow-models/research/deeplab/trainlogs" \
    --dataset_dir=<data_dir>

python deeplab/eval.py \
    --logtostderr \
    --eval_split="val" \
    --model_variant="xception_65" \
    --atrous_rates=6 \
    --atrous_rates=12 \
    --atrous_rates=18 \
    --output_stride=16 \
    --decoder_output_stride=4 \
    --eval_crop_size='1201,1201' \
    --dataset=<my_data> \
    --checkpoint_dir="$(echo ~)/semantic_segmentation/tensorflow-models/research/deeplab/trainlogs" \
    --eval_logdir="$(echo ~)/semantic_segmentation/tensorflow-models/research/deeplab/trainlogs/eval" \
    --dataset_dir=<my_data>

python deeplab/train.py \
    --logtostderr \
    --training_number_of_steps=2000 \
    --train_split="train" \
    --model_variant="xception_65" \
    --base_learning_rate=0.0001 \
    --atrous_rates=6 \
    --atrous_rates=12 \
    --atrous_rates=18 \
    --output_stride=16 \
    --decoder_output_stride=4 \
    --train_crop_size='513,513' \
    --train_batch_size=6 \
    --dataset=<my_data_name> \
    --tf_initial_checkpoint="$(echo ~)/semantic_segmentation/tensorflow-models/research/deeplab/pretrained/deeplabv3_pascal_trainval/model.ckpt" \
    --train_logdir="$(echo ~)/semantic_segmentation/tensorflow-models/research/deeplab/trainlogs" \
    --dataset_dir=<data_dir>

python deeplab/eval.py \
    --logtostderr \
    --eval_split="val" \
    --model_variant="xception_65" \
    --atrous_rates=6 \
    --atrous_rates=12 \
    --atrous_rates=18 \
    --output_stride=16 \
    --decoder_output_stride=4 \
    --eval_crop_size='1201,1201' \
    --dataset=<my_data> \
    --checkpoint_dir="$(echo ~)/semantic_segmentation/tensorflow-models/research/deeplab/trainlogs" \
    --eval_logdir="$(echo ~)/semantic_segmentation/tensorflow-models/research/deeplab/trainlogs/eval" \
    --dataset_dir=<my_data>

python deeplab/train.py \
    --logtostderr \
    --training_number_of_steps=3000 \
    --train_split="train" \
    --model_variant="xception_65" \
    --base_learning_rate=0.0001 \
    --atrous_rates=6 \
    --atrous_rates=12 \
    --atrous_rates=18 \
    --output_stride=16 \
    --decoder_output_stride=4 \
    --train_crop_size='513,513' \
    --train_batch_size=6 \
    --dataset=<my_data_name> \
    --tf_initial_checkpoint="$(echo ~)/semantic_segmentation/tensorflow-models/research/deeplab/pretrained/deeplabv3_pascal_trainval/model.ckpt" \
    --train_logdir="$(echo ~)/semantic_segmentation/tensorflow-models/research/deeplab/trainlogs" \
    --dataset_dir=<data_dir>

python deeplab/eval.py \
    --logtostderr \
    --eval_split="val" \
    --model_variant="xception_65" \
    --atrous_rates=6 \
    --atrous_rates=12 \
    --atrous_rates=18 \
    --output_stride=16 \
    --decoder_output_stride=4 \
    --eval_crop_size='1201,1201' \
    --dataset=<my_data> \
    --checkpoint_dir="$(echo ~)/semantic_segmentation/tensorflow-models/research/deeplab/trainlogs" \
    --eval_logdir="$(echo ~)/semantic_segmentation/tensorflow-models/research/deeplab/trainlogs/eval" \
    --dataset_dir=<my_data>

python deeplab/train.py \
    --logtostderr \
    --training_number_of_steps=4000 \
    --train_split="train" \
    --model_variant="xception_65" \
    --base_learning_rate=0.0001 \
    --atrous_rates=6 \
    --atrous_rates=12 \
    --atrous_rates=18 \
    --output_stride=16 \
    --decoder_output_stride=4 \
    --train_crop_size='513,513' \
    --train_batch_size=6 \
    --dataset=<my_data_name> \
    --tf_initial_checkpoint="$(echo ~)/semantic_segmentation/tensorflow-models/research/deeplab/pretrained/deeplabv3_pascal_trainval/model.ckpt" \
    --train_logdir="$(echo ~)/semantic_segmentation/tensorflow-models/research/deeplab/trainlogs" \
    --dataset_dir=<data_dir>

python deeplab/eval.py \
    --logtostderr \
    --eval_split="val" \
    --model_variant="xception_65" \
    --atrous_rates=6 \
    --atrous_rates=12 \
    --atrous_rates=18 \
    --output_stride=16 \
    --decoder_output_stride=4 \
    --eval_crop_size='1201,1201' \
    --dataset=<my_data> \
    --checkpoint_dir="$(echo ~)/semantic_segmentation/tensorflow-models/research/deeplab/trainlogs" \
    --eval_logdir="$(echo ~)/semantic_segmentation/tensorflow-models/research/deeplab/trainlogs/eval" \
    --dataset_dir=<my_data>

python deeplab/train.py \
    --logtostderr \
    --training_number_of_steps=5000 \
    --train_split="train" \
    --model_variant="xception_65" \
    --base_learning_rate=0.0001 \
    --atrous_rates=6 \
    --atrous_rates=12 \
    --atrous_rates=18 \
    --output_stride=16 \
    --decoder_output_stride=4 \
    --train_crop_size='513,513' \
    --train_batch_size=6 \
    --dataset=<my_data_name> \
    --tf_initial_checkpoint="$(echo ~)/semantic_segmentation/tensorflow-models/research/deeplab/pretrained/deeplabv3_pascal_trainval/model.ckpt" \
    --train_logdir="$(echo ~)/semantic_segmentation/tensorflow-models/research/deeplab/trainlogs" \
    --dataset_dir=<data_dir>

python deeplab/eval.py \
    --logtostderr \
    --eval_split="val" \
    --model_variant="xception_65" \
    --atrous_rates=6 \
    --atrous_rates=12 \
    --atrous_rates=18 \
    --output_stride=16 \
    --decoder_output_stride=4 \
    --eval_crop_size='1201,1201' \
    --dataset=<my_data> \
    --checkpoint_dir="$(echo ~)/semantic_segmentation/tensorflow-models/research/deeplab/trainlogs" \
    --eval_logdir="$(echo ~)/semantic_segmentation/tensorflow-models/research/deeplab/trainlogs/eval" \
    --dataset_dir=<my_data>
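
The five train/eval rounds above differ only in --training_number_of_steps, so the same sequence can also be driven from a small loop. A minimal sketch, assuming the same placeholder dataset names (<my_data_name>, <my_data>, <data_dir>) and the same log directories as in the shell commands above:

import os
import subprocess

HOME = os.path.expanduser('~')
DEEPLAB = os.path.join(HOME, 'semantic_segmentation/tensorflow-models/research/deeplab')
TRAIN_LOGDIR = os.path.join(DEEPLAB, 'trainlogs')

for steps in range(1000, 6000, 1000):
    # Train up to the next checkpoint milestone.
    subprocess.run([
        'python', 'deeplab/train.py',
        '--logtostderr',
        '--training_number_of_steps=%d' % steps,
        '--train_split=train',
        '--model_variant=xception_65',
        '--base_learning_rate=0.0001',
        '--atrous_rates=6', '--atrous_rates=12', '--atrous_rates=18',
        '--output_stride=16',
        '--decoder_output_stride=4',
        '--train_crop_size=513,513',
        '--train_batch_size=6',
        '--dataset=<my_data_name>',
        '--tf_initial_checkpoint=' + os.path.join(
            DEEPLAB, 'pretrained/deeplabv3_pascal_trainval/model.ckpt'),
        '--train_logdir=' + TRAIN_LOGDIR,
        '--dataset_dir=<data_dir>',
    ], check=True)
    # Evaluate the latest checkpoint.
    subprocess.run([
        'python', 'deeplab/eval.py',
        '--logtostderr',
        '--eval_split=val',
        '--model_variant=xception_65',
        '--atrous_rates=6', '--atrous_rates=12', '--atrous_rates=18',
        '--output_stride=16',
        '--decoder_output_stride=4',
        '--eval_crop_size=1201,1201',
        '--dataset=<my_data>',
        '--checkpoint_dir=' + TRAIN_LOGDIR,
        '--eval_logdir=' + os.path.join(TRAIN_LOGDIR, 'eval'),
        '--dataset_dir=<my_data>',
    ], check=True)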

With --num_clones=3 it is indeed using 3 clones, and nvidia-smi looks like this:

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     46390      C   python                                      8679MiB |
|    1     46390      C   python                                      9191MiB |
|    2     46390      C   python                                      9191MiB |
|    3     46390      C   python                                       281MiB |
+-----------------------------------------------------------------------------+

matteopresutto commented 4 years ago

@YknZhu @huihui-personal @gpapan I am adding new evidence for this issue in the hope of shedding more light on it. This is how the runs I described above converge and how they compare on my dataset. The IoUs for all of my classes: [images: IOU_class0, IOU_class1, IOU_class2, IOU_class3]

Precisions: [images: precision_class0, precision_class1, precision_class2, precision_class3]

Recalls: [images: recall_class0, recall_class1, recall_class2, recall_class3]

Learning rate comparison: [image: learning_rate]

Logits losses: [image: logits_loss]

Mean IoU: [image: miou]

Regularization loss: [image: regularization_loss]

And the total loss: [image: total_loss]

There are differences in the logits loss, the regularization loss, and the total loss.

YknZhu commented 4 years ago

Thanks for the detailed analysis! Note that using multiple clones doesn't guarantee exactly the same behavior as a single clone with a larger batch size (e.g., batch normalization statistics).
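
To make the batch-norm point concrete, here is a small NumPy sketch (illustrative only, not DeepLab code) showing that statistics computed per clone on sub-batches of 2 are not the same as statistics computed once on the full batch of 6, even when gradients are averaged across clones:

import numpy as np

rng = np.random.default_rng(0)
batch = rng.normal(loc=1.0, scale=2.0, size=(6, 4))  # global batch of 6, 4 channels

# Single clone: BN statistics over the full batch of 6.
full_mean = batch.mean(axis=0)
full_var = batch.var(axis=0)

# Three clones: each normalizes with statistics from its own sub-batch of 2.
clones = np.split(batch, 3, axis=0)
clone_means = np.stack([c.mean(axis=0) for c in clones])
clone_vars = np.stack([c.var(axis=0) for c in clones])

# The average of the per-clone means equals the full-batch mean ...
print(np.allclose(clone_means.mean(axis=0), full_mean))  # True
# ... but the average of the per-clone variances misses the between-clone
# component, so normalized activations and BN moving statistics differ.
print(np.allclose(clone_vars.mean(axis=0), full_var))    # False in general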

We also updated the clone implementation (reverting back to slim) in https://github.com/tensorflow/models/pull/7788. Would you like to give it a try?

matteopresutto commented 4 years ago

Thanks for the reply @YknZhu, and apologies for the late response. I dug deeper into the issue following your advice, and indeed the batch-norm update ops in TensorFlow's UPDATE_OPS collection are separate for each clone. It seems that the moving averages and moving variances are updated locally for each clone, while the gammas and betas are updated synchronously (since they are updated with gradients that are collected and averaged before each descent step). Does the new #7788 commit address the asynchronous update_ops for the moving averages and variances?
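
A rough TF 1.x sketch of the pattern being described (illustrative only, not the actual deeplab/train.py or slim model_deploy code): each clone's batch-norm layer registers its own moving-average update ops in tf.GraphKeys.UPDATE_OPS under that clone's scope, while gamma/beta are shared trainable variables whose gradients can be averaged across clones:

import tensorflow as tf  # TF 1.x

images = tf.placeholder(tf.float32, [None, 65, 65, 3])

for clone_idx in range(3):  # --num_clones=3
    with tf.name_scope('clone_%d' % clone_idx):
        net = tf.layers.conv2d(images, 32, 3, name='conv', reuse=tf.AUTO_REUSE)
        net = tf.layers.batch_normalization(
            net, training=True, name='bn', reuse=tf.AUTO_REUSE)

# Each clone adds its own moving-mean/variance assign ops to UPDATE_OPS, so
# which statistics reach the moving averages depends on which scope the
# training loop collects its update ops from (e.g. only the first clone):
all_updates = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
first_clone_updates = tf.get_collection(tf.GraphKeys.UPDATE_OPS, scope='clone_0')
print(len(all_updates), len(first_clone_updates))  # 6 vs 2 in this toy graph

# gamma/beta (and the conv kernel) are single shared variables, so their
# gradients can be averaged across clones before each descent step.
print([v.name for v in tf.trainable_variables()])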

P.S.: my clone of the repository is from late August of this year.