lxtGH / DecoupleSegNets

[ECCV-2020]: Improving Semantic Segmentation via Decoupled Body and Edge Supervision

Best miou in the second stage gets lower #20

Closed zkluo closed 3 years ago

zkluo commented 3 years ago

Hi, I followed the two-stage training instructions to reproduce the model, but I find that the best mIoU (0.783) of the second-stage model (deeplabv3-decouple) is lower than that (0.786) of the first-stage model (deeplabv3). I don't know why. My logs follow:

First I run sh ./scripts/train/train_cityscapes_ResNet50_deeplab.sh to train the base model:

#!/usr/bin/env bash
now=$(date +"%Y%m%d_%H%M%S")
EXP_DIR=./body_edge/ResNet50_FCN_m4_decouple_ft_175_e
mkdir -p ${EXP_DIR}
# Example on Cityscapes by resnet50-deeplabv3+ as baseline
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 --master_port 29501 train.py \
  --dataset cityscapes \
  --cv 0 \
  --arch network.deepv3.DeepR50V3PlusD_m1_deeply \
  --class_uniform_pct 0.0 \
  --class_uniform_tile 1024 \
  --max_cu_epoch 150 \
  --lr 0.01 \
  --lr_schedule poly \
  --poly_exp 1.0 \
  --repoly 1.5  \
  --rescale 1.0 \
  --syncbn \
  --sgd \
  --crop_size 832 \
  --scale_min 0.5 \
  --scale_max 2.0 \
  --color_aug 0.25 \
  --gblur \
  --max_epoch 80 \
  --ohem \
  --wt_bound 1.0 \
  --bs_mult 1 \
  --exp cityscapes_ft \
  --ckpt ${EXP_DIR}/ \
  --tb_path ${EXP_DIR}/ \
  --apex \
  2>&1 | tee  ${EXP_DIR}/log_${now}.txt &

Then I get a deeplabv3 model with reasonable performance (mIoU 0.786); log details: log_2020_11_21_02_07_50_rank_0.log

Next, I run sh ./scripts/train/train_cityscapes_ResNet50_deeplab_decouple.sh for the second-stage training:

#!/usr/bin/env bash
now=$(date +"%Y%m%d_%H%M%S")
EXP_DIR=./body_edge/ResNet50_FCN_m4_decouple_ft_83_e_20201205_1
mkdir -p ${EXP_DIR}
# Example on Cityscapes by resnet50-deeplabv3+ as baseline
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 --master_port 29503 train.py \
  --dataset cityscapes \
  --cv 0 \
  --snapshot ./body_edge/ResNet50_FCN_m4_decouple_ft_175_e/cityscapes_ft/city-network.deepv3.DeepR50V3PlusD_m1_deeply_apex_T_bs_mult_1_class_uniform_pct_0.0_crop_size_832_cv_0_lr_0.01_ohem_T_sbn/best_epoch_79_mean-iu_0.78649.pth \
  --arch network.deepv3_decouple.DeepR50V3PlusD_m1_deeply \
  --class_uniform_pct 0.5 \
  --class_uniform_tile 1024 \
  --max_cu_epoch 150 \
  --lr 0.001 \
  --lr_schedule poly \
  --poly_exp 1.0 \
  --repoly 1.5  \
  --rescale 1.0 \
  --sgd \
  --crop_size 832 \
  --scale_min 0.5 \
  --scale_max 2.0 \
  --color_aug 0.25 \
  --gblur \
  --max_epoch 83 \
  --ohem \
  --jointwtborder \
  --syncbn \
  --apex \
  --joint_edgeseg_loss \
  --wt_bound 1.0 \
  --bs_mult 2 \
  --exp cityscapes_ft_83_20201205_1 \
  --ckpt ${EXP_DIR}/ \
  --tb_path ${EXP_DIR}/ \
  2>&1 | tee  ${EXP_DIR}/log_${now}.txt &

The differences from your setting are:

  1. 8 GPUs with bs_mult 1 -> 4 GPUs with bs_mult 2 (see the check after this list)
  2. 175 epochs -> 83 epochs
  3. lr 0.005 -> lr 0.001
  4. the fix for the loss NaN issue: https://github.com/lxtGH/DecoupleSegNets/issues/19#issuecomment-739658296

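Regarding item 1: the effective (total) batch size is actually unchanged, so by the usual linear learning-rate scaling heuristic the lr would not need to change on that account. A quick check, assuming bs_mult is the per-GPU batch size:

# Effective batch size = number of GPUs x per-GPU batch (bs_mult)
GPUS=8; BS_MULT=1; echo "original setting: total batch = $((GPUS * BS_MULT))"   # 8
GPUS=4; BS_MULT=2; echo "modified setting: total batch = $((GPUS * BS_MULT))"   # 8
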
However, I get a worse result (mIoU 0.783 < 0.786) despite the reduced learning rate and epoch count; log details: log_20201205_095232.txt

I feel confused. Could you help me?

bobp26 commented 3 years ago

Hi, have you solved this problem? Following the author's scripts, I trained on other datasets and modified the learning rate and number of epochs many times, but the accuracy never exceeds the first-stage result.

lxtGH commented 3 years ago

@bobp26 @zkluo Hi, it is a little strange. I do not maintain this codebase anymore. However, I recently found that not using the boundary relaxation loss leads to better performance. Using mmseg (https://github.com/open-mmlab/mmsegmentation), the performance is better than in the paper. I will make a pull request on mmseg.

lxtGH commented 3 years ago

Refer to this issue: https://github.com/lxtGH/DecoupleSegNets/issues/22

stillwaterman commented 3 years ago

> @bobp26 @zkluo Hi, it is a little strange. I do not maintain this codebase anymore. However, I recently found that not using the boundary relaxation loss leads to better performance. Using mmseg (https://github.com/open-mmlab/mmsegmentation), the performance is better than in the paper. I will make a pull request on mmseg.

So, what kind of loss function leads to better performance than the boundary relaxation loss? Or do you mean the network does not need the body-part loss to participate in training?
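
For example, if "not using the boundary relaxation loss" just means removing the --jointwtborder flag from the second-stage script (my guess at which switch enables that loss), would something like the following rough sketch be the right way to try it? (the _norelax.sh name is only a hypothetical copy)

# Copy the second-stage script with the (assumed) boundary-relaxation flag removed, then run it
sed '/--jointwtborder/d' ./scripts/train/train_cityscapes_ResNet50_deeplab_decouple.sh \
    > ./scripts/train/train_cityscapes_ResNet50_deeplab_decouple_norelax.sh
sh ./scripts/train/train_cityscapes_ResNet50_deeplab_decouple_norelax.sh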