lxtGH / DecoupleSegNets

[ECCV-2020]: Improving Semantic Segmentation via Decoupled Body and Edge Supervision

About parameters in the Re-Train step. #11

Closed: moonfoam closed this issue 4 years ago

moonfoam commented 4 years ago

I have trained the coarse network following the script './scripts/train/train_cityscapes_ResNet50_deeplab.sh' for 97 epochs and got a good base model.

However, when I try to refine it with the script './scripts/train/train_ciytscapes_ResNet50_deeplab_decouple.sh', training crashed with NaN because of "no valid label".

I use 2x RTX 3090 GPUs with CUDA 11.1 and PyTorch 1.7; maybe the crash is caused by the different GPUs. So first I want to check whether the parameters in these scripts are the ones you used (for 180 epochs in total)?

The scripts I used are as follows:

train_cityscapes_ResNet50_deeplab.sh

#!/usr/bin/env bash
now=$(date +"%Y%m%d_%H%M%S")
EXP_DIR=./body_edge/ResNet50_deeplab_cie_pretrain97
mkdir -p ${EXP_DIR}
# Example on Cityscapes by resnet50-deeplabv3+ as baseline
# <---- this run was stopped at epoch 97
python -m torch.distributed.launch --nproc_per_node=2 train.py \
  --dataset cityscapes \
  --cv 0 \
  --arch network.deepv3.DeepR50V3PlusD_m1_deeply \
  --class_uniform_pct 0.0 \
  --class_uniform_tile 1024 \
  --max_cu_epoch 150 \
  --lr 0.01 \
  --lr_schedule poly \
  --poly_exp 1.0 \
  --repoly 1.5  \
  --rescale 1.0 \
  --syncbn \
  --sgd \
  --crop_size 832 \
  --scale_min 0.5 \
  --scale_max 2.0 \
  --color_aug 0.25 \
  --gblur \
  --max_epoch 180 \
  --ohem \
  --wt_bound 1.0 \
  --bs_mult 4 \
  --apex \
  --exp cityscapes_bs8_pretrain97 \
  --ckpt ${EXP_DIR}/ \
  --tb_path ${EXP_DIR}/ \
  2>&1 | tee  ${EXP_DIR}/log_${now}.txt &

train_ciytscapes_ResNet50_deeplab_decouple.sh

#!/usr/bin/env bash
now=$(date +"%Y%m%d_%H%M%S")
EXP_DIR=./body_edge/ResNet50_FCN_m4_decouple_ft_83_e
mkdir -p ${EXP_DIR}
# Example on Cityscapes by resnet50-deeplabv3+ as baseline
python -m torch.distributed.launch --nproc_per_node=2 train.py \
  --dataset cityscapes \
  --cv 0 \
  --snapshot body_edge/ResNet50_deeplab_cie_pretrain97/cityscapes_bs8_pretrain97/city-network.deepv3.DeepR50V3PlusD_m1_deeply_apex_T_bs_mult_4_class_uniform_pct_0.0_crop_size_832_cv_0_ie_T_lr_0.01_ohem_T_sbn/best_epoch_97_mean-iu_0.77674_flag.pth \
  --arch network.deepv3_decouple.DeepR50V3PlusD_m1_deeply \
  --class_uniform_pct 0.5 \
  --class_uniform_tile 1024 \
  --max_cu_epoch 150 \
  --lr 0.005 \
  --lr_schedule poly \
  --poly_exp 1.0 \
  --repoly 1.5  \
  --rescale 1.0 \
  --syncbn \
  --sgd \
  --crop_size 832 \
  --scale_min 0.5 \
  --scale_max 2.0 \
  --color_aug 0.25 \
  --gblur \
  --max_epoch 83 \
  --ohem \
  --jointwtborder \
  --joint_edgeseg_loss \
  --wt_bound 1.0 \
  --bs_mult 4 \
  --apex \
  --exp cityscapes_ft_83 \
  --ckpt ${EXP_DIR}/ \
  --tb_path ${EXP_DIR}/ \
  2>&1 | tee  ${EXP_DIR}/log_${now}.txt &
lxtGH commented 4 years ago

It looks good to me; I didn't run into such problems. The only different setting is that I use 8 V100 GPUs with 1 image per GPU.

moonfoam commented 4 years ago

@lxtGH Thanks for your reply. I think there is not much difference with syncBN when the total batch size stays the same. Here is my log file. I reduced the learning rate from 0.005 to 0.002 and it seemed to run well; however, it suddenly crashed and I don't know why. Do you know what happened? log_20201002_142033.txt
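To narrow down where it blows up, one option is to check each loss term for finite values before backpropagating. Below is only a minimal sketch with placeholder names (seg/body/edge), not the repository's actual variables:

import torch

def check_losses(loss_dict, step):
    # Fail fast with a useful message if any loss term is non-finite.
    for name, value in loss_dict.items():
        if not torch.isfinite(value):
            raise FloatingPointError(
                "step %d: '%s' loss is %s; inspect its inputs/labels"
                % (step, name, value.item()))

# Example with dummy scalar losses standing in for the real seg/body/edge terms:
losses = {"seg": torch.tensor(0.7),
          "body": torch.tensor(float("nan")),
          "edge": torch.tensor(0.2)}
check_losses(losses, step=100)   # raises, naming the 'body' term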

lxtGH commented 4 years ago

That looks very strange; it looks like the body loss becomes NaN. Maybe try lowering the weight of the body loss?
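For illustration, lowering that weight would look roughly like the sketch below; the variable names and the 0.5 factor are placeholders, not the repository's actual loss code:

import torch

# Placeholder scalar losses standing in for the real seg / body / edge terms.
seg_loss = torch.tensor(0.7, requires_grad=True)
body_loss = torch.tensor(0.3, requires_grad=True)
edge_loss = torch.tensor(0.2, requires_grad=True)

# Down-weight the body term (the one that goes NaN); 0.5 is an arbitrary example.
seg_w, body_w, edge_w = 1.0, 0.5, 1.0
total_loss = seg_w * seg_loss + body_w * body_loss + edge_w * edge_loss
total_loss.backward()
print(body_loss.grad)   # gradient scaled by body_w -> tensor(0.5000)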

moonfoam commented 4 years ago

@lxtGH Ok, I will check the output and adjust the parameters I used.

moonfoam commented 4 years ago

After setting the LR to 0.001, it has run stably for 20 epochs, and the problem seems solved. I guess it may be owing to the DDP training.

zkluo commented 3 years ago

I met the same error, and the solution in https://github.com/NVIDIA/semantic-segmentation/issues/29#issuecomment-560472406 works for me. The NaN problem may be caused by SGD.
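For reference, a generic mitigation when SGD updates occasionally blow up is to skip steps with a non-finite loss and clip the gradient norm. This is only a general sketch (with a stand-in model and loss), not necessarily what the linked fix does:

import torch
import torch.nn as nn

model = nn.Linear(8, 1)                        # stand-in for the real network
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

def train_step(inputs, targets):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(inputs), targets)   # stand-in loss
    if not torch.isfinite(loss):
        return None                            # skip the update on a bad batch
    loss.backward()
    # keep a rare exploding gradient from corrupting the weights
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    optimizer.step()
    return loss.item()

print(train_step(torch.randn(4, 8), torch.randn(4, 1)))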