Can't reproduce model zoo results for upernet_deit-s16_512x512_80k_ade20k

RoganInglis commented 2 years ago

I've trained the upernet_deit-s16_512x512_80k_ade20k model, but it only reached a final mIoU of 0.2, compared to the 0.4296 reported in the model zoo. When I run the same evaluation script on the trained checkpoint from the model zoo, I get an mIoU of 0.4338.

Here are the logs from my training run in comparison to the ones from the model zoo here.

I downloaded the data from http://data.csail.mit.edu/places/ADEchallenge/ADEChallengeData2016.zip and the pretrained model from https://dl.fbaipublicfiles.com/deit/deit_small_patch16_224-cd65a155.pth and stored them in the default locations. I trained using a Docker image built from the official Dockerfile, with only RUN pip install tensorboard added at the end to enable TensorBoard logging. Similarly, I used the configs exactly as they are in the repo, apart from including the TensorboardLoggerHook line in the default_runtime.py config file.
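For context, the logging change described above would look roughly like the following in configs/_base_/default_runtime.py. This is only a sketch that assumes the MMCV 1.x style `log_config` layout; the `interval` value is an assumption and may differ from the file actually used in this run.

```python
# Sketch of the log_config section of configs/_base_/default_runtime.py
# with TensorBoard logging enabled. The interval value is an assumption.
log_config = dict(
    interval=50,  # log every 50 iterations (assumed value)
    hooks=[
        dict(type='TextLoggerHook', by_epoch=False),
        dict(type='TensorboardLoggerHook'),  # the added logging hook
    ])
```

A change like this only affects logging, so on its own it would not be expected to alter the training result.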

I trained the model using this command:

CONFIG_FILE=configs/vit/upernet_deit-s16_512x512_80k_ade20k.py
NUM_GPU=8
bash tools/dist_train.sh ${CONFIG_FILE} ${NUM_GPU}

Does anyone have any ideas about what might be the cause?

MeowZheng commented 2 years ago

Would you mind providing the full config, and saying where the default locations are?

RoganInglis commented 2 years ago

The default location for the pretrained checkpoint is mmsegmentation/pretrain/deit_small_patch16_224-cd65a155.pth, as referenced in the config at mmsegmentation/configs/vit/upernet_deit-s16_512x512_80k_ade20k.py. For the dataset it is mmsegmentation/data/ade/ADEChallengeData2016/ as referenced in mmsegmentation/configs/_base_/datasets/ade20k.py.
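As a quick sanity check, the two locations above can be verified with a short script. This is a minimal sketch, not part of mmsegmentation: the helper name `missing_paths` is hypothetical, and it assumes the script is run from the root of the mmsegmentation checkout.

```python
from pathlib import Path


def missing_paths(repo_root, relative_paths):
    """Return the relative paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [p for p in relative_paths if not (root / p).exists()]


# Locations named in this thread, relative to the mmsegmentation checkout.
EXPECTED = [
    "pretrain/deit_small_patch16_224-cd65a155.pth",
    "data/ade/ADEChallengeData2016",
]

if __name__ == "__main__":
    # "." assumes the current directory is the mmsegmentation repo root.
    for path in missing_paths(".", EXPECTED):
        print(f"missing: {path}")
```

If the script prints nothing, both paths exist. It does not confirm that the checkpoint weights were actually loaded into the backbone during training.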

This is the full config for my run.

Thanks

open-mmlab / mmsegmentation

Can't reproduce model zoo results for upernet_deit-s16_512x512_80k_ade20k #1933