open-mmlab / mmsegmentation

OpenMMLab Semantic Segmentation Toolbox and Benchmark.
https://mmsegmentation.readthedocs.io/en/main/
Apache License 2.0

Low evaluation scores on pre-trained models #1322

Closed: A-guridi closed this issue 2 years ago

A-guridi commented 2 years ago

Describe the issue

I am trying to replicate the evaluation results of several models on different datasets (all supported by MMSegmentation), but I always get really low mIoU scores (~9.5 mIoU), even though the predictions look good when plotted.

I have implemented some custom wrappers around MMSegmentation, but I left its functionality untouched and use the recommended APIs and classes exactly as in the tutorials.

Reproduction

  1. What command or script did you run?

I am running the custom script below; some of the variables are stored in a general-purpose class for convenience. val_dataset is actually the test dataset, and self.cfg is the Config object for the corresponding dataset. The checkpoint and config are resolved from the model's YAML metafile (e.g. configs/segformer/segformer.yaml): the weights are downloaded from the listed URL and the .py config file is taken directly from the repository.

The model is created directly with the init_segmentor() function with the same config and the checkpoint path.
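For reference, that step looks roughly like this (a sketch; config_file and checkpoint_file stand in for the resolved config path and the downloaded weights):

from mmseg.apis import init_segmentor

# config_file / checkpoint_file are placeholders for the resolved paths
self.model = init_segmentor(config_file, checkpoint_file, device='cuda:0')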

from mmcv.parallel import MMDataParallel
from mmseg.apis import single_gpu_test
from mmseg.datasets import build_dataloader

data_loader = build_dataloader(self.val_dataset[0],
                               samples_per_gpu=self.cfg.data.samples_per_gpu,
                               workers_per_gpu=self.cfg.data.workers_per_gpu,
                               dist=self.multiple_gpu)
model = MMDataParallel(self.model, device_ids=self.cfg.gpu_ids)
results = single_gpu_test(model, data_loader, pre_eval=True)
eval_results = self.val_dataset[0].evaluate(results)  # defaults to metric='mIoU'
print("Final Evaluation Results", eval_results)

No errors or warnings come out during dataset/model building or testing.
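(For comparison, the stock evaluation entry point, which as far as I understand builds its test loader without shuffling, would be run like this; the config and checkpoint paths are placeholders for the downloaded files:)

python tools/test.py configs/segformer/segformer_mit-b1_8x1_1024x1024_160k_cityscapes.py \
    /path/to/segformer_checkpoint.pth --eval mIoU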

  2. What config did you run?

Different configs, such as:

segformer_mit-b1_8x1_1024x1024_160k_cityscapes
fcn_hr18_512x1024_40k_cityscapes
fcn_hr48_512x512_80k_potsdam
  3. Did you make any modifications to the code or config? Did you understand what you have modified?

I have only changed samples_per_gpu, workers_per_gpu and the data_root paths in the config files; nothing else.
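Concretely, the overrides are along these lines (a sketch, done via mmcv's Config for illustration; the data_root value is a placeholder):

from mmcv import Config

cfg = Config.fromfile('configs/segformer/segformer_mit-b1_8x1_1024x1024_160k_cityscapes.py')
cfg.data.samples_per_gpu = 2                      # batch size per GPU
cfg.data.workers_per_gpu = 2                      # loader workers per GPU
cfg.data.test.data_root = '/path/to/cityscapes'   # placeholder path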

  4. What dataset did you use?

Cityscapes and Potsdam mostly

Environment

{'sys.platform': 'linux', 'Python': '3.9.10 | packaged by conda-forge | (main, Feb 1 2022, 21:24:11) [GCC 9.4.0]', 'CUDA available': True, 'GPU 0': 'Quadro K2200', 'CUDA_HOME': '/usr/local/cuda', 'NVCC': 'Build cuda_11.3.r11.3/compiler.29745058_0', 'GCC': 'gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0', 'PyTorch': '1.10.2', 'PyTorch compiling details': 'PyTorch built with:\n - GCC 7.3\n - C++ Version: 201402\n - Intel(R) oneAPI Math Kernel Library Version 2022.0-Product Build 20211112 for Intel(R) 64 architecture applications\n - Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)\n - OpenMP 201511 (a.k.a. OpenMP 4.5)\n - LAPACK is enabled (usually provided by MKL)\n - NNPACK is enabled\n - CPU capability usage: AVX2\n - CUDA Runtime 11.3\n - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37\n - CuDNN 8.2\n - Magma 2.5.2\n - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, \n', 'TorchVision': '0.11.3', 'OpenCV': '4.5.5', 'MMCV': '1.4.4', 'MMCV Compiler': 'GCC 7.3', 'MMCV CUDA Compiler': '11.3', 'MMSegmentation': '0.21.1+bf80039'}

Results

The weird thing is that when I plot the network outputs they look almost identical to the ground truths, which leads me to think the loaded models are indeed running inference correctly (the images are also loaded from the same data_loader I use for evaluation). The error must therefore be somewhere in the evaluation of those results, but from the few lines I wrote I don't see where the mistake could be.
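(Something along these lines is enough to reproduce the plots; a sketch using mmseg's show_result_pyplot rather than my exact code, with img_path a placeholder for one of the test images:)

from mmseg.apis import inference_segmentor, show_result_pyplot

# img_path is a placeholder for a single test image
result = inference_segmentor(self.model, img_path)
show_result_pyplot(self.model, img_path, result)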

I have also tried training these models, and their validation scores are just as low, so I don't know whether I am loading the models correctly for inference while the evaluation itself is broken.

Note: the mIoU is printed as 9.5 (percent) and then printed again as an absolute value (0.095).

MengzhangLI commented 2 years ago

Hi, I think it is most probably caused by the different number of GPUs you used, i.e., a different total batch size.

As for SegFormer, the default number of GPUs is 8, so if you only use one GPU you should make the batch size 8 times larger than the default setting (if you have enough GPU memory).
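For example, for an ..._8x1_... config (8 GPUs x 1 image per GPU), a rough single-GPU adjustment would be:

# keep the total batch size at 8 x 1 = 8 when running on a single GPU
cfg.data.samples_per_gpu = 8   # default in the 8x1 config is 1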

A-guridi commented 2 years ago

Hello,

I found out that build_dataloader sets shuffle=True by default, so the predictions were no longer aligned with the ground-truth order and the evaluation scores were meaningless. After setting shuffle=False, the evaluation produces results close to the expected ones.
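For anyone hitting the same problem, the corrected call is simply (a sketch of my fix):

# disable shuffling so predictions line up with the ground-truth order
data_loader = build_dataloader(self.val_dataset[0],
                               samples_per_gpu=self.cfg.data.samples_per_gpu,
                               workers_per_gpu=self.cfg.data.workers_per_gpu,
                               dist=self.multiple_gpu,
                               shuffle=False)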

The variation due to batch size and GPU count does have an impact on performance, but not one as large as the drop I was seeing.

Anyway, thank you for your help and support!