lorenmt / mtan

The implementation of "End-to-End Multi-Task Learning with Attention" [CVPR 2019].
https://shikun.io/projects/multi-task-attention-network
MIT License

Unstable evaluation results on NYUv2 #29

Closed: Manchery closed this issue 4 years ago

Manchery commented 4 years ago

Hello, sorry to bother you. I ran the same code several times but got unstable evaluation results. Here are the numbers:

                 MEAN_IOU  PIX_ACC   ABS_ERR   REL_ERR   MEAN      MED       <11.25   <22.5    <30
semantic, run 1  0.1693    0.548
semantic, run 2  0.178     0.5685
depth, run 1                         0.6336    0.2739
depth, run 2                         0.6398    0.2585
normal, run 1                                            31.8452   26.0211   0.2161   0.4419   0.5633
normal, run 2                                            30.3995   23.9483   0.2406   0.4779   0.5975
normal, run 3                                            29.788    23.3448   0.2453   0.4861   0.6058
split, run 1     0.1858    0.5571    0.6323    0.288     32.4637   27.4159   0.2028   0.4184   0.5412
split, run 2     0.164     0.5255    0.6479    0.3159    33.8992   28.8828   0.1806   0.395    0.5174

Details:

  1. The results above are averages over the last 10 epochs.
  2. I used mostly default options, i.e. I ran:

python model_segnet_single.py --task [task] --dataroot nyuv2

and

python model_segnet_split.py --dataroot nyuv2

  3. I made some modifications to the code, mainly two:
    • moved scheduler.step() as described in the README (see the sketch after this list)
    • changed the batch size from 2 to 8
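
For reference, a minimal sketch of the scheduler.step() placement the README asks for: since PyTorch 1.1, the scheduler must be stepped after the optimizer, once per epoch. The model, data, and schedule below are placeholders of my own, not the repo's actual training loop.

import torch
import torch.nn as nn

# Hypothetical stand-ins; the real SegNet model and NYUv2 loaders come from the repo.
model = nn.Linear(8, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.5)
criterion = nn.MSELoss()
data = [(torch.randn(4, 8), torch.randn(4, 1)) for _ in range(10)]

for epoch in range(200):
    for x, y in data:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()   # weight update first
    scheduler.step()       # then the LR schedule, once per epoch, after optimizer.step()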

UPDATE

To make my claim more convincing, here are the results from the unmodified code (except for the location of scheduler.step()):

              ABS_ERR  REL_ERR
depth, run 1  0.6615   0.2859
depth, run 2  0.7053   0.2996
lorenmt commented 4 years ago

Hi,

Thanks for your experiments and detailed analysis.

Yes, NYUv2 is quite a complicated dataset with a small number of samples, and that is one of the major reasons for such oscillation in the final results. I expect you would observe much smaller oscillation in the final performance on the CityScapes dataset, since it is easier and contains more samples.

What I can suggest is:

  1. Try the benchmarking technique I suggested in the README: relative performance improvement over single-task learning, i.e. each task's best performance in multi-task learning across all validation epochs, divided by the single-task validation performance. It should reduce such uncertainty (see the sketch after this list).

  2. Simply run 3 or more times and report the averaged performance.
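
For concreteness, a sketch of the per-task relative improvement this benchmark builds on (the helper name and sign convention are my assumptions; the exact definition is in the paper linked in the README). Lower-is-better metrics are sign-flipped, so a positive delta always means MTL beats the single-task baseline.

def relative_improvement(mtl, stl, lower_is_better):
    """Signed relative improvement of an MTL metric over its single-task value."""
    sign = -1.0 if lower_is_better else 1.0
    return sign * (mtl - stl) / stl

# e.g. depth ABS_ERR (lower is better): split run 1 (0.6323) vs single-task run 1 (0.6336)
print(relative_improvement(0.6323, 0.6336, lower_is_better=True))  # ~ +0.002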

Hope that helps. Sk.

Manchery commented 4 years ago

Hi,

Thanks for your suggestion. I agree that this is caused by the small number of samples, and I will simply run several times and report the average performance.

But for the recommended benchmark, relative improved performance, do you mean $\max_{\text{epoch}} \operatorname{avg}_{\text{task}}(\text{relative improvement})$ or $\operatorname{avg}_{\text{task}} \max_{\text{epoch}}(\text{relative improvement})$? I think it should be the former. The latter is less vulnerable to uncertainty, but I think it violates the spirit of MTL: one purpose of MTL is to reduce inference time, so we should use a single model to predict all tasks.

lorenmt commented 4 years ago

Yes, you should apply the first version; otherwise it would break the advantage of doing multi-task learning. For a more detailed explanation of this method, I would suggest taking a look at the paper I linked in the README.
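
To make the two readings concrete, here is a toy comparison with made-up delta values (nothing below comes from the repo). The first version picks the single epoch whose task-averaged improvement is best, so one checkpoint serves all tasks; the alternative averages each task's own best epoch, which implicitly evaluates a different checkpoint per task.

def first_version(deltas_by_task):
    """Max over epochs of the task-averaged relative improvement."""
    n_epochs = len(next(iter(deltas_by_task.values())))
    epoch_means = [sum(d[e] for d in deltas_by_task.values()) / len(deltas_by_task)
                   for e in range(n_epochs)]
    return max(epoch_means)

def per_task_best(deltas_by_task):
    """Average over tasks of each task's best epoch (breaks the one-model property)."""
    return sum(max(d) for d in deltas_by_task.values()) / len(deltas_by_task)

deltas = {"semantic": [0.01, 0.03, 0.02], "depth": [0.04, 0.00, 0.02]}
print(first_version(deltas))   # 0.025: epoch 1 is jointly best, one checkpoint for everything
print(per_task_best(deltas))   # 0.035: mixes semantic's epoch 2 with depth's epoch 1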

Manchery commented 4 years ago

I think I made a mathematical mistake: the first one is actually less vulnerable to uncertainty. Forget what I said.

Thank you for your patient replies.

Best wishes.