lorenmt / mtan

The implementation of "End-to-End Multi-Task Learning with Attention" [CVPR 2019].
https://shikun.io/projects/multi-task-attention-network
MIT License

results inconsistency #56

Closed NoaGarnett closed 2 years ago

NoaGarnett commented 2 years ago

Hi, many thanks for the paper and for the very readable code! I tried to reproduce two rows of Table 3 in the paper using your code, following the instructions in the README. The results I got are not consistent with those in the paper (not sure how to attach, so attaching as a spreadsheet and as an image): Table_3 mtan_res.ods

All numbers are averages of the validation results over the last 10 epochs, as printed by the code. How can such a discrepancy be explained?
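(For reference, a minimal, hypothetical sketch of the averaging I did; the file name and format are assumptions, not something produced by the repo:)

    import numpy as np

    # Hypothetical: one validation metric value per epoch, parsed from the training log.
    val_metric = np.loadtxt('val_miou_per_epoch.txt')
    print('average over the last 10 epochs:', val_metric[-10:].mean())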

lorenmt commented 2 years ago

Hello,

Does this inconsistency occur in other networks as well?

At the moment, I would suggest following the stronger baseline implementations that apply data augmentation on NYUv2, use a ResNet-50 backbone, and benchmark with average relative performance -- these are now the standard in MTL research.
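The augmentation these baselines use is typically a random scaled crop plus horizontal flip applied jointly to the image and all label maps. A minimal sketch of that idea (the function name, scale choices, and depth rescaling are assumptions, not necessarily what this repo or the other baselines do):

    import random
    import torch
    import torch.nn.functional as F

    def augment_nyuv2(img, depth, normal, seg, scales=(1.0, 1.2, 1.5)):
        """Jointly augment one NYUv2 sample: random scale, crop back to the
        original size, and random horizontal flip.  Shapes assumed:
        img [3,H,W], depth [1,H,W], normal [3,H,W], seg [H,W] (long)."""
        _, h, w = img.shape
        s = random.choice(scales)
        nh, nw = int(h * s), int(w * s)

        # Resize every map; nearest for the label map so class ids stay valid.
        img = F.interpolate(img[None], (nh, nw), mode='bilinear', align_corners=True)[0]
        depth = F.interpolate(depth[None], (nh, nw), mode='nearest')[0] / s  # keep depth consistent with the zoom
        normal = F.interpolate(normal[None], (nh, nw), mode='bilinear', align_corners=True)[0]
        seg = F.interpolate(seg[None, None].float(), (nh, nw), mode='nearest')[0, 0].long()

        # Random crop back to the original (h, w).
        top, left = random.randint(0, nh - h), random.randint(0, nw - w)
        img = img[:, top:top + h, left:left + w]
        depth = depth[:, top:top + h, left:left + w]
        normal = normal[:, top:top + h, left:left + w]
        seg = seg[top:top + h, left:left + w]

        # Random horizontal flip; the x-component of the normals must be negated.
        if random.random() < 0.5:
            img, depth, seg = (torch.flip(t, dims=[-1]) for t in (img, depth, seg))
            normal = torch.flip(normal, dims=[-1])
            normal[0] = -normal[0]

        return img, depth, normal, seg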

The results in my original paper are a bit outdated.

NoaGarnett commented 2 years ago

Thanks. I repeated the experiments with augmentation and indeed got better results for both Split, Wide and MTAN, but again the consistent improvement from MTAN was not reproduced (surface normals are better in this case, but not the other two tasks):

Table_3_2 mtan_res.ods

The fact that better performance can be achieved and that the paper results are outdated does not bother me - I'm not interested in results on NYUv2, but on my own data and tasks, where I implemented MTAN and got results similar to the vanilla split. Do you think that using ResNet-50 + average relative performance on NYUv2 I will be able to see MTAN's advantage?

Many thanks again for your time and for the quick response,

Noa Garnett

lorenmt commented 2 years ago

I agree that the performance of MTAN varies with the task/data and other optimisation techniques. But from my experience, and from the feedback in other papers, MTAN should produce a consistent improvement over Split by a non-trivial margin.

The average relative performance is simply a metric that summarises multi-task performance in a single value. But yes, I would first build MTAN on top of ResNet-50 as a start, since it's a much stronger backbone after all.
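For reference, a common definition of this metric in the MTL literature averages the signed relative change of each metric with respect to the single-task baseline, flipping the sign for metrics where lower is better. A minimal sketch (the function name and example numbers are hypothetical):

    def average_relative_performance(mtl_metrics, stl_metrics, lower_is_better):
        """Average relative multi-task performance (in %) vs. single-task
        baselines: delta_i = (-1)^{l_i} * (MTL_i - STL_i) / STL_i, averaged
        over all metrics, with l_i = 1 when lower is better for metric i."""
        deltas = [(-1.0 if low else 1.0) * (m - s) / s
                  for m, s, low in zip(mtl_metrics, stl_metrics, lower_is_better)]
        return 100.0 * sum(deltas) / len(deltas)

    # Made-up numbers, for illustration only:
    # mIoU (higher better), abs. depth error (lower better), mean normal angle (lower better)
    print(average_relative_performance([0.40, 0.55, 23.0], [0.38, 0.60, 24.5], [False, True, True]))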

lorenmt commented 2 years ago

Besides, honestly speaking, this is the first time I have heard anyone claim that MTAN is worse than the Split baseline. Though the numbers you attached look reasonable, this is definitely not common.

NoaGarnett commented 2 years ago

Thank you again. This is really strange - I used your code as is. Anyway, I'll check with ResNet-50 and update. Regards, Noa Garnett

NoaGarnett commented 2 years ago

Hi again, I'm training NYUv2 using ResNet-50, with and without MTAN, and have a question regarding the decoder implementation in MTANDeepLabv3, specifically ASPPPooling. The module first performs global average pooling, which reduces each input to a single value per channel; then a Conv2d, which is effectively a linear ("fully-connected") layer since no spatial information remains; then BatchNorm2d. The input to BatchNorm2d has no spatial dimensions, so the number of values normalized per channel equals the batch size. If we borrow the hyper-parameters from the SegNet training script, where the batch size is 2, we are normalizing over only two values. Does that make sense? I enlarged the batch size to 4 for the MTANDeepLabv3 training (and kept the other parameters - learning rate, scheduling, number of epochs - as before).
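For context, the pooling branch I'm referring to looks roughly like the torchvision-style DeepLabv3 one below (a sketch based on that codebase; MTANDeepLabv3 may differ in details):

    from torch import nn
    from torch.nn import functional as F

    class ASPPPooling(nn.Sequential):
        """Image-level pooling branch of ASPP.  After AdaptiveAvgPool2d(1) the
        feature map is 1x1, so the 1x1 conv acts as a fully-connected layer and
        BatchNorm2d sees exactly batch_size values per channel."""
        def __init__(self, in_channels, out_channels):
            super().__init__(
                nn.AdaptiveAvgPool2d(1),                            # [B,C,H,W] -> [B,C,1,1]
                nn.Conv2d(in_channels, out_channels, 1, bias=False),
                nn.BatchNorm2d(out_channels),                       # normalises over the batch only
                nn.ReLU())

        def forward(self, x):
            size = x.shape[-2:]
            for mod in self:
                x = mod(x)
            # broadcast the pooled vector back to the input resolution
            return F.interpolate(x, size=size, mode='bilinear', align_corners=False)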

lorenmt commented 2 years ago

I am not exactly sure what your concern is here. The batch norm should work similarly to the one in SegNet; the ASPP module simply aggregates information from different spatial scales with conv layers built with different dilation rates. And yes, you can change the batch size to any number you want.
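For illustration, an ASPP head in that style combines a 1x1 branch, several dilated 3x3 branches, and the global-pooling branch sketched in the previous comment, then fuses them (a hedged sketch following the torchvision DeepLabv3 design; the dilation rates and channel counts are assumptions):

    import torch
    from torch import nn

    class ASPPConv(nn.Sequential):
        """One 3x3 ASPP branch with a given dilation rate."""
        def __init__(self, in_channels, out_channels, dilation):
            super().__init__(
                nn.Conv2d(in_channels, out_channels, 3, padding=dilation,
                          dilation=dilation, bias=False),
                nn.BatchNorm2d(out_channels),
                nn.ReLU())

    class ASPP(nn.Module):
        """Concatenate a 1x1 branch, several dilated 3x3 branches and the
        global-pooling branch, then fuse the result with a 1x1 projection."""
        def __init__(self, in_channels, out_channels=256, rates=(12, 24, 36)):
            super().__init__()
            branches = [nn.Sequential(nn.Conv2d(in_channels, out_channels, 1, bias=False),
                                      nn.BatchNorm2d(out_channels), nn.ReLU())]
            branches += [ASPPConv(in_channels, out_channels, r) for r in rates]
            branches += [ASPPPooling(in_channels, out_channels)]  # from the sketch in the previous comment
            self.branches = nn.ModuleList(branches)
            self.project = nn.Sequential(
                nn.Conv2d(len(branches) * out_channels, out_channels, 1, bias=False),
                nn.BatchNorm2d(out_channels),
                nn.ReLU(),
                nn.Dropout(0.5))

        def forward(self, x):
            return self.project(torch.cat([b(x) for b in self.branches], dim=1))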

NoaGarnett commented 2 years ago

Results using ResNet-50 (evaluation method unchanged) are attached. Table_3_3 mtan_res.ods

Regards, Noa

lorenmt commented 2 years ago

Did you use pre-trained ImageNet features? The hyper-parameters were designed for non-pre-trained features.

NoaGarnett commented 2 years ago

I initialized with the torchvision resnet50 weights, as in the code.

See line 14 in https://github.com/lorenmt/mtan/blob/master/im2im_pred/model_resnet_mtan/resnet_mtan.py

    backbone = ResnetDilated(resnet.__dict__['resnet50'](pretrained=True))
lorenmt commented 2 years ago

Yes, sorry. You should turn that off.
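That is, roughly the same line with the pretrained flag flipped:

    # initialise the ResNet-50 backbone from scratch instead of ImageNet weights
    backbone = ResnetDilated(resnet.__dict__['resnet50'](pretrained=False))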

NoaGarnett commented 2 years ago

OK, I'll turn it off and retry. Going on vacation - will update in 10 days. Many thanks.

lorenmt commented 2 years ago

By the way, how did you design the Split, Wide variant for ResNet-50?

lorenmt commented 2 years ago

To follow up on this:

As your claim is quite serious, I have rerun the experiments, and I cannot reproduce your results.

My results for Split, Wide [last 10 epochs], training script:

    python3 model_segnet_split.py --gpu 0 --apply_augmentation --type wide

[results screenshot attached]

For MTAN [last 10 epochs], training script:

    python3 model_segnet_mtan.py --gpu 0 --apply_augmentation

[results screenshot attached]

Both networks are built on top of SegNet, with augmentation, and with no modification to the hyper-parameters. MTAN clearly outperforms Split, Wide by a large margin in all 3 tasks.

NoaGarnett commented 2 years ago

Referring to your question from Jan 13: indeed I was wrong. The ResNet-50 experiment should not have been labeled Split, Wide; I mistakenly copied the label. It should be Split, Standard, since I did not change the number of parameters in the backbone.

As for the discrepancy in results: a mystery. I asked a friend to try reproducing the experiments independently. Hopefully he will get your results, and then we can look carefully for the problem in my setup. Will let you know.

lorenmt commented 2 years ago

Any new updates?

NoaGarnett commented 2 years ago

Yes. My friend got results similar to mine. We are preparing a Google Colab notebook showing the full process we followed, starting from git clone, data download, and training. We will also share the versions of the relevant Python packages. Once we're done preparing this notebook (training is currently running) we'll share it here, hoping to find the source of the discrepancy. Basically, we get results similar to yours when running MTAN, and significantly better than yours (similar to the MTAN results, except on the surface normals task) with the Split, Wide config.

NoaGarnett commented 2 years ago

Hi again. We created two Colab notebooks, each showing the entire process, starting from git clone, together with the training results. Here is the first, starting with the MTAN training and then the <Split, Wide> config. Here is the second, starting with <Split, Wide> and then running MTAN.

Both configs have some variance in their results, and in general the MTAN results are not better than <Split, Wide>.

If you want write permission to these notebooks, so that you can experiment with them yourself, please let me know.

Please also let us know which versions of the relevant packages you are using; I can think of no other source for the discrepancy.

Regards, Noa

lorenmt commented 2 years ago

Thanks for spending time on this. My training results for MTAN and Split, Wide are also attached above.

But I have reproduced your results on another machine.

It seems like the performance for Split Wide is not stable.

I wouldn't suggest spending additional time on this... as the performance will also depend on the dataset and other hyper-parameters.

Let me know if you have further questions.