thuml / Self-Tuning

Code release for "Self-Tuning for Data-Efficient Deep Learning" (ICML 2021)

Huge performance gap between the reported and reproduced numbers for the Fine-Tuning method #6

Open HeimingX opened 2 years ago

HeimingX commented 2 years ago

Hi,

Thanks for the interesting work and sharing the code.

Recently, I reproduced the Fine-Tuning baseline from the released Self-Tuning code (I simply removed the unlabeled and contrastive parts and kept the same optimizer hyperparameters and schedule; a minimal sketch of what I ran is included after the table). The reproduced results are as follows (all experiments use the 15% label-proportion setting):

| Dataset | FT-reported | FT-reproduced |
| --- | --- | --- |
| CUB | 45.25 | 48.43 |
| Stanford Cars | 36.77 | 53.09 |
| FGVC Aircraft | 39.57 | 53.65 |
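
For reference, this is roughly what my reproduction boils down to. It is only a sketch under my own assumptions (the class names, learning rates, and weight decay below are placeholders), not a copy of the released code:

```python
# Minimal sketch of the stripped-down fine-tuning baseline described above,
# assuming a torchvision ResNet-50; names, learning rates, and the layer-wise
# lr grouping are my own simplifications, not the exact Self-Tuning code.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class FineTuneModel(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        backbone = models.resnet50(pretrained=True)            # ImageNet-pretrained backbone
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        self.classifier = nn.Linear(backbone.fc.in_features, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

def train_step(model, optimizer, images, labels):
    # Only the supervised cross-entropy term is kept; the unlabeled branch and
    # the contrastive parts of Self-Tuning are removed.
    optimizer.zero_grad()
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

device = "cuda" if torch.cuda.is_available() else "cpu"
model = FineTuneModel(num_classes=200).to(device)              # e.g. 200 classes for CUB
optimizer = torch.optim.SGD(
    [{"params": model.features.parameters(), "lr": 1e-3},      # smaller lr for the backbone
     {"params": model.classifier.parameters(), "lr": 1e-2}],   # larger lr for the new head
    momentum=0.9, weight_decay=5e-4, nesterov=True)
```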

As the table shows, there is a huge gap between the reported numbers and the reproduced ones. Moreover, some of my reproduced Fine-Tuning numbers are even better than the reported numbers for the SSL methods. As shown in the table below, this gap seems unreasonable, since the SSL methods additionally exploit a large amount of unlabeled samples.

| Dataset | FT-reproduced | PI-model | Pseudo-Labeling | UDA | FixMatch |
| --- | --- | --- | --- | --- | --- |
| CUB | 48.43 | 45.20 | 45.33 | 46.90 | 44.06 |
| Stanford Cars | 53.09 | 45.19 | 40.93 | 39.90 | 49.86 |
| FGVC Aircraft | 53.65 | 37.32 | 46.83 | 43.96 | 55.53 |

So I am really wondering how you trained the baseline methods to get the reported numbers.

wxm17 commented 2 years ago

Thanks for paying attention to Self-Tuning.

As for the fine-tuning baselines, we reproduced them using the source code of Co-Tuning (https://github.com/thuml/CoTuning). Since almost all fine-tuning baselines are reproducible with the Co-Tuning code, we directly used the numbers reported in Table 2 of the Co-Tuning paper (https://papers.nips.cc/paper/2020/file/c8067ad1937f728f51288b3eb986afaa-Paper.pdf).

I will try to figure out the exact reasons. At this point, my guess is that the gap between the reported and reproduced numbers comes from different data augmentation methods. Following the source code of FixMatch, we used RandAugment, which is also used by FixMatch and some of the SSL baselines. If you simply delete the unlabeled and contrastive parts, you end up training the fine-tuning baseline with the stronger RandAugment pipeline rather than the normal data augmentation defined in the original fine-tuning methods (see the sketch below).
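
To make the difference concrete, here is a rough sketch of the two labeled-data pipelines, using torchvision's RandAugment as a stand-in for the FixMatch-style implementation; the exact crop sizes and magnitudes may differ from the released config:

```python
# Illustrative comparison of "normal" vs. RandAugment pipelines; the concrete
# parameters here are assumptions, not the exact values used in the experiments.
from torchvision import transforms

# "Normal" augmentation typically used by the original fine-tuning baselines
normal_aug = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Stronger RandAugment pipeline (as in FixMatch/UDA and Self-Tuning)
strong_aug = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandAugment(num_ops=2, magnitude=9),   # requires torchvision >= 0.11
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```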

As noted in the Baselines part of Section 5, "FixMatch, UDA, and Self-Tuning use the same RandAugment method, while other baselines use normal ones." Since data augmentation is a main contribution of some SSL baselines, we did not force all baselines to use the same augmentation.

HeimingX commented 2 years ago

Thanks for the prompt response!

The explanation for the gap between the reported and reproduced Fine-Tuning results sounds reasonable.

However, I am still unconvinced by the gap in the second table (FT-reproduced vs. the SSL baselines). To my understanding, the only difference between FT-reproduced and SSL methods such as FixMatch and UDA is the use of unlabeled samples. If that is the case, it would mean that unlabeled samples (drawn from the same label space) hurt learning or optimization, which would need to be verified carefully. So I wonder whether you could also release the code for the baseline methods to support all the reported numbers. Thanks a lot.