thuml / Transfer-Learning-Library

Transfer Learning Library for Domain Adaptation, Task Adaptation, and Domain Generalization
http://transfer.thuml.ai
MIT License

In DST, why are confidence thresholds different among baselines? #222

Closed machengcheng2016 closed 10 months ago

machengcheng2016 commented 1 year ago

Greetings! I've been studying your wonderful work DST recently. I notice that you set different confidence thresholds for the various baseline methods, which seems like an unfair experimental setting. I wonder why? Thanks!

thucbx99 commented 1 year ago

Hello, this is because we find that most baseline methods are sensitive to the choice of confidence threshold. For example, with the threshold set to 0.7, FixMatch fails to improve over labeled-only training on several datasets due to error accumulation in pseudo-labeling. So for each baseline method on each dataset, we search for the optimal threshold to enable a fair comparison. In my opinion, this is similar to fairly comparing ViT with ResNet: the learning rates can differ because the optimal choice differs for each model.
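
(For context, a minimal sketch of how a confidence threshold gates pseudo-labels in FixMatch-style training; this is an illustration rather than the library's implementation, and the function and tensor names are made up.)

```python
import torch
import torch.nn.functional as F

def fixmatch_unlabeled_loss(logits_weak, logits_strong, threshold=0.95):
    """Unsupervised loss: only unlabeled samples whose weak-augmentation
    prediction exceeds `threshold` contribute, so the threshold directly
    controls the trade-off between pseudo-label coverage and noise."""
    with torch.no_grad():
        probs = F.softmax(logits_weak, dim=-1)
        confidence, pseudo_labels = probs.max(dim=-1)
        mask = (confidence >= threshold).float()  # confidence gate
    per_sample = F.cross_entropy(logits_strong, pseudo_labels, reduction="none")
    return (per_sample * mask).mean()
```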

thucbx99 commented 1 year ago

Besides, we find DST is less sensitive to this choice. Hope this answers your question.

machengcheng2016 commented 1 year ago

Thanks for your reply! On CIFAR-10, with a supervised pre-trained model, I got the following FixMatch performance:

threshold | acc
0.70      | 74.7
0.80      | 84.7
0.90      | 74.7
0.95      | 66.7

I am sure I only changed the threshold across these four runs. I wonder why 0.8 leads to about 10% higher accuracy. I understand your reply that DST is less sensitive to the threshold setting, since the backbone weights are initialized identically. But do you think the initialization of the last few layers (the classification head) can make such a difference in performance?
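
(A sketch of the kind of sweep described above, assuming a hypothetical `train_fixmatch` helper that trains with a fixed seed and returns test accuracy; only the confidence threshold varies between runs. This is not an API of this library.)

```python
# Hypothetical sweep: everything fixed except the confidence threshold.
for tau in (0.70, 0.80, 0.90, 0.95):
    acc = train_fixmatch(dataset="CIFAR-10", pretrained=True, threshold=tau, seed=0)
    print(f"threshold={tau:.2f}  top-1 acc={acc:.1f}")
```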

thucbx99 commented 1 year ago

Sorry for the late reply. The confidence threshold might be the most important hyperparameter for SSL methods. Taking your example here, selecting the optimal threshold (0.8, 84.7%) versus a suboptimal one (0.95, 66.7%) can have a great impact on the final performance, especially when there are only a few labeled samples (40 for CIFAR-10). For your last question, this phenomenon is aligned with our finding that the classifier head (the last few layers) is likely to accumulate pseudo-labeling error. Through backpropagation, this error can also reach the backbone parameters, which gradually leads to a large difference in performance.
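
(To make the error-propagation point concrete, here is a minimal, illustrative self-training step, not DST's actual code: the same head both produces the pseudo-labels and is trained on them, so confident mistakes are reinforced in the head and, through the backward pass, in the backbone as well.)

```python
import torch
import torch.nn.functional as F

def self_training_step(backbone, head, x_unlabeled, optimizer, threshold=0.95):
    """One pseudo-labeling step where the classifier head labels its own inputs."""
    feats = backbone(x_unlabeled)
    logits = head(feats)
    with torch.no_grad():
        probs = F.softmax(logits, dim=-1)
        confidence, pseudo_labels = probs.max(dim=-1)
        mask = (confidence >= threshold).float()
    # Wrong but confident pseudo-labels still pass the gate and enter the loss ...
    loss = (F.cross_entropy(logits, pseudo_labels, reduction="none") * mask).mean()
    optimizer.zero_grad()
    loss.backward()  # ... and their gradients update both the head and the backbone.
    optimizer.step()
    return loss.item()
```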