Hi, thank you for your interest in our work.
They have different setups. We adopt the fair comparison protocol from DomainBed for the main experiments, and the same setup is used for the Camelyon17 experiments. The results that you referenced appear to use the WILDS protocol.
We will update the description to be clearer in the next revision. Thanks for pointing it out.
For reference, we find that the "Towards Principled Disentanglement for Domain Generalization" paper also adopts the DomainBed protocol, but they use oracle model selection, as you mentioned. There, you can find the performance difference between training-domain (in-domain) and oracle model selection in the ERM results (94.9% vs. 95.6%).
I was aware of that reference. However, as you pointed out, they use the oracle selection method. Since you use the leave-one-out method, I was surprised that the difference is so small (94.9% vs. 95.6%).
Now I understand which part is causing the confusion. For the model selection, we use training-domain validation. In our paper, "leave-one-out cross-validation" indicates the evaluation protocol, not the model selection. It is a widely used name for the evaluation protocol, but the confusion seems to arise because DomainBed used a similar name for the model selection method.
In addition, with the training-domain validation method, the difference (+0.7pp) is not that surprising; the differences are +1.2pp, +0.1pp, -0.1pp, and +0.4pp on the PACS, VLCS, OfficeHome, and DomainNet benchmarks, respectively.
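To make the terminology concrete, here is a minimal sketch of how we use the two terms (the names are hypothetical and `train_and_eval` is a stand-in for the actual training code, not DomainBed's API): the outer loop is the leave-one-out *evaluation* protocol, while model selection inside each case uses validation accuracy from the training domains only.

```python
def run_benchmark(domains, train_and_eval, hparam_candidates):
    """Average test accuracy over all leave-one-out splits of `domains`."""
    test_accs = []
    for test_domain in domains:                       # evaluation protocol:
        train_domains = [d for d in domains           # each domain is held out
                         if d != test_domain]         # once as the test domain
        best_test, best_val = None, float("-inf")
        for hparams in hparam_candidates:             # model selection:
            # validation accuracy comes from held-out splits of the TRAINING
            # domains only (training-domain validation); test_domain is never
            # used for selection.
            val_acc, test_acc = train_and_eval(train_domains, test_domain, hparams)
            if val_acc > best_val:
                best_val, best_test = val_acc, test_acc
        test_accs.append(best_test)
    return sum(test_accs) / len(test_accs)
```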
We will update the description of this part also in the next revision. Thanks :)
In this case, the results for ERM are even stranger. If you do model selection based on training domains (i.e., ID) then why would this model perform well on a test domain (OOD)? Please double-check that test-domain data doesn't leak into training and/or model selection.
It is not a strange result; see the DomainBed paper (https://openreview.net/forum?id=lQdXeXDoWtI). If you still think the results are strange, the DomainBed repository is a better place to discuss that.
Unfortunately, the DomainBed paper isn't that helpful because it shows no Camelyon17-WILDS results. The WILDS paper shows a gap of 22.9pp between ID and OOD accuracy using a separate OOD validation set for model selection (so this is like leave-one-out). Unfortunately, they used DenseNet-121 models as opposed to the ResNet-50 that you used. Still, the papers that I had listed above show a much lower accuracy for ERM than you do while using ResNet-50. Note that I'm talking about ERM results, not your proposed algorithm MIRO.
If you do model selection based on training domains (i.e., ID) then why would this model perform well on a test domain (OOD)?
The DomainBed paper shows that the model selected by training-domain validation performs well on the test domain. As mentioned, the gap between the two model selection methods is very small on every benchmark we used except for TerraIncognita.
The WILDS paper shows a gap of 22.9pp between ID and OOD accuracy using a separate OOD validation set for model selection (so this is like leave-one-out).
The 22.9pp in the WILDS paper is the gap between ID and OOD performance, not the gap between model selection methods. There is no reason the performance difference between the two model selection methods should be large only on the Camelyon17 benchmark.
Still, the papers that I had listed above show a much lower accuracy for ERM than you do while using ResNet-50.
There are many differences between the DomainBed and WILDS protocols. DomainBed averages over all leave-one-out cross-validation cases, while WILDS only reports the case with hospital 5 as the target domain. DomainBed uses all domains except the target domain as training domains, while WILDS uses only three. In addition, there are various differences in details, such as data augmentation and batch construction. Considering only that the DomainBed protocol uses more training domains and that the results are relatively lower with hospital 5 as the target domain (94.9% on average but 90.7% on hospital 5), it is not strange that the DomainBed protocol shows higher accuracy.
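A hedged sketch of this contrast, assuming a hypothetical `evaluate(train, target)` callable that trains ERM on the given hospitals and returns accuracy on the target (the hospital numbering and the exact WILDS training split below are placeholders, not taken from either codebase):

```python
def domainbed_camelyon17(evaluate, hospitals=(1, 2, 3, 4, 5)):
    """DomainBed protocol: each hospital is the target domain once, all the
    remaining hospitals are used for training, and results are averaged."""
    accs = [evaluate(train=[h for h in hospitals if h != t], target=t)
            for t in hospitals]
    return sum(accs) / len(accs)


def wilds_camelyon17(evaluate):
    """WILDS protocol: only hospital 5 is reported as the target domain and
    only three hospitals are used for training (placeholder split; one
    held-out hospital serves as an OOD validation set)."""
    return evaluate(train=[1, 2, 3], target=5)
```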
In summary, there is no evidence that our results are strange. However, we will check once again whether there is any problem, as you suggest. Note that all the code for the ERM experiments uses DomainBed (including Camelyon17), so you can run it and check the code yourself. If you find any issues with the code, please let us know.
We cannot find any issue after checking the code again. As mentioned, please let us know if you find any. IMO, a discussion in the DomainBed repository might be helpful, since we mostly adopted the DomainBed code for the experiment. But still, feel free to discuss here if you want.
Here I found ResNet-50 results for different datasets and algorithms using training-dataset validation, including Camelyon17-WILDS and ERM, that are quite similar to yours. They did not perform a hyperparameter sweep; the values they use are here, with the defaults here. I can see that at least some of them are outside of your sweep range (e.g., the learning rate). They do perform learning rate modulation, so their validation effectively chooses the early-stopping point and the learning rate.
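For illustration only, a small check of the "outside the sweep range" point; the learning rate and the log-uniform bounds below are hypothetical, not taken from either repository:

```python
import math

def in_log_uniform_range(value, low_exp, high_exp):
    """True if `value` lies inside a 10**Uniform(low_exp, high_exp) sweep."""
    return low_exp <= math.log10(value) <= high_exp

# Hypothetical numbers: a learning rate of 1e-3 checked against a sweep of
# 10**Uniform(-5, -3.5) falls outside the sampled range.
print(in_log_uniform_range(1e-3, -5.0, -3.5))  # False
```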
So my conclusion is that it's possible to obtain good results with ERM for some hyperparameter choices. Perhaps DomainBed's sweep range does not include these good choices for Camelyon17. I'd suggest that you include a note/discussion about this in the next revision of the paper to eliminate readers' confusion.
Thanx for your time.
Thank you for finding another piece of evidence that our results are not strange. Note that our results are obtained with the default parameters of DomainBed. The purpose of this experiment is to show that the proposed method (MIRO) works even under a large distribution shift between pre-training and fine-tuning, such as Camelyon17. So I don't think the existence of better-performing HPs is an important enough topic to cover in the main paper. We will consider your suggestion for the appendix. Thanks for the suggestion.
Hello,
The Camelyon17 ERM results in Table 4 seem exceedingly high compared to other papers that used the same setup (i.e., ResNet-50 pretrained on ImageNet), e.g.:
Your ERM results are more in line with oracle-based model selection (e.g., "Towards Principled Disentanglement for Domain Generalization", Table 1 reports 95.6%), which you claim you don't use.
Could you please double check? Thx