Hello,
I recently had the opportunity to read your ERM++ paper, and I'd like to congratulate you on your excellent work. As the author of the DiWA paper, I see a few reasons that could explain why, with your implementation, "DIWA is unable to outperform ERM++".
DiWA requires a shared pre-trained initialization, meaning that all runs should start from the same linear-probed classifier. In contrast, ERM++ performs independent warm-ups.
DiWA's effectiveness stems from diversity across runs, which can be enhanced by using varied hyperparameters. In contrast, ERM++ employs the same default hyperparameters for all runs.
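For illustration only, here is a minimal Python sketch of what I mean by a mild distribution; the specific ranges and hyperparameter names are placeholders, not the values used in either paper:

```python
import random

def sample_mild_hparams(seed):
    """Illustrative 'mild' distribution around the defaults: each run draws
    its own values so the runs stay averageable yet diverse.
    The ranges below are only an example, not a recommendation."""
    rng = random.Random(seed)
    return {
        "lr": 10 ** rng.uniform(-5.0, -4.0),            # around a 5e-5 default
        "weight_decay": 10 ** rng.uniform(-6.0, -4.0),
        "dropout": rng.choice([0.0, 0.1, 0.5]),
        "batch_size": rng.choice([32, 64]),
    }

# One independent draw per run, all starting from the same shared initialization.
hparams_per_run = [sample_mild_hparams(seed) for seed in range(5)]
```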
Introducing additional stochasticity during training can also improve diversity. Thus, DiWA is most effective when used with dropout. In contrast, ERM++ deactivates dropout.
Lastly, a moving average within each run reduces diversity across runs, making it poorly compatible with DiWA. Instead, I would suggest averaging only the final/best weights of each run or, even better, only the top-k best weights from each run.
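Concretely, the averaging I have in mind is just a uniform mean over the final state dict of each run; a minimal PyTorch sketch (the checkpoint paths and surrounding training code are assumed):

```python
import torch

def average_state_dicts(sds):
    """Uniform average of the final checkpoint of each run (DiWA-style),
    rather than a moving average taken within a single run."""
    avg = {}
    for key, ref in sds[0].items():
        if ref.is_floating_point():
            # Average the same tensor across all runs.
            avg[key] = torch.stack([sd[key] for sd in sds]).mean(dim=0)
        else:
            # Integer buffers (e.g. BatchNorm's num_batches_tracked) are kept as-is.
            avg[key] = ref.clone()
    return avg

# Hypothetical usage: load the last checkpoint of each of the 5 runs and average them.
# final_sds = [torch.load(f"run{i}/final.pt", map_location="cpu") for i in range(5)]
# model.load_state_dict(average_state_dicts(final_sds))
```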
It would be helpful to include a simple baseline: DiWA with 5 runs on your train/val splits, using your AugMix initialization, hyperparameters sampled from a mild distribution, dropout enabled, BN unfrozen, averaging only the final weights of each run, and optionally with greedy selection.
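In case it helps, here is a rough sketch of the greedy selection variant, assuming hypothetical `build_model` and `evaluate` helpers that rebuild a model from a state dict and return its validation accuracy:

```python
import torch

def average_state_dicts(sds):
    """Uniform average of loaded state dicts (same helper as in the sketch above)."""
    avg = {}
    for key, ref in sds[0].items():
        if ref.is_floating_point():
            avg[key] = torch.stack([sd[key] for sd in sds]).mean(dim=0)
        else:
            avg[key] = ref.clone()
    return avg

def greedy_selection(state_dicts, build_model, evaluate):
    """Greedy variant: rank runs by individual validation accuracy, then keep a run
    only if adding it to the average does not hurt the validation accuracy of the
    averaged model. `build_model(state_dict)` and `evaluate(model)` are hypothetical
    helpers standing in for your own model construction and validation code."""
    ranked = sorted(state_dicts, key=lambda sd: evaluate(build_model(sd)), reverse=True)
    selected = [ranked[0]]
    best_acc = evaluate(build_model(average_state_dicts(selected)))
    for sd in ranked[1:]:
        acc = evaluate(build_model(average_state_dicts(selected + [sd])))
        if acc >= best_acc:
            selected.append(sd)
            best_acc = acc
    return average_state_dicts(selected)
```

By construction, this keeps the validation accuracy of the averaged model at least as high as that of the best individual run.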
Best regards,
Alexandre