Hello,
I recently had the opportunity to read your ERM++ paper, and I'd like to congratulate you on your excellent work. As the author of the DiWA paper, I see a few reasons that could explain why, with your implementation, "DIWA is unable to outperform ERM++".
DiWA requires a shared pre-trained initialization, meaning that all runs should start from the same linear-probed classifier. In contrast, ERM++ performs independent warm-ups.
DiWA's effectiveness stems from diversity across runs, which can be enhanced by using varied hyperparameters. In contrast, ERM++ employs the same default hyperparameters for all runs.
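For illustration only, here is a minimal Python sketch of what I mean by a mild distribution; the specific ranges and hyperparameter names are placeholders, not the values used in either paper:

```python
import random

def sample_mild_hparams(seed):
    """Illustrative 'mild' distribution around the defaults: each run draws
    its own values so the runs stay averageable yet diverse.
    The ranges below are only an example, not a recommendation."""
    rng = random.Random(seed)
    return {
        "lr": 10 ** rng.uniform(-5.0, -4.0),            # around a 5e-5 default
        "weight_decay": 10 ** rng.uniform(-6.0, -4.0),
        "dropout": rng.choice([0.0, 0.1, 0.5]),
        "batch_size": rng.choice([32, 64]),
    }

# One independent draw per run, all starting from the same shared initialization.
hparams_per_run = [sample_mild_hparams(seed) for seed in range(5)]
```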
Introducing additional stochasticity during training can also improve diversity. Thus, DiWA is most effective when used with dropout. In contrast, ERM++ deactivates dropout.
Lastly, a moving average within each run reduces diversity across runs, making it poorly compatible with DiWA. Instead, I would suggest averaging only the final/best weights of each run or, even better, only the top-k best weights from each run.
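Concretely, the averaging I have in mind is just a uniform mean over the final state dict of each run; a minimal PyTorch sketch (the checkpoint paths and surrounding training code are assumed):

```python
import torch

def average_state_dicts(sds):
    """Uniform average of the final checkpoint of each run (DiWA-style),
    rather than a moving average taken within a single run."""
    avg = {}
    for key, ref in sds[0].items():
        if ref.is_floating_point():
            # Average the same tensor across all runs.
            avg[key] = torch.stack([sd[key] for sd in sds]).mean(dim=0)
        else:
            # Integer buffers (e.g. BatchNorm's num_batches_tracked) are kept as-is.
            avg[key] = ref.clone()
    return avg

# Hypothetical usage: load the last checkpoint of each of the 5 runs and average them.
# final_sds = [torch.load(f"run{i}/final.pt", map_location="cpu") for i in range(5)]
# model.load_state_dict(average_state_dicts(final_sds))
```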
It would be helpful to include a simple baseline: DiWA with 5 runs on your train/val splits, using your AugMix initialization, hyperparameters sampled from a mild distribution, dropout enabled, BN unfrozen, averaging only the final weights of each run, and optionally with greedy selection.
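In case it helps, here is a rough sketch of the greedy selection variant, assuming hypothetical `build_model` and `evaluate` helpers that rebuild a model from a state dict and return its validation accuracy:

```python
import torch

def average_state_dicts(sds):
    """Uniform average of loaded state dicts (same helper as in the sketch above)."""
    avg = {}
    for key, ref in sds[0].items():
        if ref.is_floating_point():
            avg[key] = torch.stack([sd[key] for sd in sds]).mean(dim=0)
        else:
            avg[key] = ref.clone()
    return avg

def greedy_selection(state_dicts, build_model, evaluate):
    """Greedy variant: rank runs by individual validation accuracy, then keep a run
    only if adding it to the average does not hurt the validation accuracy of the
    averaged model. `build_model(state_dict)` and `evaluate(model)` are hypothetical
    helpers standing in for your own model construction and validation code."""
    ranked = sorted(state_dicts, key=lambda sd: evaluate(build_model(sd)), reverse=True)
    selected = [ranked[0]]
    best_acc = evaluate(build_model(average_state_dicts(selected)))
    for sd in ranked[1:]:
        acc = evaluate(build_model(average_state_dicts(selected + [sd])))
        if acc >= best_acc:
            selected.append(sd)
            best_acc = acc
    return average_state_dicts(selected)
```

By construction, this keeps the validation accuracy of the averaged model at least as high as that of the best individual run.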
Best regards,
Alexandre