Multi-epoch training was performed on the test set.

Thanks for the question!

To my knowledge, both the single-epoch and multi-epoch settings are commonly used in prior literature. The code for TTT and Tent uses a single epoch, whereas SHOT, another baseline method we compared against, uses multiple epochs.

I personally lean towards the multi-epoch setting (with an oracle for model selection) for evaluation and comparison. In the single-epoch setting, the adaptation performance is often quite sensitive to the choice of learning rate, which can make comparisons noisy. In our multi-epoch evaluation, by contrast, we chose relatively small learning rates and ran the adaptation long enough to estimate the effectiveness of each algorithm reliably.
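To make the distinction concrete, here is a toy sketch of the two protocols. It is not the TTT++ code: the scalar objective, the names `adapt` and `oracle_select`, and all hyperparameters are made up for illustration. The "oracle" uses the ground-truth shift to pick the best per-epoch checkpoint, which is only possible in an evaluation setting.

```python
import random

def adapt(batches, lr, epochs, theta=0.0):
    """Run `epochs` passes of SGD over unlabeled test batches.

    Toy objective (an assumption for illustration): pull a scalar offset
    `theta` towards the mean of the test features.
    Returns the value of theta at the end of every epoch.
    """
    history = []
    for _ in range(epochs):
        for batch in batches:
            batch_mean = sum(batch) / len(batch)
            grad = 2.0 * (theta - batch_mean)  # d/dtheta of (theta - batch_mean)^2
            theta -= lr * grad
        history.append(theta)
    return history

def oracle_select(history, error_fn):
    """Pick the per-epoch checkpoint with the lowest error (needs ground truth)."""
    return min(history, key=error_fn)

random.seed(0)
true_shift = 3.0  # hypothetical distribution shift the adaptation should recover
batches = [[random.gauss(true_shift, 1.0) for _ in range(8)] for _ in range(20)]

def error(theta):
    return abs(theta - true_shift)

# Single epoch: the final error depends strongly on the learning rate.
single = {lr: error(adapt(batches, lr, epochs=1)[-1]) for lr in (0.001, 0.01, 0.1, 0.4)}

# Multi-epoch: small learning rate, many passes, oracle checkpoint selection.
multi_history = adapt(batches, lr=0.01, epochs=50)
multi = error(oracle_select(multi_history, error))

for lr, err in single.items():
    print(f"single-epoch lr={lr}: error={err:.3f}")
print(f"multi-epoch  lr=0.01 (oracle): error={multi:.3f}")
```

In this toy, a single pass with a small learning rate barely moves the parameter, while a large one mostly tracks the last few batches; the multi-epoch run with a small learning rate avoids both failure modes at the cost of more compute.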

Besides, even in practice, I believe that running multiple epochs over the test examples at hand is still the better choice when computational time allows. This is probably a subjective opinion, though.

P.S. Why do you think the multi-epoch setting does not make sense in practice?

vita-epfl / ttt-plus-plus

Multi-epoch training was performed on the test set. #3