Closed youzunzhi closed 1 year ago
Hi, thanks for your interest in our work. We also noticed this 'outlier' stat. It is most likely an optimization artefact: when no strong regularization is present (such as heavy augmentations), ViT behaviour is hard to predict. This is already improved upon by adding the ConvStem (as can be seen from the same comparison). In this particular instance, the l-inf numbers are indeed slightly worse in Table 1, but the unseen-threat-model numbers are better (likely because no strong augmentation/regularization is used). All in all, we think that in the low-epoch regime as well, ViTs require features learned with stronger regularization, even if you initialize the model from a slightly better starting point.
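(For readers unfamiliar with what "heavy augmentations" typically means for ViT training: DeiT-style recipes combine techniques such as RandAugment, random erasing, and mixup. As a purely illustrative sketch, not the recipe used in this repo, here is a minimal NumPy implementation of mixup, one of those regularizers:)

```python
import numpy as np

def mixup(x, y, alpha=0.8, rng=None):
    """Mixup regularization: replace each (image, one-hot label) pair with a
    convex combination of itself and a randomly chosen partner in the batch.

    x: (N, ...) batch of inputs, y: (N, C) one-hot labels.
    alpha: Beta-distribution parameter; 0.8 is a common ViT-recipe value.
    """
    rng = rng or np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)          # mixing coefficient in (0, 1)
    idx = rng.permutation(len(x))         # random pairing within the batch
    x_mixed = lam * x + (1 - lam) * x[idx]
    y_mixed = lam * y + (1 - lam) * y[idx]
    return x_mixed, y_mixed
```

The soft labels force the model to produce calibrated, interpolated predictions instead of memorizing hard targets, which is one reason such augmentations act as strong regularizers for ViTs.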
That makes sense to me. Thank you so much for your response!
Thank you for the great work. I am wondering if you can explain the difference between Tables 1 and 2. For example, in Table 1 ViT-S achieves (60.3, 30.4), while in Table 2 it achieves (61.5, 31.8) with random initialization and basic augmentation. My understanding is that the models in Table 1 are pretrained with standard training for 100 epochs, while Table 2 uses random initialization. If that is the case, why is Table 1 worse than Table 2? Thank you very much!