Open rwightman opened 1 year ago
Yes, confirming that. In my case the training loss was decreasing smoothly; the only signal came from the train OCR metrics, and the eval OCR metrics follow the same pattern. Current step: trying to reproduce from checkpoints saved just before the instability. Next: check with other base vision models and other resolutions.
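For anyone trying the same reproduction, here is a rough sketch of how one might pick the checkpoint to resume from when the loss curve gives no signal and only the OCR error rate does. Everything in it is an assumption for illustration: the function names, the `jump` and `window` thresholds, and the logged values are made up and are not taken from the actual runs.

```python
def first_collapse_step(steps, error_rates, jump=0.3, window=3):
    """Return the first step whose error rate exceeds the mean of the
    previous `window` logged values by at least `jump`, or None.

    `jump` and `window` are hypothetical thresholds; tune them per run.
    """
    for i in range(window, len(error_rates)):
        baseline = sum(error_rates[i - window:i]) / window
        if error_rates[i] - baseline >= jump:
            return steps[i]
    return None


def last_checkpoint_before(checkpoint_steps, collapse_step):
    """Return the latest checkpoint step strictly before the collapse, or None."""
    earlier = [s for s in checkpoint_steps if s < collapse_step]
    return max(earlier) if earlier else None


# Made-up logged values: the error rate improves, then jumps back toward 100%.
steps = [1000, 2000, 3000, 4000, 5000, 6000, 7000]
error_rates = [0.90, 0.55, 0.30, 0.22, 0.18, 0.97, 1.00]

collapse = first_collapse_step(steps, error_rates)
resume_from = last_checkpoint_before([2500, 5000, 7500], collapse)
print(collapse, resume_from)
```

Comparing against a short trailing mean rather than the single previous value keeps ordinary metric noise from triggering a false detection, at the cost of needing `window` logged points before anything can be flagged.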
With Swin, this kind of instability does not appear at higher resolution, but the loss landscape is overall much noisier than with ViT. The position embeddings are interpolated to the higher resolution using timm. With ViT, instability is not guaranteed, but the loss can get stuck in a "plateau" rather than a spike, where OCR metrics are badly degraded and do not recover. This does not happen with a Swin encoder.
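For context on the position-embedding step, this is a minimal plain-PyTorch sketch of the general technique: reshape the learned grid embeddings to 2D, resize them, and flatten again. It is not timm's implementation and not necessarily what was run here; the function name, the bicubic mode, the grid sizes, and the single prefix (class) token are all assumptions.

```python
import torch
import torch.nn.functional as F


def resample_pos_embed(pos_embed, old_grid, new_grid, num_prefix_tokens=1):
    """Resize absolute position embeddings from old_grid to new_grid.

    pos_embed: tensor of shape (1, num_prefix_tokens + old_h * old_w, dim).
    Prefix tokens (e.g. a class token) are passed through unchanged.
    """
    prefix = pos_embed[:, :num_prefix_tokens]
    grid = pos_embed[:, num_prefix_tokens:]
    old_h, old_w = old_grid
    new_h, new_w = new_grid
    dim = grid.shape[-1]

    # (1, N, dim) -> (1, dim, old_h, old_w) so interpolate treats it as an image.
    grid = grid.reshape(1, old_h, old_w, dim).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(new_h, new_w), mode="bicubic", align_corners=False)
    # Back to (1, new_h * new_w, dim).
    grid = grid.permute(0, 2, 3, 1).reshape(1, new_h * new_w, dim)
    return torch.cat([prefix, grid], dim=1)


# Hypothetical sizes: a 14x14 patch grid resized to 32x32, embedding dim 768.
pos_embed = torch.randn(1, 1 + 14 * 14, 768)
resized = resample_pos_embed(pos_embed, (14, 14), (32, 32))
print(resized.shape)
```

One thing worth checking when debugging the high-resolution runs is whether the resized embeddings are then fine-tuned or frozen, since the interpolated values are only an initialization for the new grid.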
Moving to higher resolution with a base ViT model, there appear to be training issues compared with lower resolution. As with lower res, training appears to converge initially and starts to look good, and then there is a sudden loss of ability.
@molbap observed this where the OCR metrics appeared to jump back to 100% error (correct me if wrong)
I also observed behaviour similar to this, but in my case the train loss also jumped to a higher value and did not improve afterwards.