[BUG] Donut style training (Cruller) w/ vit appears unstable at higher resolution

huggingface / pixparse

Pixel Parsing. A reproduction of OCR-free end-to-end document understanding models with open data


[BUG] Donut style training (Cruller) w/ vit appears unstable at higher resolution #9

Open rwightman opened 1 year ago

rwightman commented 1 year ago

When moving to a higher resolution with a base ViT model, training appears to have issues that do not show up at lower resolution. As at lower resolution, training converges initially and starts to look good, but then there is a sudden loss of ability.

@molbap observed this as the OCR metrics appearing to jump back to 100% error (correct me if I'm wrong).

I also observed behaviour similar to this, but in my case the train loss also jumped to a higher value and did not improve afterwards.

molbap commented 1 year ago

Yes, confirming that. In my case the training loss was decreasing smoothly; the only signal came from the train OCR metrics. Eval OCR metrics follow the same pattern. Current step: trying to reproduce from checkpoints taken just before the instability. Next: check with other base vision models and other resolutions.
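For anyone else trying to reproduce this, here is a minimal sketch of how one might locate the collapse in a logged OCR error curve and pick the checkpoint to resume from. It is not code from pixparse: the function names, the `threshold` and `patience` parameters, and the numbers in the example are all hypothetical and only illustrate the idea.

```python
def find_collapse_step(steps, error_rates, threshold=0.95, patience=3):
    """Return the first step of a run of `patience` consecutive evaluations
    whose error rate is >= `threshold`, or None if no such run exists.

    `threshold` and `patience` are assumed values, not taken from the issue.
    """
    run_start = None
    run_len = 0
    for step, err in zip(steps, error_rates):
        if err >= threshold:
            if run_len == 0:
                run_start = step  # remember where the bad run began
            run_len += 1
            if run_len >= patience:
                return run_start
        else:
            run_len = 0  # metric recovered, so this was not a collapse
    return None


def last_checkpoint_before(checkpoint_steps, collapse_step):
    """Return the latest checkpoint step strictly before the collapse, or None."""
    earlier = [s for s in checkpoint_steps if s < collapse_step]
    return max(earlier) if earlier else None


# Made-up illustrative values, not measurements from this issue.
steps = [1000, 2000, 3000, 4000, 5000, 6000, 7000]
error_rates = [0.90, 0.45, 0.20, 0.12, 1.00, 1.00, 1.00]
checkpoint_steps = [2500, 4500, 6500]

collapse_step = find_collapse_step(steps, error_rates)
resume_step = last_checkpoint_before(checkpoint_steps, collapse_step)
```

Requiring several consecutive bad evaluations avoids flagging a single noisy evaluation as a collapse; since the train loss can stay smooth, the check has to run on the OCR metric rather than on the loss.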

molbap commented 11 months ago

With Swin, this kind of instability does not appear at higher resolution, but the loss is overall much noisier than with ViT. The position embeddings are interpolated to the higher resolution using timm. With ViT, training is not certain to become unstable, but the loss can get stuck in a "plateau" rather than a spike, where OCR metrics are very degraded and do not recover. This does not happen with a Swin encoder.
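For context on the position embedding interpolation mentioned above, here is a rough sketch of the general technique in plain PyTorch. It is not timm's implementation and not the pixparse code: the function name `resize_pos_embed`, the bicubic mode, the single prefix (class) token, and the grid sizes are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F


def resize_pos_embed(pos_embed, old_grid, new_grid, num_prefix_tokens=1):
    """Resample a (1, prefix + H*W, C) position embedding to a new patch grid.

    Hypothetical helper: prefix tokens (e.g. the class token) are kept as-is,
    and only the spatial part is interpolated.
    """
    prefix = pos_embed[:, :num_prefix_tokens]
    grid = pos_embed[:, num_prefix_tokens:]
    old_h, old_w = old_grid
    channels = grid.shape[-1]
    # (1, H*W, C) -> (1, C, H, W) so interpolate treats H and W as spatial dims
    grid = grid.reshape(1, old_h, old_w, channels).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=new_grid, mode="bicubic", align_corners=False)
    # (1, C, H', W') -> (1, H'*W', C)
    grid = grid.permute(0, 2, 3, 1).reshape(1, new_grid[0] * new_grid[1], channels)
    return torch.cat([prefix, grid], dim=1)


# Assumed sizes: a 14x14 pretraining grid resampled to a 40x30 document grid.
pos_embed = torch.randn(1, 1 + 14 * 14, 768)
resized = resize_pos_embed(pos_embed, (14, 14), (40, 30))
```

The resampled embeddings are only an initialization for the larger grid, so comparing runs across resolutions also means comparing how far each grid is from the pretraining grid.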