huggingface / pixparse

Pixel Parsing. A reproduction of OCR-free end-to-end document understanding models with open data

[Explore] Vision architecture comparisons at 1280x960 #7

Open rwightman opened 1 year ago

rwightman commented 1 year ago

There are many vision architectures worth trying beyond Donut's choice of Swin v1 or the common choice of a vanilla (or modified) ViT. We should make an effort to explore these options as we run experiments.

NOTE: exact weight instances TBD

Possibly

molbap commented 11 months ago

Checking out the list, I tried several flavors of ViT/Swin, including lower-capacity Swin models (they don't fare as well as 384, even with interpolated pos emb). Overall, the sporadic instability of ViT at very high resolutions (#9) is a bottleneck for trying out these models, but any feedback on these configurations will be helpful later on.
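For readers unfamiliar with what "interpolated pos emb" involves at this resolution, here is a rough, dependency-free sketch. It is not the pixparse code. It assumes a ViT-style model with patch size 16 pretrained at 384x384, reads 1280x960 as (height, width), and uses a single-channel bilinear resize with hypothetical helper names (`patch_grid`, `resize_pos_emb`). Real libraries usually do this with a tensor resize op over all embedding channels, often bicubic.

```python
def patch_grid(img_size, patch_size):
    """Return the (rows, cols) patch grid for img_size=(height, width)."""
    h, w = img_size
    if h % patch_size or w % patch_size:
        raise ValueError("image size must be divisible by the patch size")
    return h // patch_size, w // patch_size


def resize_pos_emb(grid, new_rows, new_cols):
    """Bilinearly resize a 2D grid (list of lists) of one embedding channel.

    Corner values are preserved (align-corners style sampling).
    """
    old_rows, old_cols = len(grid), len(grid[0])
    out = []
    for r in range(new_rows):
        # Map the output row index back into the source grid.
        y = r * (old_rows - 1) / (new_rows - 1) if new_rows > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, old_rows - 1)
        fy = y - y0
        row = []
        for c in range(new_cols):
            x = c * (old_cols - 1) / (new_cols - 1) if new_cols > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, old_cols - 1)
            fx = x - x0
            top = grid[y0][x0] * (1 - fx) + grid[y0][x1] * fx
            bottom = grid[y1][x0] * (1 - fx) + grid[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


# Assumed values for illustration only (not taken from the issue).
PATCH = 16
PRETRAIN_SIZE = (384, 384)
TARGET_SIZE = (1280, 960)

src_rows, src_cols = patch_grid(PRETRAIN_SIZE, PATCH)
dst_rows, dst_cols = patch_grid(TARGET_SIZE, PATCH)
print("pretrain grid:", (src_rows, src_cols), "tokens:", src_rows * src_cols)
print("target grid:", (dst_rows, dst_cols), "tokens:", dst_rows * dst_cols)

# Toy single-channel position embedding, stretched to the target grid.
toy = [[float(r * src_cols + c) for c in range(src_cols)] for r in range(src_rows)]
resized = resize_pos_emb(toy, dst_rows, dst_cols)
```

Under these assumptions the token count grows from 576 to 4800, so the position embeddings are stretched well beyond the grid they were trained on, and global self-attention cost grows roughly with the square of that ratio.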