[Explore] Vision architecture comparisons at 1280x960

huggingface / pixparse

Pixel Parsing. A reproduction of OCR-free end-to-end document understanding models with open data


There are many vision architectures worth trying beyond Donut's choice of Swin v1 or the common choice of a vanilla (or modified) ViT. We should make an effort to explore these options as we run experiments.

NOTE: exact weight instances TBD

vit_base (original, clip, eva, beit variants)
- vit_base_patch16_224.augreg_in21k
- vit_base_patch16_clip_224.datacompxl
- eva02_base_patch14_224.mim_in22k
convnext_base (w/o attention pooling)
- convnext_base.clip_laiona_augreg_320
- convnext_base.fb_in22k
convnext_base (w/ attention pooling, needs work)
- TBD
swin_base
swinv2_base
maxvit_small_tf
maxvit_rmlp_small_rw
coatnet_rmlp_2_rw_224
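
As a rough aid for comparing these candidates at 1280x960, the sketch below estimates how many feature tokens each backbone would hand to the decoder. It assumes 1280 is the height and 960 the width (portrait pages), which the issue does not state. It uses the patch size as the stride for the plain ViTs and a total stride of 32 for the hierarchical models. The `Backbone` class and `token_grid` helper are hypothetical names for illustration, not part of pixparse or timm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backbone:
    name: str
    stride: int  # total downsampling factor from input pixels to final feature map


# Assumed strides: patch size for plain ViTs, 32 for hierarchical backbones.
CANDIDATES = [
    Backbone("vit_base_patch16_224", 16),
    Backbone("eva02_base_patch14_224", 14),
    Backbone("convnext_base", 32),
    Backbone("swin_base", 32),
    Backbone("maxvit_small_tf", 32),
]


def token_grid(height: int, width: int, stride: int):
    """Return (rows, cols, tokens, exact) for the final feature grid.

    `exact` is False when the input is not divisible by the stride, in which
    case the image would need padding or resizing before it fits cleanly.
    """
    rows, cols = height // stride, width // stride
    exact = height % stride == 0 and width % stride == 0
    return rows, cols, rows * cols, exact


# Assumption: portrait orientation, height 1280 and width 960.
HEIGHT, WIDTH = 1280, 960

for backbone in CANDIDATES:
    rows, cols, tokens, exact = token_grid(HEIGHT, WIDTH, backbone.stride)
    note = "" if exact else " (not divisible by stride; needs padding or resizing)"
    print(f"{backbone.name}: {rows}x{cols} = {tokens} tokens{note}")
```

Under these assumptions a patch-16 ViT yields 4800 tokens and a stride-32 backbone yields 1200, while a patch-14 model does not divide 1280x960 evenly. Windowed-attention models may impose further divisibility constraints tied to their window size, which this sketch ignores.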

Possibly

visformer_small
davit_base
caformer_b36 / caformer_m36
convformer_b36 / convformer_m36

[Explore] Vision architecture comparisons at 1280x960 #7