huggingface / pixparse

Pixel Parsing. A reproduction of OCR-free end-to-end document understanding models with open data

[Explore] Where to take vision features? #8

Open rwightman opened 12 months ago

rwightman commented 12 months ago

Donut uses the Swin v1 features taken before the final LayerNorm layer (model.norm).

For ViT, we currently take features after the final norm. That is the usual choice for many downstream applications, but I'm not sure which is best here.

We should compare the two options: features taken before the final norm and features taken after it.
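
To make the two options concrete, here is a minimal pure-Python sketch. It is not pixparse or timm code: `ToyEncoder`, its `blocks` placeholder, and the `apply_final_norm` flag are hypothetical names invented for illustration, and the norm has no learned scale or shift. It only shows that the same backbone output can be returned either before or after the final norm.

```python
import math


def layer_norm(token, eps=1e-6):
    """Normalize one token vector to zero mean and unit variance.

    Simplified: no learned scale/shift parameters.
    """
    mean = sum(token) / len(token)
    var = sum((v - mean) ** 2 for v in token) / len(token)
    return [(v - mean) / math.sqrt(var + eps) for v in token]


class ToyEncoder:
    """Hypothetical stand-in for a backbone: blocks followed by a final norm."""

    def __init__(self, scale=3.0, shift=10.0):
        self.scale = scale
        self.shift = shift

    def blocks(self, tokens):
        # Placeholder for the transformer blocks: a fixed affine map per value.
        return [[self.scale * v + self.shift for v in tok] for tok in tokens]

    def norm(self, tokens):
        # Final norm, applied independently to each token.
        return [layer_norm(tok) for tok in tokens]

    def forward_features(self, tokens, apply_final_norm=True):
        feats = self.blocks(tokens)
        return self.norm(feats) if apply_final_norm else feats


encoder = ToyEncoder()
tokens = [[0.0, 1.0, 2.0, 3.0], [1.0, 1.0, 5.0, 9.0]]

# Option A: features before the final norm (what Donut does with Swin v1).
pre_norm = encoder.forward_features(tokens, apply_final_norm=False)

# Option B: features after the final norm (what we currently do for ViT).
post_norm = encoder.forward_features(tokens, apply_final_norm=True)
```

The pre-norm features keep the raw scale and offset of the last block's output, while the post-norm features are standardized per token, so the decoder sees inputs with quite different statistics in the two cases.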

molbap commented 11 months ago

Continued in #12