[Explore] Where to take vision features?

Donut uses the Swin v1 features taken before the final LayerNorm layer (model.norm).

For ViT, we currently take features after the final norm. That is the usual choice for many downstream applications, but I'm not sure it's the best one here.

We should compare:

- final features after the norm
- final features without the norm
- final features with the class token removed (if it's a ViT with a class token)
- penultimate features (drop the last block)
- penultimate feature map (drop the last stage; for hierarchical-resolution models, e.g. Swin, this gives a higher-resolution feature map)
- FPN (for hierarchical-resolution models, merge features across resolutions via an FPN)
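
To make the candidates concrete, here is a rough PyTorch sketch of where each feature would be tapped. `ToyViT`, `ToyHierarchical`, `vit_feature_variants` and all sizes are hypothetical stand-ins made up for illustration, not pixparse or timm code. The attribute names `blocks`, `norm` and `num_prefix_tokens` are chosen to mirror what timm's ViT exposes as far as I know, and the FPN here is a simplified top-down merge (1x1 laterals plus upsample-and-add, no 3x3 output convs).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyViT(nn.Module):
    """Hypothetical stand-in for a ViT encoder: patch embed -> blocks -> final norm."""

    def __init__(self, dim=32, depth=4, patch=8, img=32):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, (img // patch) ** 2 + 1, dim))
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(dim, nhead=4, dim_feedforward=4 * dim,
                                       dropout=0.0, batch_first=True, norm_first=True)
            for _ in range(depth)
        ])
        self.norm = nn.LayerNorm(dim)
        self.num_prefix_tokens = 1  # class token only

    def embed(self, x):
        x = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, C)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        return torch.cat([cls, x], dim=1) + self.pos_embed


def vit_feature_variants(model, images):
    """Collect the ViT candidates from the list above in a single forward pass."""
    x = model.embed(images)
    per_block = []
    for blk in model.blocks:
        x = blk(x)
        per_block.append(x)
    final_no_norm = per_block[-1]
    final_norm = model.norm(final_no_norm)
    return {
        "final_norm": final_norm,
        "final_no_norm": final_no_norm,
        "final_norm_no_cls": final_norm[:, model.num_prefix_tokens:],
        "penultimate_block": per_block[-2],  # output before the last block runs
    }


class ToyHierarchical(nn.Module):
    """Hypothetical Swin-like stand-in: each stage halves the spatial resolution."""

    def __init__(self, dims=(16, 32, 64)):
        super().__init__()
        chans = (3,) + tuple(dims)
        self.stages = nn.ModuleList([
            nn.Conv2d(chans[i], chans[i + 1], kernel_size=3, stride=2, padding=1)
            for i in range(len(dims))
        ])
        # 1x1 lateral convs project every stage to a common channel width
        self.laterals = nn.ModuleList([nn.Conv2d(d, dims[0], kernel_size=1) for d in dims])

    def stage_maps(self, x):
        maps = []
        for stage in self.stages:
            x = stage(x)
            maps.append(x)
        return maps

    def fpn(self, maps):
        # Top-down pathway: start at the coarsest map, upsample, add the finer lateral
        laterals = list(self.laterals)
        merged = laterals[-1](maps[-1])
        for lat, fmap in zip(laterals[-2::-1], maps[-2::-1]):
            merged = lat(fmap) + F.interpolate(merged, size=fmap.shape[-2:], mode="nearest")
        return merged


torch.manual_seed(0)
images = torch.randn(2, 3, 32, 32)

vit = ToyViT().eval()
hier = ToyHierarchical().eval()
with torch.no_grad():
    feats = vit_feature_variants(vit, images)
    maps = hier.stage_maps(images)
    fpn_map = hier.fpn(maps)

for name, value in feats.items():
    print(name, tuple(value.shape))
print("penultimate stage map", tuple(maps[-2].shape))
print("fpn map", tuple(fpn_map.shape))
```

In this toy setup the penultimate stage map is `maps[-2]` and the FPN output sits at the resolution of the finest stage. On a real backbone the same taps could probably be done with forward hooks rather than a hand-written loop, but that would need checking against the actual model code.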

huggingface/pixparse, issue #8