When it comes to finetuning a pretrained pixparse model on document classification or layout prediction, we have a few options:
Pure vision: use only pooled vision encoder features.
Vision + text decoder: combine vision encoder features with text decoder features (no leakage, i.e. the input is still just an image).
Vision encoder + external OCR: combine both (as DocFormer did).
Initial thoughts/answers
For pure classification/layout prediction, we'd use the image encoder only and add back pooling and a classifier head.
Currently we're using a ViT with a class token (the default for most standard and CLIP ViTs), but we aren't pooling: we pass the full sequence (class token + spatial tokens) to the text model. We might want to compare stripping the class token and passing only the spatial tokens.
---> need to compare average pooling vs. class-token pooling vs. just spatial tokens (minimal sketch below)
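A minimal PyTorch sketch of the three pooling strategies, just to pin down what's being compared. It assumes the encoder returns tokens of shape (B, 1 + N, D) with the class token at index 0; `DocClassifierHead` and its `pool` argument are hypothetical names, not existing pixparse code.

```python
import torch
import torch.nn as nn

class DocClassifierHead(nn.Module):
    """Classifier head with a switchable pooling strategy (hypothetical)."""

    def __init__(self, dim: int, num_classes: int, pool: str = "avg"):
        super().__init__()
        assert pool in ("avg", "token", "spatial")
        self.pool = pool
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, 1 + N, D), class token assumed at index 0
        if self.pool == "avg":
            pooled = tokens.mean(dim=1)          # average over all tokens
        elif self.pool == "token":
            pooled = tokens[:, 0]                # class-token pooling
        else:                                    # "spatial"
            pooled = tokens[:, 1:].mean(dim=1)   # drop class token, average the rest
        return self.fc(pooled)

# e.g. a ViT-B/16 at 224px: 768-dim tokens, 196 spatial tokens
head = DocClassifierHead(dim=768, num_classes=16, pool="spatial")
logits = head(torch.randn(2, 1 + 196, 768))      # -> (2, 16)
```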
For OCR: using the text decoder might make sense too, adding a classification head to it; it'd probably do better, but with a lot more params.
Ross thinks (and I agree) that we should avoid OCR for anything but dataset prep in this work, so we're not keen on the last option, though it would work as others have shown.
--> DocFormer, LayoutLM, and a few others do it with external OCR; if we go that route, we should at least use our own OCR engine. Ultimately we want to get rid of OCR in the ML pipeline.
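If we do try the text-decoder option, a hedged sketch of where the head would sit: run the decoder conditioned on the vision tokens (the input is still just an image, no OCR at inference) and pool its hidden states. The `encoder`/`decoder` call signatures and shapes below are assumptions for illustration, not the actual pixparse API.

```python
import torch
import torch.nn as nn

class DecoderClassifier(nn.Module):
    """Classification head over text-decoder features (sketch, assumed API)."""

    def __init__(self, encoder: nn.Module, decoder: nn.Module,
                 dim: int, num_classes: int):
        super().__init__()
        self.encoder = encoder  # pretrained ViT, frozen or finetuned
        self.decoder = decoder  # pretrained text decoder; assumed to return
                                # last hidden states of shape (B, T, D)
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, images: torch.Tensor, prompt_ids: torch.Tensor) -> torch.Tensor:
        vis_tokens = self.encoder(images)                        # (B, 1 + N, D)
        hidden = self.decoder(input_ids=prompt_ids,
                              encoder_hidden_states=vis_tokens)  # (B, T, D)
        # mean-pool the decoder sequence; last-token pooling is an alternative
        return self.fc(hidden.mean(dim=1))
```

This also makes the param cost concrete: the whole decoder stays in the finetuned model, rather than just the encoder plus a head.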
TODO:
So, the list of things to compare (for now), for classification/layout prediction, factored by pooling strategies (toy harness sketched after the list):
performance with the image encoder only
performance with the image encoder + text generated by the text decoder
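A toy harness for the comparison matrix (feature source x pooling strategy), just to fix the bookkeeping; `run_eval` is a stand-in for the real finetune + eval loop and the printed numbers are random placeholders.

```python
import itertools
import random

FEATURES = ("image_only", "image_plus_decoder")
POOLING = ("avg", "token", "spatial")

def run_eval(features: str, pool: str) -> float:
    # placeholder: swap in the actual finetune + eval on the chosen benchmark
    return random.random()

for features, pool in itertools.product(FEATURES, POOLING):
    acc = run_eval(features, pool)
    print(f"{features:>20s} | pool={pool:<7s} | acc={acc:.3f}")
```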