huggingface / pixparse

Pixel Parsing. A reproduction of OCR-free end-to-end document understanding models with open data

[Explore] pooling strategies on vision encoder #12

Open · molbap opened this issue 1 year ago

molbap commented 1 year ago

For finetuning a pretrained pixparse model on document classification or layout prediction, we have a few options.

Initial thoughts/answers:

- For pure classification/layout prediction, we'd use the image encoder only and add back pooling and a classifier head.
- Currently we're using a ViT with class tokens (the default for most of the standard and CLIP ViTs), but we aren't pooling: we pass the full sequence (class token + spatial tokens) to the text model. We might want to compare stripping the class token and passing just the spatial tokens ---> need to compare average pooling vs. class-token pooling vs. spatial tokens only (see the sketch after this list).
- For OCR: using the text decoder might make sense too, adding a classification head to that; it'd probably do better, but with a lot more params.
- (Ross) thinks (I agree) we're trying to avoid OCR for anything but dataset prep in this work, so not keen on the last option, but it'd work, as others do it ---> DocFormer, LayoutLM, and a few others do it with external OCR; at the least we should do it with our own OCR engine. Ultimately we want to get rid of OCR in the ML pipeline.
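
Concretely, the three options are just different reductions over the encoder's output sequence. A minimal PyTorch sketch, assuming the encoder returns a `(B, 1 + N, D)` tensor with the class token at index 0; `PooledClassifier` and the dimensions below are illustrative, not the actual pixparse API:

```python
import torch
import torch.nn as nn


class PooledClassifier(nn.Module):
    """Hypothetical classifier head over a ViT output of shape (B, 1 + N, D),
    where index 0 is the class token and the rest are spatial tokens."""

    def __init__(self, dim: int, num_classes: int, pool: str = "avg"):
        super().__init__()
        assert pool in ("avg", "token", "spatial_avg")
        self.pool = pool
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        if self.pool == "token":
            pooled = tokens[:, 0]               # class-token pooling
        elif self.pool == "spatial_avg":
            pooled = tokens[:, 1:].mean(dim=1)  # strip class token, average spatial tokens
        else:
            pooled = tokens.mean(dim=1)         # average over the full sequence
        return self.head(pooled)


# Dummy encoder output: batch of 2, 1 class token + 196 spatial tokens, dim 768.
feats = torch.randn(2, 1 + 196, 768)
for pool in ("avg", "token", "spatial_avg"):
    logits = PooledClassifier(768, 16, pool)(feats)
    print(pool, logits.shape)  # torch.Size([2, 16]) for each strategy
```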

TODO:

So, the list of things to compare (for now), for classification/layout prediction, factored by pooling strategy:

- average pooling over the full token sequence
- class-token pooling (take the class token only)
- spatial tokens only (class token stripped, then average-pooled)
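
For running the comparison itself: if the vision tower is a standard timm ViT (as the class-token default above suggests), the first two arms can be driven through timm's `global_pool` argument, where `'token'` takes the class token and `'avg'` mean-pools the spatial tokens with prefix tokens excluded. A hedged sketch; the model name and class count are placeholders:

```python
import timm

for pool in ("token", "avg"):
    model = timm.create_model(
        "vit_base_patch16_224",
        pretrained=False,  # pretrained pixparse encoder weights would be loaded separately
        num_classes=16,
        global_pool=pool,
    )
```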