Implement Vision Grid Transformer for Document Layout Analysis

gregbugaj commented 7 months ago

AlibabaResearch recently published a new model for Document Layout Analysis (DLA) that sets a new benchmark on the task.

Introduction - "To fully leverage multi-modal information and exploit pre-training techniques to learn better representation for DLA, in this paper, we present VGT, a two-stream Vision Grid Transformer, in which Grid Transformer (GiT) is proposed and pre-trained for 2D token-level and segment-level semantic understanding." Paper: https://arxiv.org/abs/2308.14978

Effect on LLM usage - VGT can segment the page into regions (headers, subheaders, titles, etc.), which can then be OCRed individually and passed to an LLM for RAG.
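To make that flow concrete, here is a rough sketch of how layout output could feed OCR and then RAG chunking. It assumes the layout model returns labeled bounding boxes. `Region`, `Chunk`, `regions_to_chunks`, and the `ocr` callable are hypothetical names for illustration, not part of VGT or marie-ai, and the reading-order sort is a naive stand-in.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

BBox = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


@dataclass
class Region:
    """Hypothetical layout-model output: a class label and its bounding box."""
    label: str
    bbox: BBox


@dataclass
class Chunk:
    """Text unit handed to the RAG indexer, with layout metadata attached."""
    label: str
    bbox: BBox
    text: str


def reading_order(regions: List[Region]) -> List[Region]:
    # Naive ordering: top-to-bottom, then left-to-right.
    # Multi-column pages would need a real reading-order step.
    return sorted(regions, key=lambda r: (r.bbox[1], r.bbox[0]))


def regions_to_chunks(regions: List[Region],
                      ocr: Callable[[BBox], str]) -> List[Chunk]:
    """OCR each region separately and keep its label as chunk metadata."""
    chunks = []
    for region in reading_order(regions):
        text = ocr(region.bbox).strip()
        if text:  # skip regions with no recognized text, e.g. figures
            chunks.append(Chunk(region.label, region.bbox, text))
    return chunks


if __name__ == "__main__":
    # Stand-in OCR backed by a lookup table; a real engine would crop the image.
    fake_ocr = {
        (10, 10, 200, 40): "Quarterly Report",
        (10, 100, 200, 150): " Revenue grew in Q3. ",
    }
    page = [
        Region("paragraph", (10, 100, 200, 150)),
        Region("title", (10, 10, 200, 40)),
    ]
    for chunk in regions_to_chunks(page, lambda box: fake_ocr.get(box, "")):
        print(chunk.label, "|", chunk.text)
```

Keeping the label on each chunk lets the retrieval side filter or weight by region type, for example preferring titles and headers when building section summaries.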

shuaills commented 6 months ago

I'm working on a similar project and am excited to see that you have already started. I'm curious about your progress. If needed, I can offer my help.

gregbugaj commented 6 months ago

That would be great. I have started looking at Advanced Literate Machinery.

I was not able to obtain the weights to test the model, but it does look very good.

marieai / marie-ai

Implement Vision Grid Transformer for Document Layout Analysis #100