[feature request] to skip FullTextParse on certain page region

frankang commented 2 years ago

The provided model cannot correctly categorize some "vaguely" plotted Figures and Tables. In this case, the words in the Table region are treated as normal Text, which disrupts the reading order. IMHO, one solution is to parse the PDF file and use certain rules to detect Figures and Tables; we could then pass this region information to Grobid to preempt FullTextParse on those "hard" parts. Another solution could be to expose an API for the sequence labeling task, so we can directly pass an ALTO (XML) file from which those regions have been manually cleared and let Grobid finish the remaining procedures.
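To make the second idea concrete, here is a minimal sketch of what "region-clearing" an ALTO file could look like. It only assumes the standard ALTO attributes HPOS, VPOS, WIDTH and HEIGHT on String elements; the function name `clear_regions`, the `regions` mapping and the sample document are hypothetical, and nothing here is an existing Grobid API.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Return the tag name without its XML namespace, if any."""
    return tag.rsplit("}", 1)[-1]


def clear_regions(alto_xml, regions):
    """Remove String tokens whose centre falls inside a listed region.

    regions is a hypothetical mapping: page index -> list of
    (x0, y0, x1, y1) boxes in the same units as the ALTO file.
    Returns the cleaned XML and the number of removed tokens.
    """
    root = ET.fromstring(alto_xml)
    removed = 0
    pages = [el for el in root.iter() if local(el.tag) == "Page"]
    for page_index, page in enumerate(pages):
        boxes = regions.get(page_index, [])
        if not boxes:
            continue
        lines = [el for el in page.iter() if local(el.tag) == "TextLine"]
        for line in lines:
            for token in list(line):  # copy, since we remove while looping
                if local(token.tag) != "String":
                    continue
                cx = float(token.get("HPOS")) + float(token.get("WIDTH")) / 2
                cy = float(token.get("VPOS")) + float(token.get("HEIGHT")) / 2
                if any(x0 <= cx <= x1 and y0 <= cy <= y1
                       for x0, y0, x1, y1 in boxes):
                    line.remove(token)
                    removed += 1
    return ET.tostring(root, encoding="unicode"), removed


# Tiny made-up document: one body word and one word inside a table.
SAMPLE = """<alto><Layout><Page><PrintSpace><TextBlock>
<TextLine><String HPOS="10" VPOS="10" WIDTH="20" HEIGHT="5" CONTENT="Body"/></TextLine>
<TextLine><String HPOS="100" VPOS="200" WIDTH="20" HEIGHT="5" CONTENT="Cell"/></TextLine>
</TextBlock></PrintSpace></Page></Layout></alto>"""

cleaned, removed = clear_regions(SAMPLE, {0: [(90, 190, 300, 400)]})
```

The sketch leaves empty TextLine elements in place; whether a downstream consumer tolerates them is not something this example checks.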

kermitt2 commented 2 years ago

Hi @frankang

Thank you for the issue!

The recognition of figure and table zones is indeed one of the two main problems with Grobid currently. I also think that figures and tables should be processed first, upstream of the rest of the text body parsing.

I initially thought of using an R-CNN or LayoutLM approach for figures and tables (it works very well for these objects, not so much for the other coarse ones as compared to Grobid), but this is heavy/slow and there is still the issue of correctly associating captions, figure/table titles, table notes, etc. So I started with a different approach.

There is an ongoing branch to tackle this problem, called fix-vector-graphics. Despite its name, this is a relatively important redesign of the model cascading approach:

- Figure and table zones are identified prior to the segmentation model (the new models are called figure-segmentation and table-segmentation). These models are anchored on clustered graphic elements (vector graphics and bitmaps) and are trained to extend the zones up and down, eventually producing well-formed figure/table areas with captions, figures made of several images, etc., or rejecting graphics as mere publisher decorations.
- The segmentation model then applies to the content without these zones, followed by the full text parser; both are simplified because the very noisy figure/table content has been removed.
- The branch comes with more advanced processing of vector graphics, to avoid heavy and possibly very slow steps like rasterizing.
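As a rough illustration of the first two steps, the sketch below groups nearby graphic bounding boxes into candidate zones and then drops the text tokens that fall inside them. It is a plain geometric heuristic written for this thread, not the code of the fix-vector-graphics branch: the `Box` class, `cluster_graphics`, `body_tokens` and the `gap` threshold are all hypothetical names and values, and the trained extension of zones up and down to capture captions is not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    x0: float
    y0: float
    x1: float
    y1: float

    def near(self, other, gap):
        """True if the boxes overlap or lie within `gap` of each other."""
        return (self.x0 - gap <= other.x1 and other.x0 - gap <= self.x1
                and self.y0 - gap <= other.y1 and other.y0 - gap <= self.y1)

    def union(self, other):
        """Smallest box covering both boxes."""
        return Box(min(self.x0, other.x0), min(self.y0, other.y0),
                   max(self.x1, other.x1), max(self.y1, other.y1))

    def contains(self, x, y):
        return self.x0 <= x <= self.x1 and self.y0 <= y <= self.y1


def cluster_graphics(boxes, gap=5.0):
    """Merge graphic boxes that are close together, until nothing changes."""
    clusters = list(boxes)
    changed = True
    while changed:
        changed = False
        merged = []
        for box in clusters:
            for i, existing in enumerate(merged):
                if box.near(existing, gap):
                    merged[i] = box.union(existing)
                    changed = True
                    break
            else:
                merged.append(box)
        clusters = merged
    return clusters


def body_tokens(tokens, zones):
    """Keep only (text, x, y) tokens that lie outside every zone."""
    return [t for t in tokens
            if not any(z.contains(t[1], t[2]) for z in zones)]


# Made-up page: two adjacent graphics, one far-away graphic, two tokens.
graphics = [Box(0, 0, 10, 10), Box(12, 0, 20, 10), Box(100, 100, 110, 110)]
zones = cluster_graphics(graphics, gap=5.0)
remaining = body_tokens([("Fig", 5, 5), ("Intro", 50, 50)], zones)
```

In this toy input the two adjacent graphics collapse into one zone, and only the token outside all zones would be handed to the later models.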

So this is consistent with your suggestion of addressing tables/figures as a first step, but tightly integrated with the usual cascading approach of Grobid. Progress is very slow, because I unfortunately have very little time for Grobid.

frankang commented 2 years ago

Thanks @kermitt2, looking forward to it. BTW, what is the other "main problem" with Grobid? Just curious, as you said figure and table recognition is one of the two "main problems".

kermitt2 commented 2 years ago

@frankang The other main problem for me is that all the models lack training data! For example, there are only 40 training examples for the fulltext model... Each time I add a bit of new training data, the metrics in the end-to-end evaluation increase, so it's a bit frustrating that the tool is running below the accuracy it is capable of.

So if I had 2 months to work only on Grobid, I would spend one to fix the figure/table extraction, and one just producing training data :)

kermitt2 / grobid

[feature request] to skip FullTextParse on certain page region #950