kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

[feature request] to skip FullTextParse on certain page region #950

Open frankang opened 2 years ago

frankang commented 2 years ago

The provided model cannot correctly categorize some "vaguely" plotted figures and tables. In such cases, the words in the table region are treated as normal text, which disrupts the normal reading order. IMHO, one solution is to parse the PDF file and use certain rules to detect figures and tables, then pass this region information to Grobid so that it skips FullTextParse on those "hard" parts. Another solution could be to expose an API for the sequence labeling task, so we can directly pass an ALTO (XML) file with those regions manually cleared and let Grobid finish the remaining steps.
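To make the second option concrete, here is a minimal sketch of the "region-cleared ALTO" pre-processing step only. It assumes the ALTO file uses the standard `HPOS`/`VPOS`/`WIDTH`/`HEIGHT` attributes on `TextBlock` elements and `PHYSICAL_IMG_NR` on `Page` elements; the helper names (`clear_regions`, `overlap_ratio`), the region format, and the 0.5 overlap threshold are all made up for illustration. It does not show how Grobid would consume the result, since that is exactly the API being requested here.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}TextBlock'."""
    return tag.rsplit("}", 1)[-1]


def overlap_ratio(block, region):
    """Fraction of the block's area covered by region (x, y, width, height)."""
    bx, by = float(block.get("HPOS", 0)), float(block.get("VPOS", 0))
    bw, bh = float(block.get("WIDTH", 0)), float(block.get("HEIGHT", 0))
    rx, ry, rw, rh = region
    overlap_w = max(0.0, min(bx + bw, rx + rw) - max(bx, rx))
    overlap_h = max(0.0, min(by + bh, ry + rh) - max(by, ry))
    area = bw * bh
    return (overlap_w * overlap_h) / area if area > 0 else 0.0


def clear_regions(root, regions_by_page, threshold=0.5):
    """Remove TextBlock elements that are mostly covered by an excluded region.

    regions_by_page maps a page number (the PHYSICAL_IMG_NR string) to a list
    of (x, y, width, height) rectangles, e.g. hand-detected tables or figures.
    Returns the number of removed blocks.
    """
    removed = 0
    # Collect pages first so the tree is not modified while being traversed.
    pages = [el for el in root.iter() if local_name(el.tag) == "Page"]
    for page in pages:
        regions = regions_by_page.get(page.get("PHYSICAL_IMG_NR"), [])
        if not regions:
            continue
        for parent in list(page.iter()):
            for child in list(parent):
                if local_name(child.tag) != "TextBlock":
                    continue
                if any(overlap_ratio(child, r) >= threshold for r in regions):
                    parent.remove(child)
                    removed += 1
    return removed


# Tiny hand-written ALTO-like sample (not real pdfalto output).
SAMPLE = """<alto><Layout><Page PHYSICAL_IMG_NR="1"><PrintSpace>
<TextBlock ID="b1" HPOS="50" VPOS="50" WIDTH="400" HEIGHT="100"/>
<TextBlock ID="b2" HPOS="60" VPOS="300" WIDTH="380" HEIGHT="150"/>
</PrintSpace></Page></Layout></alto>"""

root = ET.fromstring(SAMPLE)
# Pretend a rule-based detector flagged this rectangle on page 1 as a table.
removed = clear_regions(root, {"1": [(40, 280, 420, 200)]})
remaining = [el.get("ID") for el in root.iter() if local_name(el.tag) == "TextBlock"]
```

In this sample the second block lies entirely inside the flagged rectangle and is dropped, while the first is kept. A real pipeline would probably want to keep the removed blocks somewhere rather than discard them, so the table or figure content can still be processed separately.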

kermitt2 commented 2 years ago

Hi @frankang

Thank you for the issue!

The recognition of figure and table zones is indeed one of the two main problems with Grobid currently. I also think that figures and tables should be processed first, upstream of the rest of the text body parsing.

I initially thought of using an R-CNN or LayoutLM approach for figures and tables (it works very well for these objects, though not as well as Grobid for the other coarse structures), but this is heavy/slow and there is still the issue of correctly associating captions, figure/table titles, table notes, etc. So I started with a different approach.

There is an ongoing branch to tackle this problem, called fix-vector-graphics. Despite the name of the branch, this is a relatively important redesign of the model cascading approach:

So this is consistent with your suggestion of addressing tables/figures as a first step, but tightly integrated with the usual cascading approach of Grobid. Progress is very slow because I unfortunately have very little time for Grobid.

frankang commented 2 years ago

Thanks @kermitt2, looking forward to it. BTW, what is the other "main problem" with Grobid? Just curious, since you said figure and table recognition is one of the two "main problems".

kermitt2 commented 2 years ago

@frankang The other main problem for me is that all the models lack training data! For example, there are only 40 training examples for the fulltext model... Each time I add a bit of new training data, the metrics in the end-to-end evaluation increase, so it's a bit frustrating that the tool is running with lower accuracy than it is capable of.

So if I had 2 months to work only on Grobid, I would spend one to fix the figure/table extraction, and one just producing training data :)