kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0
3.61k stars, 461 forks

grobid skip pages #1053

Open peilin54 opened 1 year ago

peilin54 commented 1 year ago

Hi, we are running Grobid 0.7.3 to extract references from a PDF. The PDF has line numbers in the left margin and a watermark in the background. Grobid was able to extract references, but fewer than the full PDF contains. We found that Grobid skipped seemingly random pages while processing the reference section. Have you seen this type of issue before, and can you suggest how to make it work? Thanks!

kermitt2 commented 1 year ago

Hi @peilin54 !

Sorry for the slow answer. It's certainly not a common problem. It might be that the full content of those pages is classified not as reference section but as annex, so the corresponding reference entries are overlooked. The misclassification might be related to the watermark, which could be confused with a figure element. One solution is to add a few examples of this sort of article to the training data of the segmentation model. If you can share the document with me and it is CC-BY, I can have a look!
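One way to check this hypothesis yourself is to look at the TEI XML that Grobid produces for the full text and see whether the missing entries ended up in the annex instead of the bibliography. The sketch below is only an illustration, not part of Grobid: `summarize_tei` and `SAMPLE_TEI` are hypothetical names, and it assumes the usual layout of the TEI output, with parsed references as `<biblStruct>` elements inside `<listBibl>` and annex text under `<div type="annex">`. The sample string stands in for your real output.

```python
import xml.etree.ElementTree as ET

# TEI namespace used in the XML output.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def summarize_tei(tei_xml):
    """Count parsed references and collect the text that landed in the annex.

    Hypothetical helper for diagnosis only. Assumes references appear as
    <biblStruct> inside <listBibl> and annex text under <div type="annex">.
    """
    root = ET.fromstring(tei_xml)
    refs = root.findall(".//tei:listBibl/tei:biblStruct", TEI_NS)
    annex_divs = root.findall(".//tei:div[@type='annex']", TEI_NS)
    # Join all text nodes of each annex div and normalize whitespace.
    annex_text = " ".join(
        " ".join("".join(div.itertext()).split()) for div in annex_divs
    )
    return {
        "reference_count": len(refs),
        "annex_chars": len(annex_text),
        "annex_preview": annex_text[:200],
    }


# Tiny made-up document standing in for real output: two parsed references,
# plus one reference-like line that was placed in the annex.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><back>
<div type="annex"><div><p>[12] A. Author. Some title. 2019.</p></div></div>
<div type="references"><listBibl>
<biblStruct xml:id="b0"/><biblStruct xml:id="b1"/>
</listBibl></div>
</back></text></TEI>"""

summary = summarize_tei(SAMPLE_TEI)
print(summary)
```

If `reference_count` is lower than the number of entries in the PDF and `annex_preview` shows text that looks like bibliography entries, that would be consistent with the pages being labeled as annex by the segmentation model.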