Open peilin54 opened 1 year ago
Hi @peilin54 !
Sorry for the slow answer. This is certainly not common. It might be that the full content of a page is classified not as reference section but as annex, so the corresponding reference entries are overlooked. The misclassification might be related to the watermark, which might be confused with a figure element. One solution is to add a few examples of this sort of article to the training data of the segmentation model. If you can share the document with me and if it is CC-BY, I can have a look!
Hi, we are running Grobid 0.7.3 to extract references from PDFs. The PDF has line numbers on the left and a watermark in the background. Grobid was able to extract references, but it extracted fewer references than the full PDF contains. We found that Grobid skipped seemingly random pages when processing the PDF's reference section. Have you seen this type of issue before, and can you share any suggestions to make it work? Thanks!
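For anyone trying to pin down which pages are being skipped, here is a minimal sketch that counts extracted references per page from the TEI output. It assumes the TEI was generated with coordinates enabled for `biblStruct` (so each entry carries a `coords` attribute whose first field is the page number); the function name `references_per_page` is made up for this illustration and is not part of Grobid.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# TEI namespace used in Grobid's XML output
TEI_NS = "{http://www.tei-c.org/ns/1.0}"


def references_per_page(tei_xml):
    """Count bibliographic entries per PDF page (hypothetical helper).

    Assumes each <biblStruct> inside a <listBibl> has a `coords`
    attribute of the form "page,x,y,w,h;page,x,y,w,h;...".
    Entries without coordinates are counted under the key None.
    """
    root = ET.fromstring(tei_xml)
    pages = Counter()
    # Only look inside listBibl so the header's own biblStruct is ignored
    for list_bibl in root.iter(f"{TEI_NS}listBibl"):
        for bibl in list_bibl.findall(f"{TEI_NS}biblStruct"):
            coords = bibl.get("coords")
            if not coords:
                pages[None] += 1
                continue
            # The first box tells us the page where the entry starts
            first_box = coords.split(";")[0]
            pages[int(first_box.split(",")[0])] += 1
    return pages
```

Reference pages of the PDF that do not appear as keys in the returned counter are the ones where no entry was extracted, which is a useful starting point when checking whether those pages were segmented as annex instead of references.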