kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0
3.6k stars, 461 forks

Same document, different PDF files, same curl command, predictably different output. #1135

Open haykharut opened 4 months ago

haykharut commented 4 months ago

I have 2 PDF versions of a paper, which look exactly the same when inspected visually. The only difference I can detect is file size (2.2MB vs 900KB) and the fact that my PDF viewer will show a contents bar for the big file but not the small file. I am no PDF expert.

I process both files with the command below.

curl -v --form input=@./paper.pdf --form teiCoordinates=ref --form teiCoordinates=biblStruct --form teiCoordinates=figure --form teiCoordinates=persName --form teiCoordinates=formula --form segmentSentences=1 --form teiCoordinates=s https://kermitt2-grobid.hf.space/api/processFulltextDocument > ./paper.xml

The XML outputs differ. Specifically, GROBID will correctly output <graphic coords=... type='bitmap'> for all figures in the small file while it outputs the graphic coords for only 1 figure in the large file, even though it still detects the figures correctly. I am attaching the files for reproducibility.
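
One way to make this comparison repeatable is to list which figures in each TEI output carry a graphic element with coords. The following is a minimal Python sketch, not part of GROBID. The function name figures_by_graphic_coords is hypothetical. It assumes the output uses the standard TEI namespace and that tables appear as figure elements with type="table", which are skipped here.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace; xml:id lives in the predefined XML namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def figures_by_graphic_coords(tei_xml):
    """Split figure ids into (with_coords, without_coords).

    A figure counts as "with coords" when it contains at least one
    <graphic> element that has a non-empty coords attribute.
    Assumption: tables are <figure type="table"> and are skipped.
    """
    root = ET.fromstring(tei_xml)
    with_coords, without_coords = [], []
    for index, figure in enumerate(root.iterfind(".//tei:figure", TEI_NS)):
        if figure.get("type") == "table":
            continue
        # Fall back to a positional label if the figure has no xml:id.
        figure_id = figure.get(XML_ID, "figure_%d" % index)
        graphics = figure.findall(".//tei:graphic", TEI_NS)
        has_coords = any(g.get("coords") for g in graphics)
        (with_coords if has_coords else without_coords).append(figure_id)
    return with_coords, without_coords
```

Calling it on the contents of each output file, for example figures_by_graphic_coords(open("paper.xml", "rb").read()), returns the two lists of figure ids, which can then be compared between the small and the big file.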

I would appreciate it if someone could help me understand why this happens, or at least help me get started with an investigation.

paper_big.pdf paper_small.pdf

lfoppiano commented 4 months ago

Hi @haykharut, thanks for reporting this issue.

The PDF format allows almost any type of information to be embedded, including fonts and images. Images may be embedded as bitmaps or as vector graphics.

Now, although PDF documents may look good, they often smell bad :-) For your examples, I extracted the bitmaps with a different application, poppler, and got the same result: from the small PDF I could extract all 5 bitmaps, while from the big PDF I could only extract figure 1 and figure 2 (which is composed of three images). This is why Grobid does not attach the graphic tag to the other figures: there is no bitmap associated with them in the big document.
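
The poppler check can be approximated with a short script. The comment above does not say which poppler command was used. This Python sketch assumes the pdfimages tool with its -list option is installed and on the PATH, and that each data row starts with the page number, followed by the image number and the type. Both function names are hypothetical.

```python
import subprocess
from collections import Counter


def count_images_per_page(listing):
    """Count rows of type 'image' per page in `pdfimages -list` output.

    Assumed row layout: page, num, type, ... (whitespace separated).
    The header line and the dashed rule are skipped because their
    first field is not a number.
    """
    counts = Counter()
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0].isdigit() and fields[2] == "image":
            counts[int(fields[0])] += 1
    return counts


def list_pdf_images(pdf_path):
    """Run `pdfimages -list` on one PDF and count its bitmaps per page."""
    result = subprocess.run(
        ["pdfimages", "-list", pdf_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return count_images_per_page(result.stdout)
```

Running list_pdf_images on paper_small.pdf and on paper_big.pdf and comparing the two counters should show which pages lose their bitmaps in the big file. Soft masks and stencils are reported as separate row types and are deliberately not counted.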

There are other differences in these two documents, for example, paper_big has some hidden content:

image

which is not present in paper_small:

image
haykharut commented 4 months ago

@lfoppiano thanks so much for getting back. If you don't mind, I would like to ask a couple of follow-up questions.

Just to make sure I understand -- is it correct to say that in all likelihood, the larger file represents some figures as vectors and others as bitmaps?

In that case, I wonder, how can I extract the coordinates for vectors when bitmaps are missing?

Somewhat bewilderingly, the Grobid HF space processes the larger PDF file correctly. I navigated to the PDF section, selected "include figures and tables" and uploaded the larger file. I can see correctly drawn bounding boxes. However, when I inspect the Network tab of Chrome DevTools, I can see that the coordinates for some of the figures are listed under tables, not figures.

For example, the underlined tab_6 item in the attached picture corresponds to the graphic on the left-hand side. It's not a table.

At the same time, the XML file generated by the curl command I mentioned above contains no tab_6; instead, it correctly recognizes that item as a figure, even though its coordinates are missing.

Screenshot 2024-06-30 at 16 29 45