kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Wrong figure recognition #787

Open elonzh opened 3 years ago

elonzh commented 3 years ago

Example paper: https://journals.aps.org/prc/abstract/10.1103/PhysRevC.100.014306 (same as #781)

Fig 1 (missed)


Fig 2 (wrong head and figDesc)

<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0" coords="5,513.20,380.04,36.03,7.88;5,304.15,391.00,4.48,7.88;5,308.63,388.87,4.66,5.98;5,308.63,395.50,2.99,5.25;5,313.79,391.00,32.61,7.88;5,348.49,389.35,5.98,5.25;5,355.00,390.21,193.95,8.97;5,304.15,401.96,245.08,8.57;5,304.15,412.13,245.08,9.29;5,304.15,423.09,245.09,8.97;5,304.15,434.05,173.68,8.97">
    <head>B and the 1 + 1 0</head>
    <label>11</label>
    <figDesc>state of 10 B. The panels (a) are calculated with the THSR + pair wave function pair with optimized parameters. The panels (b) are obtained by using only the pairing term p with parameter c = 0. For all these calculations, β parameters are set to optimized values in the corresponding THSR + pair wave functions.</figDesc>
</figure>


Fig 3 (correct)

<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1" coords="7,53.09,230.09,101.28,7.88;7,156.69,228.44,5.98,5.25;7,163.20,228.28,123.01,9.69;7,41.14,240.26,245.09,8.97;7,41.14,252.00,104.99,7.88">
    <head>FIG. 3 .</head>
    <label>3</label>
    <figDesc>FIG. 3. Energy curve of the 10 B(3 + 0) with respect to the parameter d. The parameter c is set to be d = 1 − c. Other parameters are fixed at the optimized values.</figDesc>
</figure>
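
For anyone comparing these outputs, the `coords` attribute can be unpacked into boxes. This is a minimal sketch, assuming the layout described in the Grobid coordinates documentation, where each `;`-separated entry is `page,x,y,width,height`. The names `Box`, `parse_coords`, and `union_box` are made up for illustration and are not part of Grobid.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    page: int
    x: float
    y: float
    width: float
    height: float


def parse_coords(coords: str) -> List[Box]:
    """Split a TEI `coords` value into boxes (assumed: page,x,y,width,height)."""
    boxes = []
    for chunk in coords.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # tolerate a trailing ';'
        page, x, y, w, h = chunk.split(",")
        boxes.append(Box(int(page), float(x), float(y), float(w), float(h)))
    return boxes


def union_box(boxes: List[Box]) -> Box:
    """Smallest box enclosing all boxes; assumes they share one page."""
    x0 = min(b.x for b in boxes)
    y0 = min(b.y for b in boxes)
    x1 = max(b.x + b.width for b in boxes)
    y1 = max(b.y + b.height for b in boxes)
    return Box(boxes[0].page, x0, y0, x1 - x0, y1 - y0)
```

Applied to the `coords` of `fig_1` above, the union is only about 32 points tall, which appears consistent with the coordinates covering the caption lines rather than the graphic itself.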


v6.2.xml -> v7.0.xml

https://gist.github.com/elonzh/f4e59232ddaded31ee23735f994ea4b6/revisions

Both versions have the same issue.


kermitt2 commented 3 years ago

I think this is due to the vector graphics. There's still something I need to fix to take them into account when aggregating the figure zones. This comes from pdfalto, which now provides the coordinates of vector graphics differently than before, and I think they are mostly not used now (likely one of the reasons why figures are currently working so badly overall).

elonzh commented 3 years ago


Thanks for your work. I will try to follow up on it and contribute to this project.

kermitt2 commented 3 years ago

So indeed vector graphics are currently not considered when recognizing figures, leading to such errors (all these reported figures are vector graphics).

Some notes before I forget:

kermitt2 commented 3 years ago

For more flexibility, we should introduce in pdfalto an option to generate the SVG files but not the bitmap files, and use the SVG option by default.

See https://github.com/kermitt2/pdfalto/issues/128

To better parse SVG, we should probably use Apache Batik instead of XQueryProcessor. Batik will parse the SVG file and provide bounding boxes via element.getBBox() in VectorGraphicBoxCalculator.

Done in the Grobid branch fix-vector-graphics: XQueryProcessor was replaced by SVG parsing with Apache Batik. Bounding boxes of SVG elements are also generated by Apache Batik, leading to better SVG support. We then reuse the existing vector box aggregation method, which leads to good figure content recognition.

But this then leads to two problems:

Proposal for the second point: redesign how figures and tables are recognized by removing them from the full text model and introducing their own segmentation models. These models would start from every aggregated graphics box in the document and try to extend these boxes with a dedicated sequence labeling model that captures the surrounding blocks based on layout clues and text, as usual.
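
To make the idea of starting from aggregated graphics boxes more concrete, here is a rough sketch under stated assumptions. It is not Grobid's aggregation method and not the proposed model: `aggregate` merges rectangles that lie within an arbitrary `gap` of each other, and `candidate_blocks` only gathers the nearby text blocks that a sequence labeler might then classify. All names, the rectangle layout `(x0, y0, x1, y1)`, and the thresholds are hypothetical.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1) on one page


def near(a: Rect, b: Rect, gap: float) -> bool:
    """True if the rectangles overlap or are within `gap` on both axes."""
    return (a[0] - gap <= b[2] and b[0] - gap <= a[2]
            and a[1] - gap <= b[3] and b[1] - gap <= a[3])


def union(a: Rect, b: Rect) -> Rect:
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def aggregate(rects: List[Rect], gap: float = 5.0) -> List[Rect]:
    """Repeatedly merge nearby rectangles until no further merge is possible."""
    merged = list(rects)
    changed = True
    while changed:
        changed = False
        result: List[Rect] = []
        for r in merged:
            for i, m in enumerate(result):
                if near(r, m, gap):
                    result[i] = union(r, m)
                    changed = True
                    break
            else:
                result.append(r)
        merged = result
    return merged


def candidate_blocks(graphic: Rect, blocks: List[Rect],
                     max_dist: float = 50.0) -> List[Rect]:
    """Text blocks that overlap the graphic horizontally and sit within
    `max_dist` above or below it, ordered top to bottom."""
    picked = []
    for b in blocks:
        overlaps_x = graphic[0] < b[2] and b[0] < graphic[2]
        v_gap = max(b[1] - graphic[3], graphic[1] - b[3], 0.0)
        if overlaps_x and v_gap <= max_dist:
            picked.append(b)
    return sorted(picked, key=lambda b: b[1])
```

In this sketch the labeling step itself is left out; the candidates returned for each aggregated box would be the input sequence for whatever model decides which blocks belong to the figure or table.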

officialsuyogdixit commented 2 years ago

Same here. +1