kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Grobid fails to extract footnotes #657

Open Kiwifed0r opened 4 years ago

Kiwifed0r commented 4 years ago

Hi!

I'm trying to extract footnotes from PDFs and I'm running into issues. The resulting TEI looks fine with regard to the abstract, sections, references, etc., but the footnotes don't work at all. The footnote anchor ends up as normal text, and the footnote text either disappears completely or also ends up as normal text.

Is Grobid just not trained for footnote detection, so that I have to train my own model, or is there anything else I could try?

kermitt2 commented 4 years ago

Hello @Kiwifed0r

Grobid supports footnotes. It serializes them after the figures and before the bibliography, e.g.:

            ...
            </figure>
            <note place="foot" n="2">This condition means that we are correctly identifying 
the coordinate location of the horizon to first order. <ref type="bibr" target="#b2">3</ref> 
We use units where G = c =h = 1. <ref type="bibr" target="#b3">4</ref> For coupled 
Einstein-scalar field theory, there would be a contribution to the flux F from the matter 
fields. <ref type="bibr" target="#b4">5</ref> Our convention for the Fourier transform is 
F(k) = e iku F(u)du.</note>
            </body>
                 <back>
                     ...

Normally the content of footnotes never disappears: either it is extracted correctly as such, or extraction fails and it then usually appears as "normal" text or (worse) as a figure caption. If you see some footnote content disappearing, please file an issue with a test case so that we can reproduce the problem.
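To check quickly where footnote content ended up in a given result, something like the following could help. It is only a minimal sketch, not part of Grobid: the helper names `footnotes` and `locate` and the sample TEI string are made up for illustration, and it assumes the usual TEI namespace and that figure captions are serialized as `<figDesc>`.

```python
import xml.etree.ElementTree as ET

# Assumption: the TEI output uses the standard TEI namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def _flat_text(element):
    """Collect all text under an element (nested <ref> included), whitespace-normalized."""
    return " ".join("".join(element.itertext()).split())


def footnotes(tei_xml):
    """Return (n, text) for every <note place="foot"> in a TEI string."""
    root = ET.fromstring(tei_xml)
    return [
        (note.get("n", ""), _flat_text(note))
        for note in root.iterfind(".//tei:note[@place='foot']", TEI_NS)
    ]


def locate(tei_xml, snippet):
    """Return the tag names (note, figDesc, p) of elements whose text contains snippet."""
    root = ET.fromstring(tei_xml)
    hits = []
    for tag in ("note", "figDesc", "p"):
        for element in root.iterfind(f".//tei:{tag}", TEI_NS):
            if snippet in _flat_text(element):
                hits.append(tag)
    return hits


# Hypothetical, hand-written sample standing in for a real Grobid result.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<div><p>Body text with a stray anchor 1.</p></div>
<figure><figDesc>Figure 1. A caption.</figDesc></figure>
<note place="foot" n="2">We use units where G = c = 1. <ref type="bibr" target="#b3">4</ref> More text.</note>
</body></text></TEI>"""

if __name__ == "__main__":
    print(footnotes(SAMPLE))
    print(locate(SAMPLE, "A caption"))
```

Searching with `locate` for a phrase copied from a footnote in the PDF shows whether it was recognized as a note, merged into a paragraph, or attached to a figure caption. An empty result would be the "disappearing" case worth reporting.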

However, I think numbered footnotes are detected in only around 50% of cases. It depends a lot on the documents (detection can be perfect or miss everything), so we cannot currently consider this structure reliable. The reason is that there is very little training data for it right now. If you feel like helping with training data, it's the segmentation model that covers this structure.

Kiwifed0r commented 4 years ago

Thank you for the quick reply! I will look into creating training data and also do some more tests regarding the disappearing footnotes.