kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Incomplete teiHeader extracted for paper #520

Open bananaoomarang opened 5 years ago

bananaoomarang commented 5 years ago

I understand that Grobid will not extract perfect metadata in every case, but I think this one is interesting nonetheless.

For 10.1016/S0140-6736(99)01239-8 neither the title nor the authors are present in the teiHeader section of the document when I run it through Grobid.

Title is blank and authors look like:

<biblStruct>
        <analytic>
                <author>
                        <affiliation key="aff0">
                                <orgName type="department">Department of Community Child Health</orgName>
                                <orgName type="institution" key="instit1">Royal Free Campus</orgName>
                                <orgName type="institution" key="instit2">Royal Free and University College Medical School</orgName>
                                <orgName type="institution" key="instit3">University College London</orgName>
                                <address>
                                        <postCode>NW3 2QG</postCode>
                                        <settlement>London</settlement>
                                        <country key="GB">UK</country>
                                </address>
                        </affiliation>
                </author>
                <author>
                        <affiliation key="aff1">
                                <orgName type="department">Department of Statistics</orgName>
                                <orgName type="laboratory">Immunisation Division, Public Health Laboratory Service Communicable Disease Surveillance Centre, London (E Miller FRCPath, P A Waight BSc);</orgName>
                                <orgName type="institution">Open University</orgName>
                        </affiliation>
                </author>
        </analytic>
        <monogr>
                <imprint>
                        <date/>
                </imprint>
        </monogr>
        <note>Summary</note>
</biblStruct>

But the document itself seems fairly clear to the human eye, so it may be a good test case!
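For anyone who wants to flag this kind of output automatically, here is a minimal sketch. It is only an illustration and not part of GROBID: the helper name `header_gaps` and the keys it returns are made up for this example. It compares local tag names, so it should work whether or not the TEI namespace is declared on the elements.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an optional '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def header_gaps(bibl_xml):
    """Summarise what is missing from a <biblStruct> fragment (hypothetical helper)."""
    root = ET.fromstring(bibl_xml)
    # A title only counts if it carries non-blank text.
    titles = [el for el in root.iter()
              if local(el.tag) == "title" and (el.text or "").strip()]
    authors = [el for el in root.iter() if local(el.tag) == "author"]
    # Authors that carry an affiliation but no <persName>, as in the output above.
    unnamed = [a for a in authors
               if not any(local(c.tag) == "persName" for c in a.iter())]
    return {
        "has_title": bool(titles),
        "authors": len(authors),
        "authors_without_name": len(unnamed),
    }
```

Calling `header_gaps(...)` on a fragment like the one above should report no title and no named authors, which makes it easy to collect similar failing PDFs as test cases.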

kermitt2 commented 5 years ago

Hi @bananaoomarang! Thanks, this is an interesting case, related to #373 and #136.

The problems are all due to the PDF stream order. You can see the issue simply by trying to highlight the text of the PDF in a web browser's PDF viewer.

Using the reading order solves the problem. It is implemented but not activated by default, because I first need to retrain the header model, and for that I need to put the header model's training data in reading order rather than the current PDF stream order...
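To illustrate the difference between the two orders, here is a toy sketch. It is not how GROBID computes reading order: the `Token` type, the coordinates, and the `line_tolerance` parameter are assumptions made up for the example, and it only handles a single-column page. Tokens arrive in an arbitrary stream order and are regrouped into lines from top to bottom, then left to right.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    text: str
    x: float  # horizontal position, from the left edge of the page
    y: float  # vertical position, from the top edge of the page


def reading_order(tokens: List[Token], line_tolerance: float = 2.0) -> List[Token]:
    """Reorder tokens top-to-bottom, then left-to-right (single column only)."""
    lines: List[List[Token]] = []
    for tok in sorted(tokens, key=lambda t: (t.y, t.x)):
        # Tokens whose y is close to the first token of the current line
        # are treated as belonging to that line.
        if lines and abs(lines[-1][0].y - tok.y) <= line_tolerance:
            lines[-1].append(tok)
        else:
            lines.append([tok])
    return [t for line in lines for t in sorted(line, key=lambda t: t.x)]


# Stream order as it might be stored in the file: a side box first, title last.
stream = [
    Token("Summary", 50, 300),
    Token("Taylor", 120, 100),
    Token("Autism", 50, 60),
    Token("Brent", 50, 100),
]
ordered = [t.text for t in reading_order(stream)]
```

A header model that sees the tokens in stream order starts with "Summary", which is roughly what happened in the first output above; after reordering, the title and author names come first.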

Using the reading order (which can now be done by un-commenting this line: https://github.com/kermitt2/grobid/blob/master/grobid-core/src/main/java/org/grobid/core/document/DocumentSource.java#L97 ), we then get:

...
                <biblStruct>
                    <analytic>
                        <author>
                            <persName
                                xmlns="http://www.tei-c.org/ns/1.0">
                                <forename type="first">Brent</forename>
                                <surname>Taylor</surname>
                            </persName>
                        </author>
                        <author>
                            <persName
                                xmlns="http://www.tei-c.org/ns/1.0">
                                <forename type="first">Elizabeth</forename>
                                <surname>Miller</surname>
                            </persName>
                        </author>
                        <author>
                            <persName
                                xmlns="http://www.tei-c.org/ns/1.0">
                                <forename type="first">Paddy</forename>
                                <surname>Farrington</surname>
                            </persName>
                        </author>
                        <author>
                            <persName
                                xmlns="http://www.tei-c.org/ns/1.0">
                                <forename type="first">Maria-Christina</forename>
                                <surname>Petropoulos</surname>
                            </persName>
                        </author>
                        <author>
                            <persName
                                xmlns="http://www.tei-c.org/ns/1.0">
                                <forename type="first">Isabelle</forename>
                                <surname>Favot-Mayaud</surname>
                            </persName>
                        </author>
                        <author>
                            <persName
                                xmlns="http://www.tei-c.org/ns/1.0">
                                <forename type="first">Jun</forename>
                                <surname>Li</surname>
                            </persName>
                        </author>
                        <author>
                            <persName
                                xmlns="http://www.tei-c.org/ns/1.0">
                                <forename type="first">Pauline</forename>
                                <forename type="middle">A</forename>
                                <surname>Waight</surname>
                            </persName>
                        </author>
                        <title level="a" type="main">Autism and measles, mumps, and rubella vaccine: no epidemiological evidence for a causal association</title>
                    </analytic>
                    <monogr>
                        <imprint>
                            <date/>
                        </imprint>
                    </monogr>
                    <note>ARTICLES</note>
                </biblStruct>
            </sourceDesc>
        </fileDesc>
...
        <profileDesc>
            <abstract>
                <div
                    xmlns="http://www.tei-c.org/ns/1.0">
                    <p>Background We undertook an epidemiological study to investigate whether measles, mumps, and rubella (MMR) vaccine may be causally associated with autism.</p>
                </div>
            </abstract>
        </profileDesc>

which is much better, and header consolidation will also work.

Putting the header model's training data in reading order should then fix the rest (abstract and affiliations). So GROBID is on track to fix this whole range of reading-order issues; it's mostly a matter of development time to get it finalised.