kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Incorrectly parsing author with middle initial (and all authors following it) in reference when it is the first author #305

Open puggimer opened 6 years ago

puggimer commented 6 years ago

MeltdownPrime.pdf: see references 13, 14 and 15 in the attached document. Yatin A. Manerkar is parsed as forename "A", surname "Yatin". The parser then pairs the first name of each following author with the surname of the previous one, so Daniel Lustig shows up as Daniel Manerkar, and so on.

This appears to happen only when the first author in the list has a middle initial. The same author appears again in reference 17, but as the second author in the list, and there the name is correctly parsed into first name, middle name and surname.

To show the full example, here is the reference citation, which has 4 authors:

Yatin A. Manerkar, Daniel Lustig, Margaret Martonosi, and Michael Pellauer. RTLCheck: Verifying the memory consistency of rtl designs. In 50th International Symposium on Microarchitecture (MICRO), 2017.

The generated TEI for it is:

<biblStruct xml:id="b13">
    <analytic>
        <title level="a" type="main">RTLCheck: Verifying the memory consistency of rtl designs</title>
        <author>
            <persName
                xmlns="http://www.tei-c.org/ns/1.0">
                <forename type="first">A</forename>
                <surname>Yatin</surname>
            </persName>
        </author>
        <author>
            <persName
                xmlns="http://www.tei-c.org/ns/1.0">
                <forename type="first">Daniel</forename>
                <surname>Manerkar</surname>
            </persName>
        </author>
        <author>
            <persName
                xmlns="http://www.tei-c.org/ns/1.0">
                <forename type="first">Margaret</forename>
                <surname>Lustig</surname>
            </persName>
        </author>
        <author>
            <persName
                xmlns="http://www.tei-c.org/ns/1.0">
                <forename type="first">Michael</forename>
                <surname>Martonosi</surname>
            </persName>
        </author>
        <author>
            <persName
                xmlns="http://www.tei-c.org/ns/1.0">
                <surname>Pellauer</surname>
            </persName>
        </author>
    </analytic>
    <monogr>
        <title level="m">50th International Symposium on Microarchitecture</title>
        <imprint>
            <date type="published" when="2017" />
        </imprint>
    </monogr>
</biblStruct>
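To make the shift easier to see, here is a minimal Python sketch that reads the author list from the TEI above and prints it next to the author count of the citation. It is not part of GROBID; the names `parsed_authors`, `TEI_ANALYTIC` and `EXPECTED` are made up for illustration, and the XML is the author block from the output above with whitespace condensed.

```python
import xml.etree.ElementTree as ET

# Namespace declared on each <persName> in the output above.
TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Author list copied from the TEI output above (whitespace condensed).
TEI_ANALYTIC = """
<analytic>
  <author><persName xmlns="http://www.tei-c.org/ns/1.0"><forename type="first">A</forename><surname>Yatin</surname></persName></author>
  <author><persName xmlns="http://www.tei-c.org/ns/1.0"><forename type="first">Daniel</forename><surname>Manerkar</surname></persName></author>
  <author><persName xmlns="http://www.tei-c.org/ns/1.0"><forename type="first">Margaret</forename><surname>Lustig</surname></persName></author>
  <author><persName xmlns="http://www.tei-c.org/ns/1.0"><forename type="first">Michael</forename><surname>Martonosi</surname></persName></author>
  <author><persName xmlns="http://www.tei-c.org/ns/1.0"><surname>Pellauer</surname></persName></author>
</analytic>
"""

# Authors as written in the citation text.
EXPECTED = ["Yatin A. Manerkar", "Daniel Lustig", "Margaret Martonosi", "Michael Pellauer"]


def parsed_authors(xml_text):
    """Return (forenames, surname) for each persName, in document order."""
    root = ET.fromstring(xml_text)
    authors = []
    for pers in root.iter(TEI_NS + "persName"):
        forenames = [f.text for f in pers.findall(TEI_NS + "forename")]
        surname = pers.findtext(TEI_NS + "surname")
        authors.append((" ".join(forenames), surname))
    return authors


if __name__ == "__main__":
    authors = parsed_authors(TEI_ANALYTIC)
    print(f"expected {len(EXPECTED)} authors, parsed {len(authors)}")
    for forenames, surname in authors:
        print(f"forename={forenames!r} surname={surname!r}")
```

The citation has four authors but the TEI has five entries: the first surname slot is taken by "Yatin", every later surname lands one author too late, and "Pellauer" is left with no forename.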

de-code commented 3 years ago

Just to add another similar error case.

From the bioRxiv training dataset, 099754v1 (10.1101/099754).

[1] Babak Alipanahi, Andrew Delong, Matthew T Weirauch, and Brendan J Frey. Predicting the sequence specificities of dna-and rna-binding proteins by deep learning. Nature biotechnology, 2015.

<biblStruct xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink" xml:id="b0">
    <analytic>
        <title level="a" type="main">Predicting the sequence specificities of dna-and rna-binding proteins by deep learning</title>
        <author>
            <persName xmlns="http://www.tei-c.org/ns/1.0"><forename type="first">Babak</forename><surname>Alipanahi</surname></persName>
        </author>
        <author>
            <persName xmlns="http://www.tei-c.org/ns/1.0"><forename type="first">Andrew</forename><surname>Delong</surname></persName>
        </author>
        <author>
            <persName xmlns="http://www.tei-c.org/ns/1.0"><forename type="first">T</forename><surname>Matthew</surname></persName>
        </author>
        <author>
            <persName xmlns="http://www.tei-c.org/ns/1.0"><forename type="first">Brendan</forename><forename type="middle">J</forename><surname>Weirauch</surname></persName>
        </author>
        <author>
            <persName xmlns="http://www.tei-c.org/ns/1.0"><surname>Frey</surname></persName>
        </author>
    </analytic>
    <monogr>
        <title level="j">Nature biotechnology</title>
        <imprint>
            <date type="published" when="2015"/>
        </imprint>
    </monogr>
    <note type="raw_reference">Babak Alipanahi, Andrew Delong, Matthew T Weirauch, and Brendan J Frey. Predicting the sequence specificities of dna-and rna-binding proteins by deep learning. Nature biotechnology, 2015.</note>
</biblStruct>
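Since this output includes the `raw_reference` note, cases like this can be flagged after the fact by checking whether each parsed name actually occurs as a contiguous string in the raw reference. The following is only a rough Python sketch of that idea, not something GROBID provides; `suspicious_authors` and `BIBL` are hypothetical names, and the XML is a condensed copy of the output above (title and `monogr` omitted).

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Condensed copy of the biblStruct above: authors and raw_reference note only.
BIBL = """
<biblStruct xmlns="http://www.tei-c.org/ns/1.0" xml:id="b0">
  <analytic>
    <author><persName><forename type="first">Babak</forename><surname>Alipanahi</surname></persName></author>
    <author><persName><forename type="first">Andrew</forename><surname>Delong</surname></persName></author>
    <author><persName><forename type="first">T</forename><surname>Matthew</surname></persName></author>
    <author><persName><forename type="first">Brendan</forename><forename type="middle">J</forename><surname>Weirauch</surname></persName></author>
    <author><persName><surname>Frey</surname></persName></author>
  </analytic>
  <note type="raw_reference">Babak Alipanahi, Andrew Delong, Matthew T Weirauch, and Brendan J Frey. Predicting the sequence specificities of dna-and rna-binding proteins by deep learning. Nature biotechnology, 2015.</note>
</biblStruct>
"""


def suspicious_authors(bibl_xml):
    """Return parsed names that do not occur verbatim in the raw reference."""
    root = ET.fromstring(bibl_xml)
    raw = root.findtext(f".//{TEI_NS}note[@type='raw_reference']") or ""
    flagged = []
    for pers in root.iter(TEI_NS + "persName"):
        parts = [f.text for f in pers.findall(TEI_NS + "forename")]
        parts.append(pers.findtext(TEI_NS + "surname"))
        # Join forename(s) and surname in the order the citation would use.
        name = " ".join(p for p in parts if p)
        if name not in raw:
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    for name in suspicious_authors(BIBL):
        print("not found in raw reference:", name)
```

On this example the check flags "T Matthew" and "Brendan J Weirauch". It does not flag the surname-only "Frey", because that string does occur in the raw reference, and initials written with a period ("A.") or surname-first citation styles would need normalization first, so treat it as a quick sanity check rather than a validator.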