kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Missed second names in bibtex and decapitalization of surnames #1011

Open oborin1 opened 1 year ago

oborin1 commented 1 year ago

I've run into unexpected behavior in your wonderful software: it returns citation records with decapitalized surnames, which is wrong for surnames such as McDonald (McDonald != Mcdonald). This affects both the TEI and BibTeX formats. A second issue is that second (middle) names are dropped from the BibTeX output.

I'm using Docker to run the server as suggested in the documentation. Both issues can be reproduced with the following command: curl -X POST -H "Accept: application/x-bibtex" -d consolidateCitations=1 -d "citations=O Akinboboye, FS McDonald, Letter by Akinboboye and McDonald Regarding Article" localhost:8070/api/processCitation

Part of output string: author = {Akinboboye, Olakunle and Mcdonald, Furman},
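For anyone who wants to script the reproduction, here is a minimal Python sketch of the same request as the curl command above. The endpoint, form fields and Accept header are taken from that command. The function name and the `base_url` default are only illustrative, and the sketch builds the request without sending it.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_process_citation_request(citation, base_url="http://localhost:8070", consolidate=1):
    """Build (but do not send) the POST request equivalent to the curl call.

    The function name and base_url default are illustrative assumptions;
    the path, form fields and Accept header mirror the curl command.
    """
    form = urlencode(
        {"consolidateCitations": str(consolidate), "citations": citation}
    ).encode("utf-8")
    return Request(
        url=f"{base_url}/api/processCitation",
        data=form,
        headers={"Accept": "application/x-bibtex"},
        method="POST",
    )


req = build_process_citation_request(
    "O Akinboboye, FS McDonald, Letter by Akinboboye and McDonald Regarding Article"
)
# To actually send it against a running server:
#   from urllib.request import urlopen
#   print(urlopen(req).read().decode("utf-8"))
```

Changing the Accept header to application/xml should return the TEI form instead, assuming the server negotiates content as the curl command suggests.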

kermitt2 commented 1 year ago

Thank you very much @oborin1 for the issue ! (or double issue)

Indeed from citations=O Akinboboye, FS McDonald, Letter by Akinboboye and McDonald Regarding Article, we would expect as correct answer:

<biblStruct >
    <monogr>
        <title level="m" type="main">Letter by Akinboboye and McDonald Regarding Article</title>
        <author>
            <persName>
                <forename type="first">O</forename>
                <surname>Akinboboye</surname>
            </persName>
        </author>
        <author>
            <persName>
                <forename type="first">F</forename>
                <forename type="middle">S</forename>
                <surname>McDonald</surname>
            </persName>
        </author>
        <imprint/>
    </monogr>
</biblStruct>
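To check the forename, middle name and surname values programmatically, a small Python sketch like the one below can read them out of a biblStruct fragment. It assumes the fragment has no XML namespace, as in the snippet shown here; the function name is hypothetical and this is not part of GROBID.

```python
import xml.etree.ElementTree as ET


def persons_from_biblstruct(tei_xml):
    """Return (first, middle, surname) tuples for every persName element.

    Assumes an un-namespaced fragment like the one quoted in this thread.
    """
    root = ET.fromstring(tei_xml)
    persons = []
    for pers in root.iter("persName"):
        forenames = pers.findall("forename")
        first = " ".join(f.text or "" for f in forenames if f.get("type") == "first")
        middle = " ".join(f.text or "" for f in forenames if f.get("type") == "middle")
        surname = pers.findtext("surname", default="")
        persons.append((first, middle, surname))
    return persons


sample = """
<biblStruct>
  <monogr>
    <author><persName>
      <forename type="first">O</forename><surname>Akinboboye</surname>
    </persName></author>
    <author><persName>
      <forename type="first">F</forename><forename type="middle">S</forename>
      <surname>McDonald</surname>
    </persName></author>
    <imprint/>
  </monogr>
</biblStruct>
"""
print(persons_from_biblstruct(sample))
```

Comparing these tuples with the surname in the input string makes the decapitalization easy to spot in a batch of citations.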

With the consolidate citation option selected as in your curl query, the forenames are fine, but we indeed still have the problem of decapitalization for McDonald:

<biblStruct >
    <analytic>
        <title level="a" type="main">Letter by Akinboboye and McDonald Regarding Article, “A Blueprint for Productive Maintenance of Certification, but Is the American Board of Internal Medicine up to the Challenge?”</title>
        <author>
            <persName>
                <forename type="first">Olakunle</forename>
                <surname>Akinboboye</surname>
            </persName>
        </author>
        <author>
            <persName>
                <forename type="first">Furman</forename>
                <forename type="middle">S</forename>
                <surname>Mcdonald</surname>
            </persName>
        </author>
        <idno type="DOI">10.1161/circoutcomes.121.007881</idno>
    </analytic>
    <monogr>
        <title level="j">Circulation: Cardiovascular Quality and Outcomes</title>
        <title level="j" type="abbrev">Circ: Cardiovascular Quality and Outcomes</title>
        <idno type="ISSN">1941-7713</idno>
        <idno type="ISSNe">1941-7705</idno>
        <imprint>
            <biblScope unit="volume">14</biblScope>
            <biblScope unit="issue">6</biblScope>
            <date type="published" when="2021-06" />
            <publisher>Ovid Technologies (Wolters Kluwer Health)</publisher>
        </imprint>
    </monogr>
</biblStruct>

While in the Crossref record used for consolidation the capitalization is correct (screenshot of the Crossref record omitted).

I think it's easy to fix; the current post-processing for surnames is a bit too simplistic.
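One possible shape for a less simplistic post-processing step is sketched below in Python. It is not GROBID's actual code, and the function name is hypothetical. The idea is to re-capitalize a surname only when it is entirely upper case or entirely lower case, and to leave mixed-case input such as McDonald untouched.

```python
import re


def normalize_surname(surname):
    """Re-capitalize only surnames that carry no casing information.

    Mixed-case input (e.g. "McDonald", "van der Waals") is returned as is,
    because its casing was presumably chosen deliberately.
    """
    s = surname.strip()
    if not (s.isupper() or s.islower()):
        return s
    # Capitalize each run of letters, so "O'REILLY" keeps its apostrophe.
    return re.sub(r"[^\W\d_]+", lambda m: m.group(0).capitalize(), s)


for name in ("McDonald", "akinboboye", "O'REILLY", "van der Waals"):
    print(name, "->", normalize_surname(name))
```

The limitation is that an all-caps source such as MCDONALD still comes out as Mcdonald, since the internal capital cannot be recovered from that input alone.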

oborin1 commented 1 year ago

Thanks for your answer, @kermitt2! There are some other strings resulting in the same (de)capitalization issues (I expect the same for surnames starting with la, von, van, de, de la, von der, etc.):

J. D. van der Waals, Thermodynamische Theorie der Kapillarität unter Voraussetzung stetiger Dichteänderung

N.M. Kinkaid, O.M. O'Reilly, P. Papadopoulos Automotive disc brake squeal

F. A. Kassan-ogly, One-dimensional Ising model with next-nearest-neighbour interaction in magnetic field

kermitt2 commented 1 year ago

Thank you @oborin1 ! Indeed this needs to be reviewed. It's something I have observed before, but we can also find the opposite case, e.g. "Van Ness", "De La Hoya", "De Rossi", etc., so normalizing all these names and variants always scares me a bit :D
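Because "van Ness" and "Van Ness" are both legitimate, one way to avoid guessing is to take the casing from the raw citation string whenever the surname occurs there. The Python sketch below illustrates that idea. The function name is hypothetical, and it is a suggestion rather than how GROBID currently behaves.

```python
import re


def restore_casing_from_raw(surname, raw_citation):
    """Return the surname as it is written in the raw citation, if present.

    Matching is case-insensitive and anchored on non-word boundaries;
    when nothing matches, the surname is returned unchanged.
    """
    pattern = r"(?<!\w)" + re.escape(surname) + r"(?!\w)"
    match = re.search(pattern, raw_citation, flags=re.IGNORECASE)
    return match.group(0) if match else surname


raw = "O Akinboboye, FS McDonald, Letter by Akinboboye and McDonald Regarding Article"
print(restore_casing_from_raw("Mcdonald", raw))
```

This only helps when the raw string is itself correctly cased. For all-caps reference lists, the consolidated record would be the better source.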

oborin1 commented 1 year ago

A further extension of the issue concerns digraphs, which often occur in transliterated middle names. An example string leading to unexpected behavior in Process Citation: A. Yu. Aleksandrov, A. A. Tikhonov, Uniaxial Attitude Stabilization of a Rigid Body under Conditions of Nonstationary Perturbations with Zero Mean Values. The same string is correctly recognized and consolidated without "Yu.". I think such cases simply did not appear in the training set; perhaps I have to prepare my own set. A similar problem occurs with apostrophes used in transliterations of Russian surnames, as in "Rus'yanova N. D. Uglekhimiya. M. : Nauka, 2003. 316 s."

oborin1 commented 1 year ago

One more remark on the BibTeX output produced by GROBID: there is a discrepancy between how authors' names and editors' names are written, as can be seen below. I personally prefer the "von Last, Jr, First" form, where First means a person's first name and any further middle names (or their abbreviations), the Jr part is optional, and "and" connects multiple persons, as in the author field. Authors and editors are treated uniformly in TEI-XML, so as far as I understand this is a bug in the BibTeX export.

@inbook{1,
  author = {Berger, Thomas and Reis, Timo and Trenn, Stephan},
  title = {Observability of Linear Differential-Algebraic Systems: A Survey},
  booktitle = {Surveys in Differential-Algebraic Equations IV},
  publisher = {Springer International Publishing},
  editor = {A. Ilchmann, T. Reis},
  date = {2017},
  year = {2017},
  pages = {161-219},
  doi = {10.1007/978-3-319-46618-7_4},
  raw = {2. Berger Th., Reis T., Trenn S. Observability of linear differential-algebraic systems --- a survey. Surveys in Differential-Algebraic Equations IV /Ed. A. Ilchmann, T. Reis. Springer Editors, 2017. P. 161--219.}
}
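To make the requested format concrete, here is a small Python sketch of the "von Last, Jr, First" convention applied uniformly to authors and editors. The function names and the dictionary keys are hypothetical. It only illustrates the target output and is not the GROBID implementation.

```python
def format_bibtex_name(last, first="", middle="", von="", jr=""):
    """Format one person as "von Last, Jr, First Middle" (Jr and von optional)."""
    family = f"{von} {last}".strip()
    given = " ".join(part for part in (first, middle) if part)
    parts = [family]
    if jr:
        parts.append(jr)
    if given:
        parts.append(given)
    return ", ".join(parts)


def format_bibtex_name_list(people):
    """Join several persons with " and ", as BibTeX expects in author/editor."""
    return " and ".join(format_bibtex_name(**person) for person in people)


editors = [
    {"last": "Ilchmann", "first": "A."},
    {"last": "Reis", "first": "T."},
]
print("editor = {" + format_bibtex_name_list(editors) + "},")
print(format_bibtex_name("McDonald", first="Furman", middle="S"))
```

With this form, the editor field of the entry above would read "Ilchmann, A. and Reis, T.", and the middle initial of Furman S McDonald would be kept in the author field.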

oborin1 commented 11 months ago

@kermitt2, I have a further question regarding this issue: Is https://github.com/kermitt2/grobid/blob/master/grobid-core/src/main/java/org/grobid/core/data/BiblioItem.java, line 1959, the right place to insert getMiddleName after a space in order to get the missing second (middle) names into the BibTeX output? And do I need to change lines 2001-2004 to obtain the same form of record for editors?