Open oborin1 opened 1 year ago
Thank you very much @oborin1 for the issue! (or double issue)
Indeed, from citations=O Akinboboye, FS McDonald, Letter by Akinboboye and McDonald Regarding Article, we would expect as the correct answer:
<biblStruct >
<monogr>
<title level="m" type="main">Letter by Akinboboye and McDonald Regarding Article</title>
<author>
<persName>
<forename type="first">O</forename>
<surname>Akinboboye</surname>
</persName>
</author>
<author>
<persName>
<forename type="first">F</forename>
<forename type="middle">S</forename>
<surname>McDonald</surname>
</persName>
</author>
<imprint/>
</monogr>
</biblStruct>
With the consolidate citation option selected as in your curl query, the forenames are fine, but we indeed still have the decapitalization problem for McDonald:
<biblStruct >
<analytic>
<title level="a" type="main">Letter by Akinboboye and McDonald Regarding Article, “A Blueprint for Productive Maintenance of Certification, but Is the American Board of Internal Medicine up to the Challenge?”</title>
<author>
<persName>
<forename type="first">Olakunle</forename>
<surname>Akinboboye</surname>
</persName>
</author>
<author>
<persName>
<forename type="first">Furman</forename>
<forename type="middle">S</forename>
<surname>Mcdonald</surname>
</persName>
</author>
<idno type="DOI">10.1161/circoutcomes.121.007881</idno>
</analytic>
<monogr>
<title level="j">Circulation: Cardiovascular Quality and Outcomes</title>
<title level="j" type="abbrev">Circ: Cardiovascular Quality and Outcomes</title>
<idno type="ISSN">1941-7713</idno>
<idno type="ISSNe">1941-7705</idno>
<imprint>
<biblScope unit="volume">14</biblScope>
<biblScope unit="issue">6</biblScope>
<date type="published" when="2021-06" />
<publisher>Ovid Technologies (Wolters Kluwer Health)</publisher>
</imprint>
</monogr>
</biblStruct>
In the Crossref record used for consolidation, however, the capitalization is correct.
I think it's easy to fix: the current post-processing for surnames is a bit too simplistic.
Thanks for your answer, @kermitt2! There are some other strings resulting in the same (de)capitalization issues (I expect the same for surnames starting with la, von, van, de, de la, von der, etc.):
J. D. van der Waals, Thermodynamische Theorie der Kapillarität unter Voraussetzung stetiger Dichteänderung
N.M. Kinkaid, O.M. O'Reilly, P. Papadopoulos Automotive disc brake squeal
F. A. Kassan-ogly, One-dimensional Ising model with next-nearest-neighbour interaction in magnetic field
Thank you @oborin1! Indeed, this needs to be reviewed. It's something I have observed before, but we can also find the opposite case, e.g. "Van Ness", "De La Hoya", "De Rossi", etc., so normalizing all these names and variants always scares me a bit :D
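To make the trade-off concrete, here is a minimal Python sketch, not GROBID's actual code. It assumes the current post-processing behaves roughly like lowercasing everything after the first letter, and contrasts that with a conservative rule that only re-cases surnames written entirely in upper case. Both function names are made up for illustration.

```python
def naive_capitalize(surname: str) -> str:
    """Assumed problematic behavior: everything after the first letter is lowercased."""
    return surname.capitalize()


def conservative_normalize(surname: str) -> str:
    """Re-case only all-uppercase input; trust any mixed-case input as given.

    This keeps "McDonald", "van der Waals" and "Van Ness" untouched.
    An all-caps "MCDONALD" still becomes "Mcdonald", because the inner
    capital cannot be recovered from the string alone.
    """
    s = surname.strip()
    if s.isupper():
        # str.title() restarts capitalization after hyphens and apostrophes.
        return s.title()
    return s


for name in ["McDonald", "van der Waals", "Van Ness", "O'REILLY", "Kassan-ogly"]:
    print(name, "->", naive_capitalize(name), "|", conservative_normalize(name))
```

The conservative rule never decides whether a particle such as "van" or "De" should be capitalized; it simply keeps whatever the source (or the Crossref record) provides, which sidesteps the "van der Waals" versus "Van Ness" ambiguity.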
A further extension of the issue would be digraphs, which are often met in transliterated middle names. An example string leading to unexpected behavior for Process Citation:
A. Yu. Aleksandrov, A. A. Tikhonov, Uniaxial Attitude Stabilization of a Rigid Body under Conditions of Nonstationary Perturbations with Zero Mean Values
Without "Yu." this string would be correctly recognized and consolidated. I think those cases just did not appear in the training set; perhaps I have to prepare my own set. A similar problem occurred with apostrophes used in transliterations of Russian surnames, as in "Rus'yanova N. D. Uglekhimiya. M. : Nauka, 2003. 316 s."
One more remark on the BibTeX output produced by GROBID: there is a discrepancy between how authors' names and editors' names are written, as can be seen below. I personally prefer the "von Last, Jr, First" form, where First means an author's first name and any further middle names (or abbreviations), the Jr part is optional, and "and" connects multiple persons as in the author field. Authors and editors are treated uniformly in TEI-XML, so as far as I understand this is a bug in the BibTeX method.
@inbook{1,
  author = {Berger, Thomas and Reis, Timo and Trenn, Stephan},
  title = {Observability of Linear Differential-Algebraic Systems: A Survey},
  booktitle = {Surveys in Differential-Algebraic Equations IV},
  publisher = {Springer International Publishing},
  editor = {A. Ilchmann, T. Reis},
  date = {2017},
  year = {2017},
  pages = {161-219},
  doi = {10.1007/978-3-319-46618-7_4},
  raw = {2. Berger Th., Reis T., Trenn S. Observability of linear differential-algebraic systems --- a survey. Surveys in Differential-Algebraic Equations IV /Ed. A. Ilchmann, T. Reis. Springer Editors, 2017. P. 161--219.}
}
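For illustration, the "von Last, Jr, First" form applied uniformly to authors and editors could look like the Python sketch below. It is not taken from BiblioItem.java; the `Person` record and both helper functions are hypothetical names used only to show the intended output.

```python
from dataclasses import dataclass


@dataclass
class Person:
    """Hypothetical name record; the field names are illustrative, not GROBID's."""
    first: str
    last: str
    middle: str = ""
    von: str = ""
    jr: str = ""


def bibtex_name(p: Person) -> str:
    """Format one person as "von Last, Jr, First Middle"; empty parts are skipped."""
    last = " ".join(x for x in (p.von, p.last) if x)
    given = " ".join(x for x in (p.first, p.middle) if x)
    parts = [last]
    if p.jr:
        parts.append(p.jr)
    if given:
        parts.append(given)
    return ", ".join(parts)


def bibtex_name_list(people) -> str:
    """Join several persons with " and ", for the author and editor fields alike."""
    return " and ".join(bibtex_name(p) for p in people)


authors = [Person("Olakunle", "Akinboboye"), Person("Furman", "McDonald", middle="S")]
editors = [Person("A.", "Ilchmann"), Person("T.", "Reis")]
print("author = {" + bibtex_name_list(authors) + "},")
print("editor = {" + bibtex_name_list(editors) + "},")
```

Using one formatter for both fields keeps the middle name in the author field and writes the editors from the entry above as "Ilchmann, A. and Reis, T." instead of the comma-separated "A. Ilchmann, T. Reis".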
@kermitt2, I have a further question regarding this issue: is line 1959 of https://github.com/kermitt2/grobid/blob/master/grobid-core/src/main/java/org/grobid/core/data/BiblioItem.java the right place to insert getMiddleName after a space, so that the BibTeX output includes the missing middle names that I need? And must I change lines 2001-2004 to obtain the same form of record for editors?
I've run into unexpected behavior in your wonderful software: it returns citation records with decapitalized surnames, which is wrong for surnames like McDonald (McDonald != Mcdonald). This concerns both the TEI and BibTeX formats. Another issue is that middle names are removed from the BibTeX output strings.
I'm using Docker to run the server as suggested in the documentation. Both issues can be reproduced with the following command:
curl -X POST -H "Accept: application/x-bibtex" -d consolidateCitations=1 -d "citations=O Akinboboye, FS McDonald, Letter by Akinboboye and McDonald Regarding Article" localhost:8070/api/processCitation
Part of the output string: author = {Akinboboye, Olakunle and Mcdonald, Furman},
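For anyone who wants to script this check, the same request can be sketched in Python with only the standard library. The URL, header, and form fields mirror the curl command; the function names and the regular expression for the author field are assumptions made for this example, and the regex only handles the simple unnested form shown above.

```python
import re
import urllib.parse
import urllib.request

API_URL = "http://localhost:8070/api/processCitation"  # same endpoint as the curl call


def build_request(citation: str) -> urllib.request.Request:
    """Build the form-encoded POST that the curl command sends."""
    body = urllib.parse.urlencode(
        {"consolidateCitations": "1", "citations": citation}
    ).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"Accept": "application/x-bibtex"},
        method="POST",
    )


def fetch_bibtex(citation: str) -> str:
    """Send the request; needs a GROBID server listening on localhost:8070."""
    with urllib.request.urlopen(build_request(citation)) as response:
        return response.read().decode("utf-8")


def surnames_from_bibtex(bibtex: str) -> list:
    """Extract surnames from a simple 'author = {Last, First and ...}' field."""
    match = re.search(r"author\s*=\s*\{([^}]*)\}", bibtex)
    if not match:
        return []
    return [name.split(",")[0].strip() for name in match.group(1).split(" and ")]


sample = "author = {Akinboboye, Olakunle and Mcdonald, Furman},"
print(surnames_from_bibtex(sample))
```

Calling `surnames_from_bibtex(fetch_bibtex(...))` against a running server and comparing the result with the expected "McDonald" would flag the decapitalization automatically.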