kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Support for biblio-glutton 0.3 #1086

Closed by kermitt2 2 months ago

kermitt2 commented 9 months ago

This PR adds support for the latest version of biblio-glutton (0.3), which extends bibliographical reference matching beyond CrossRef records with a DOI to the HAL archive (around 3.5M records).

In practice, consolidation can now resolve a raw bibliographical reference against HAL records when it is not present in CrossRef. A HAL ID is also added when a record matched by DOI is also present on HAL.

For example, with the PDF https://hal.science/hal-04303155v2, the bibliographical reference section contains several consolidated entries with a HAL ID and no DOI:

                    <biblStruct xml:id="b14">
                        <analytic>
                            <title level="a" type="main">Grafting of nitrophenyl groups on carbon and metallic surfaces without electrochemical induction</title>
                            <author>
                                <persName>
                                    <forename type="first">A</forename>
                                    <surname>Adenier</surname>
                                </persName>
                            </author>
                            <author>
                                <persName>
                                    <forename type="first">E</forename>
                                    <surname>Cabet-Deliry</surname>
                                </persName>
                            </author>
                            <author>
                                <persName>
                                    <forename type="first">A</forename>
                                    <surname>Chaussé</surname>
                                </persName>
                            </author>
                            <author>
                                <persName>
                                    <forename type="first">S</forename>
                                    <surname>Griveau</surname>
                                </persName>
                            </author>
                            <author>
                                <persName>
                                    <forename type="first">Florian</forename>
                                    <surname>Mercier</surname>
                                </persName>
                            </author>
                            <author>
                                <persName>
                                    <forename type="first">J</forename>
                                    <surname>Pinson</surname>
                                </persName>
                            </author>
                            <author>
                                <persName>
                                    <forename type="first">Christine</forename>
                                    <surname>Vautrin-Ul</surname>
                                </persName>
                            </author>
                            <idno type="HALid">hal-00157436</idno>
                        </analytic>
                        <monogr>
                            <title level="j">Chem. Mater</title>
                            <idno type="ISSN">0897-4756</idno>
                            <imprint>
                                <biblScope unit="volume">17</biblScope>
                                <biblScope unit="page" from="491" to="501"/>
                                <date type="published" when="2005">2005</date>
                                <publisher>American Chemical Society</publisher>
                            </imprint>
                        </monogr>
                    </biblStruct>

as well as consolidated entries with both DOI and HAL ID:

                    <biblStruct xml:id="b16">
                        <analytic>
                            <title level="a" type="main">Evidence of the Grafting Mechanisms of Diazonium Salts on Gold Nanostructures</title>
                            <author>
                                <persName>
                                    <forename type="first">Stéphanie</forename>
                                    <surname>Betelu</surname>
                                </persName>
                            </author>
                            <author>
                                <persName>
                                    <forename type="first">Inga</forename>
                                    <surname>Tijunelyte</surname>
                                </persName>
                            </author>
                            <author>
                                <persName>
                                    <forename type="first">Leïla</forename>
                                    <surname>Boubekeur-Lecaque</surname>
                                </persName>
                            </author>
                            <author>
                                <persName>
                                    <forename type="first">Ioannis</forename>
                                    <surname>Ignatiadis</surname>
                                </persName>
                            </author>
                            <author>
                                <persName>
                                    <forename type="first">Joyce</forename>
                                    <surname>Ibrahim</surname>
                                </persName>
                            </author>
                            <author>
                                <persName>
                                    <forename type="first">Stéphane</forename>
                                    <surname>Gaboreau</surname>
                                </persName>
                            </author>
                            <author>
                                <persName>
                                    <forename type="first">Catherine</forename>
                                    <surname>Berho</surname>
                                </persName>
                            </author>
                            <author>
                                <persName>
                                    <forename type="first">Timothée</forename>
                                    <surname>Toury</surname>
                                </persName>
                            </author>
                            <author>
                                <persName>
                                    <forename type="first">Erwann</forename>
                                    <surname>Guenin</surname>
                                </persName>
                            </author>
                            <author>
                                <persName>
                                    <forename type="first">Nathalie</forename>
                                    <surname>Lidgi-Guigui</surname>
                                </persName>
                            </author>
                            <author>
                                <persName>
                                    <forename type="first">Nordin</forename>
                                    <surname>Felidj</surname>
                                </persName>
                            </author>
                            <author>
                                <persName>
                                    <forename type="first">Emmanuel</forename>
                                    <surname>Rinnert</surname>
                                </persName>
                            </author>
                            <author>
                                <persName>
                                    <forename type="first">Marc</forename>
                                    <surname>Lamy De La Chapelle</surname>
                                </persName>
                            </author>
                            <idno type="DOI">10.1021/acs.jpcc.6b06486</idno>
                            <idno type="HALid">hal-01685660</idno>
                        </analytic>
                        <monogr>
                            <title level="j">J. Phys. Chem. C</title>
                            <idno type="ISSN">1932-7447</idno>
                            <imprint>
                                <biblScope unit="volume">120</biblScope>
                                <biblScope unit="issue">32</biblScope>
                                <biblScope unit="page" from="18158" to=" 18166"/>
                                <date type="published" when="2016">2016</date>
                                <publisher>American Chemical Society</publisher>
                            </imprint>
                        </monogr>
                    </biblStruct>
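
To check which identifiers a consolidated entry ended up with, the `<idno>` elements under `<analytic>` can be read directly. The following is a minimal sketch, not part of this PR: the helper names `local_name`, `reference_identifiers` and `classify` are made up for illustration, and the two sample fragments are shortened copies of the entries above. It assumes each entry is available as a standalone `<biblStruct>` string; tags are compared by local name so the same code should also cope with elements carrying the TEI namespace.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def reference_identifiers(bibl_struct_xml):
    """Return {idno type: value} for the <idno> children of <analytic>.

    Identifiers under <monogr> (such as the journal ISSN) are ignored.
    """
    root = ET.fromstring(bibl_struct_xml)
    ids = {}
    for elem in root.iter():
        if local_name(elem.tag) != "analytic":
            continue
        for child in elem:
            if local_name(child.tag) == "idno" and child.text:
                ids[child.get("type")] = child.text.strip()
    return ids


def classify(ids):
    """Label an entry by the identifiers found (hypothetical labels)."""
    has_doi = "DOI" in ids
    has_hal = "HALid" in ids
    if has_doi and has_hal:
        return "DOI and HAL ID"
    if has_hal:
        return "HAL ID only"
    if has_doi:
        return "DOI only"
    return "no identifier"


# Shortened copies of entries b14 and b16 shown above (authors omitted).
HAL_ONLY = """<biblStruct xml:id="b14">
  <analytic>
    <title level="a" type="main">Grafting of nitrophenyl groups on carbon and metallic surfaces without electrochemical induction</title>
    <idno type="HALid">hal-00157436</idno>
  </analytic>
  <monogr>
    <title level="j">Chem. Mater</title>
    <idno type="ISSN">0897-4756</idno>
  </monogr>
</biblStruct>"""

DOI_AND_HAL = """<biblStruct xml:id="b16">
  <analytic>
    <title level="a" type="main">Evidence of the Grafting Mechanisms of Diazonium Salts on Gold Nanostructures</title>
    <idno type="DOI">10.1021/acs.jpcc.6b06486</idno>
    <idno type="HALid">hal-01685660</idno>
  </analytic>
  <monogr>
    <title level="j">J. Phys. Chem. C</title>
    <idno type="ISSN">1932-7447</idno>
  </monogr>
</biblStruct>"""

if __name__ == "__main__":
    for fragment in (HAL_ONLY, DOI_AND_HAL):
        ids = reference_identifiers(fragment)
        print(classify(ids), ids)
```

Restricting the lookup to `<analytic>` keeps the journal-level ISSN out of the result, so only identifiers of the cited work itself are counted.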

coveralls commented 5 months ago

coverage: 40.77% (-0.02%) from 40.787% when pulling a77114daa7405a8845098bc8f488b61098849867 on glutton-0.3 into 694f0ed055e8c9a5d5efdc314ebef78e5e2640cf on master.

lfoppiano commented 3 months ago

I tested grobid 0.8.0 with both glutton 0.2 and glutton 0.3, and grobid 0.8.1 with both glutton 0.2 and 0.3. They all work fine.

I also ran into some problems with Elasticsearch (ES) on my side and wrote a troubleshooting section in the documentation, in case this happens again.