kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Improve the merging of article metadata when doing consolidation, in particular avoid affiliations being lost in the process #517

Open maria-grigorieva opened 4 years ago

maria-grigorieva commented 4 years ago

Hi, I have several HEP papers where Grobid doesn't recognize affiliations. For example, this one: http://inspirehep.net/record/1699866/files/10.1088_1742-6596_1085_3_032051.pdf

I tested it online: http://cloud.science-miner.com/grobid/ and locally (repository was cloned from master branch).

Could you please clarify the reason? Should I use another branch?

kermitt2 commented 4 years ago

Hello @maria-grigorieva

Thanks for the issue!

Actually, Grobid correctly recognizes and attaches the affiliations for your example. If you unselect "consolidate header", you will see the expected affiliations.
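For anyone calling the service from a script rather than the web console, the same switch is exposed as the `consolidateHeader` form parameter of the `/api/processHeaderDocument` endpoint. The following is only a minimal sketch using the Python standard library: the helper name `build_header_request`, the `paper.pdf` filename, and the `http://localhost:8070` base URL are assumptions for illustration, not part of Grobid.

```python
import urllib.request
import uuid


def build_header_request(pdf_bytes, consolidate, base_url="http://localhost:8070"):
    """Build (but do not send) a multipart POST to Grobid's header service.

    base_url is an assumed local install; adjust it to your own server.
    """
    boundary = uuid.uuid4().hex
    flag = "1" if consolidate else "0"  # "0" switches consolidation off
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        b'Content-Disposition: form-data; name="consolidateHeader"\r\n\r\n',
        flag.encode() + b"\r\n",
        f"--{boundary}\r\n".encode(),
        b'Content-Disposition: form-data; name="input"; filename="paper.pdf"\r\n',
        b"Content-Type: application/pdf\r\n\r\n",
        pdf_bytes + b"\r\n",
        f"--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        f"{base_url}/api/processHeaderDocument",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

With a Grobid server running, `urllib.request.urlopen(build_header_request(pdf_bytes, consolidate=False)).read()` should return the TEI header built from the extracted metadata only.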

We have a different bug: when the header metadata are matched in CrossRef (consolidation), the authors from the CrossRef metadata get prioritized and overwrite the extracted authors, and unfortunately, in the current state, the affiliations are lost... The consolidation of header data requires more work to better merge the extracted data with the matched publisher's metadata, as visible in your example.

This has been pending for quite a while, I know, but I haven't found the time to revisit it. Actually, this example is a very good test case.

(For the first author, without consolidation, you will see that his initials FH are not correctly attached to the author, because his name has an uncommon pattern and "Barreiro" is seen as the forename. This is where we see that consolidation is helpful for fixing these small errors, but it should work as merging rather than simple overwriting, as it does now.)
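To make the difference between merging and overwriting concrete, here is a rough Python sketch. It is not Grobid's actual implementation (Grobid is written in Java): the `Author` class, the `merge_authors` function, and the matching rule (surname plus first initial of the forename) are all assumptions chosen for illustration, and the names in the usage note are made up.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Author:
    forename: str
    surname: str
    affiliations: List[str] = field(default_factory=list)


def _key(author):
    # Hypothetical matching rule: lowercased surname + first initial.
    return (author.surname.strip().lower(), author.forename.strip()[:1].lower())


def merge_authors(extracted, consolidated):
    """Take names from the consolidated (publisher) record, but keep
    the affiliations that were extracted from the PDF."""
    by_key = {}
    for author in extracted:
        by_key.setdefault(_key(author), author)

    merged = []
    for cons in consolidated:
        affiliations = list(cons.affiliations)
        match = by_key.get(_key(cons))
        if match is not None:
            # Carry over extracted affiliations instead of dropping them.
            for aff in match.affiliations:
                if aff not in affiliations:
                    affiliations.append(aff)
        merged.append(Author(cons.forename, cons.surname, affiliations))
    return merged
```

For instance, merging an extracted `Author("J", "Doe", ["Example University"])` with a consolidated `Author("Jane", "Doe")` yields the fuller forename together with the extracted affiliation. A simple key like this would not match a case such as the first author above, where the surname itself was segmented differently, so a real merge needs a softer name comparison.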

maria-grigorieva commented 4 years ago

Hi, Patrice

Thank you so much for the quick response. Yes, it really works without the consolidation service. I apologize for not checking this before submitting the issue. For me, affiliations are very important, as I want to enrich paper metadata (taken from InspireHEP) with affiliations (where needed), for example to get all papers with authors from a particular university, even without links to the authors.
So, would it be possible to capture just the names of universities and put them in the TEI structure? (Of course, I understand that it can be done manually (outside Grobid), but in that case I need a list of all universities and their abbreviations...)

Kind Regards, Maria

kermitt2 commented 4 years ago

This is now working fine in the updated master; the consolidated header below preserves the affiliations. However, a couple of other things still need improvement in this particular case: the email recognition and attachment, and the country code in the affiliations.

<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve"
    xmlns="http://www.tei-c.org/ns/1.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.tei-c.org/ns/1.0 /home/lopez/grobid/grobid-home/schemas/xsd/Grobid.xsd"
    xmlns:xlink="http://www.w3.org/1999/xlink">
    <teiHeader xml:lang="en">
        <fileDesc>
            <titleStmt>
                <title level="a" type="main">Predictive analytics tools to adjust and monitor performance metrics for the ATLAS Production System</title>
            </titleStmt>
            <publicationStmt>
                <publisher/>
                <availability status="unknown">
                    <licence/>
                </availability>
            </publicationStmt>
            <sourceDesc>
                <biblStruct>
                    <analytic>
                        <author>
                            <persName
                                xmlns="http://www.tei-c.org/ns/1.0">
                                <forename type="first">Barreiro</forename>
                                <surname>Megino</surname>
                            </persName>
                            <affiliation key="aff0">
                                <orgName type="institution">University of Texas at Arlington (US)</orgName>
                            </affiliation>
                        </author>
                        <author>
                            <persName
                                xmlns="http://www.tei-c.org/ns/1.0">
                                <forename type="first">M</forename>
                                <surname>Borodin</surname>
                            </persName>
                            <affiliation key="aff1">
                                <orgName type="institution">University of Iowa (US)</orgName>
                            </affiliation>
                        </author>
                        <author>
                            <persName
                                xmlns="http://www.tei-c.org/ns/1.0">
                                <forename type="first">D</forename>
                                <surname>Golubkov</surname>
                            </persName>
                            <affiliation key="aff2">
                                <orgName type="institution">Institute for High Energy Physics (RU)</orgName>
                            </affiliation>
                            <affiliation key="aff3">
                                <orgName type="institution">National Research Centre Kurchatov Institute (RU)</orgName>
                            </affiliation>
                        </author>
                        <author>
                            <persName
                                xmlns="http://www.tei-c.org/ns/1.0">
                                <forename type="first">M</forename>
                                <surname>Grigorieva</surname>
                            </persName>
                            <affiliation key="aff3">
                                <orgName type="institution">National Research Centre Kurchatov Institute (RU)</orgName>
                            </affiliation>
                            <affiliation key="aff4">
                                <orgName type="institution">National Research Tomsk Polytechnic University (RU)</orgName>
                            </affiliation>
                        </author>
                        <author>
                            <persName
                                xmlns="http://www.tei-c.org/ns/1.0">
                                <forename type="first">M</forename>
                                <surname>Gubin</surname>
                            </persName>
                            <affiliation key="aff4">
                                <orgName type="institution">National Research Tomsk Polytechnic University (RU)</orgName>
                            </affiliation>
                        </author>
                        <author>
                            <persName
                                xmlns="http://www.tei-c.org/ns/1.0">
                                <forename type="first">A</forename>
                                <surname>Klimentov</surname>
                            </persName>
                            <affiliation key="aff3">
                                <orgName type="institution">National Research Centre Kurchatov Institute (RU)</orgName>
                            </affiliation>
                            <affiliation key="aff5">
                                <orgName type="institution">Brookhaven National Laboratory (US)</orgName>
                            </affiliation>
                        </author>
                        <author>
                            <persName
                                xmlns="http://www.tei-c.org/ns/1.0">
                                <forename type="first">T</forename>
                                <surname>Korchuganova</surname>
                            </persName>
                            <affiliation key="aff4">
                                <orgName type="institution">National Research Tomsk Polytechnic University (RU)</orgName>
                            </affiliation>
                        </author>
                        <author>
                            <persName
                                xmlns="http://www.tei-c.org/ns/1.0">
                                <forename type="first">T</forename>
                                <surname>Maeno</surname>
                            </persName>
                            <affiliation key="aff5">
                                <orgName type="institution">Brookhaven National Laboratory (US)</orgName>
                            </affiliation>
                        </author>
                        <author>
                            <persName
                                xmlns="http://www.tei-c.org/ns/1.0">
                                <forename type="first">S</forename>
                                <surname>Padolski</surname>
                            </persName>
                            <affiliation key="aff5">
                                <orgName type="institution">Brookhaven National Laboratory (US)</orgName>
                            </affiliation>
                        </author>
                        <author>
                            <persName
                                xmlns="http://www.tei-c.org/ns/1.0">
                                <forename type="first">M</forename>
                                <surname>Titov</surname>
                            </persName>
                        </author>
                        <title level="a" type="main">Predictive analytics tools to adjust and monitor performance metrics for the ATLAS Production System</title>
                    </analytic>
                    <monogr>
                        <imprint>
                            <date/>
                        </imprint>
                    </monogr>
                    <idno type="DOI">10.1088/1742-6596/1085/3/032051</idno>
                </biblStruct>
            </sourceDesc>
        </fileDesc>
        <encodingDesc>
            <appInfo>
                <application version="0.6.1-SNAPSHOT" ident="GROBID" when="2020-08-11T11:52+0000">
                    <desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
                    <ref target="https://github.com/kermitt2/grobid"/>
                </application>
            </appInfo>
        </encodingDesc>
        <profileDesc>
            <abstract>
                <p>Having information such as an estimation of the processing time or possibility of system outage (abnormal behaviour) helps to assist to monitor system performance and to predict its next state. The current cyber-infrastructure of the ATLAS Production System presents computing conditions in which contention for resources among high-priority data analyses happens routinely, that might lead to significant workload and data handling interruptions. The lack of the possibility to monitor and to predict the behaviour of the analysis process (its duration) and system&apos;s state itself provides motivation for a focus on design of the built-in situational awareness analytic tools.</p>
            </abstract>
        </profileDesc>
    </teiHeader>
    <text xml:lang="en"></text>
</TEI>
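
For the use case raised earlier in the thread (collecting institution names regardless of which author they belong to), a TEI header like the one above can be post-processed outside Grobid. The following is a minimal sketch with Python's standard `xml.etree.ElementTree`; the function name `collect_institutions` and the shortened `SAMPLE` document are illustrative assumptions, with two institution names copied from the output above.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Shortened stand-in for a Grobid TEI header (illustration only).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><sourceDesc><biblStruct><analytic>
    <author>
      <affiliation key="aff3">
        <orgName type="institution">National Research Centre Kurchatov Institute (RU)</orgName>
      </affiliation>
      <affiliation key="aff5">
        <orgName type="institution">Brookhaven National Laboratory (US)</orgName>
      </affiliation>
    </author>
    <author>
      <affiliation key="aff5">
        <orgName type="institution">Brookhaven National Laboratory (US)</orgName>
      </affiliation>
    </author>
  </analytic></biblStruct></sourceDesc></fileDesc></teiHeader>
</TEI>"""


def collect_institutions(root):
    """Return distinct institution names in document order."""
    names = []
    path = ".//tei:affiliation/tei:orgName[@type='institution']"
    for org in root.iterfind(path, TEI_NS):
        name = (org.text or "").strip()
        if name and name not in names:
            names.append(name)
    return names


if __name__ == "__main__":
    for name in collect_institutions(ET.fromstring(SAMPLE)):
        print(name)
```

For a TEI file on disk, `ET.parse(path).getroot()` can be passed to the same function. Mapping the collected names to a canonical list of universities and abbreviations remains a separate step that this sketch does not attempt.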