openaire / iis

Information Inference Service of the OpenAIRE system
Apache License 2.0

Fix the JATS ingester module responsible for authors and affiliations parsing #1464

Closed by marekhorst 3 weeks ago

marekhorst commented 2 months ago

Originally requested in redmine: https://support.openaire.eu/issues/9982.

After running the JATS ingester module (originally prepared to handle JATS records coming from PubMed) on Springer JATS records, it turned out that affiliations are not correctly linked to authors. As reported by Miriam in #9976#note-3:

Always for the 50|doi_____::974d087615705cffb4d0d7bdc6394ffc result (I have not checked the others), there is a peculiar behaviour in the affiliation association between the author and the affiliation position: all the authors have been associated to all the affiliations.

I was able to reproduce this behavior in a dedicated test case, which revealed a slightly different contributor encoding involving multiple layers of nested contrib-group, contrib and collab elements, such as:

<contrib-group>
    <contrib contrib-type="author" id="IAu1" corresp="yes">
        <xref ref-type="aff" rid="Aff4">4</xref>
        <collab>
            <institution>COVID-19 Host Genetics Initiative</institution>
            <contrib-group>
                <contrib contrib-type="author" id="IAu2">
                    <collab>
                        <institution>COVID-19 Host Genetics InitiativeLeadership</institution>
                        <contrib-group>
                            <contrib contrib-type="author" id="Au1">
                                <name name-style="western">
                                    <surname>Niemi</surname>
                                    <given-names>Mari E. K.</given-names>
                                </name>
                                <xref ref-type="aff" rid="Aff1">1</xref>
                            </contrib>
                            <contrib contrib-type="author" id="Au2">
                                <name name-style="western">
                                    <surname>Karjalainen</surname>
                                    <given-names>Juha</given-names>
                                </name>
                                <xref ref-type="aff" rid="Aff2">2</xref>
                            </contrib>
                        </contrib-group>
                    </collab>
                </contrib>
            </contrib-group>
        </collab>
    </contrib>
    <aff id="Aff1">
        <label>1</label>
        <institution-wrap>
            <institution-id institution-id-type="GRID">grid.452494.a</institution-id>
            <institution-id institution-id-type="ISNI">0000 0004 0409 5350</institution-id>
            <institution content-type="org-name">Institute for Molecular Medicine Finland (FIMM), University of Helsinki</institution>
        </institution-wrap>
        <addr-line content-type="city">Helsinki</addr-line>
        <country country="FI">Finland</country>
    </aff>
    <aff id="Aff2">
        <label>2</label>
        <institution-wrap>
            <institution-id institution-id-type="GRID">grid.66859.34</institution-id>
            <institution content-type="org-name">Broad Institute of MIT and Harvard</institution>
        </institution-wrap>
        <addr-line content-type="city">Cambridge</addr-line>
        <addr-line content-type="state">MA</addr-line>
        <country country="US">USA</country>
    </aff>
    <aff id="Aff4">
        <label>4</label>
        <institution-wrap>
            <institution-id institution-id-type="GRID">grid.66859.34</institution-id>
            <institution content-type="org-name">Massachusetts General Hospital, Broad Institute of MIT and Harvard</institution>
        </institution-wrap>
        <addr-line content-type="city">Cambridge</addr-line>
        <addr-line content-type="state">MA</addr-line>
        <country country="US">USA</country>
    </aff>
</contrib-group>

which was not properly handled by the ArticleMetaXmlHandler.
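To illustrate the linking problem, here is a minimal sketch of how a SAX-style handler can attach each affiliation reference only to the innermost open contrib element. It is written in Python with the standard xml.sax module and is not the actual Java code of ArticleMetaXmlHandler; the names ContribAffHandler, link_affiliations and NESTED_SAMPLE are hypothetical, and the sample is a shortened version of the record above.

```python
import xml.sax


class ContribAffHandler(xml.sax.ContentHandler):
    """Links each <xref ref-type="aff"> to the innermost open <contrib> only."""

    def __init__(self):
        super().__init__()
        self.open_contribs = []  # stack of ids of currently open <contrib> elements
        self.aff_refs = {}       # contrib id -> list of referenced aff ids

    def startElement(self, name, attrs):
        if name == "contrib":
            contrib_id = attrs.get("id")
            self.open_contribs.append(contrib_id)
            self.aff_refs.setdefault(contrib_id, [])
        elif name == "xref" and attrs.get("ref-type") == "aff" and self.open_contribs:
            # Only the top of the stack is credited, never the enclosing parents.
            self.aff_refs[self.open_contribs[-1]].append(attrs.get("rid"))

    def endElement(self, name):
        if name == "contrib":
            self.open_contribs.pop()


def link_affiliations(jats_xml: str) -> dict:
    """Parses a JATS fragment and returns the contrib id -> aff ids mapping."""
    handler = ContribAffHandler()
    xml.sax.parseString(jats_xml.encode("utf-8"), handler)
    return handler.aff_refs


# Shortened, hypothetical test input modelled on the Springer record above.
NESTED_SAMPLE = """<contrib-group>
  <contrib contrib-type="author" id="IAu1">
    <xref ref-type="aff" rid="Aff4">4</xref>
    <collab>
      <institution>COVID-19 Host Genetics Initiative</institution>
      <contrib-group>
        <contrib contrib-type="author" id="Au1">
          <name><surname>Niemi</surname><given-names>Mari E. K.</given-names></name>
          <xref ref-type="aff" rid="Aff1">1</xref>
        </contrib>
        <contrib contrib-type="author" id="Au2">
          <name><surname>Karjalainen</surname><given-names>Juha</given-names></name>
          <xref ref-type="aff" rid="Aff2">2</xref>
        </contrib>
      </contrib-group>
    </collab>
  </contrib>
</contrib-group>"""

print(link_affiliations(NESTED_SAMPLE))
```

With a stack, closing a nested contrib restores the parent as the current contributor, so an affiliation reference is never shared across nesting levels.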

Additionally, it was discovered that an author name was also broken by the nested contributor structure: the institution name from a parent contributor was glued to the name of the first child contributor. This also needs to be fixed.
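The name gluing can be avoided by capturing character data only while inside a name part and starting a fresh buffer for each part. The following is a minimal Python sketch of that idea under the same assumptions as before: it is not the project's Java handler, and AuthorNameHandler, extract_author_names and NAME_SAMPLE are hypothetical names.

```python
import xml.sax


class AuthorNameHandler(xml.sax.ContentHandler):
    """Collects "Surname, Given" names, ignoring text outside the name parts."""

    NAME_PARTS = ("surname", "given-names")

    def __init__(self):
        super().__init__()
        self.names = []     # finished author names
        self.parts = None   # parts of the <name> being read, None outside <name>
        self.buffer = None  # text of the current name part, None when not capturing

    def startElement(self, name, attrs):
        if name == "name":
            self.parts = {}
        elif name in self.NAME_PARTS and self.parts is not None:
            # Fresh buffer per part: text of a parent <institution> cannot leak in.
            self.buffer = []

    def characters(self, content):
        if self.buffer is not None:
            self.buffer.append(content)

    def endElement(self, name):
        if name in self.NAME_PARTS and self.buffer is not None:
            self.parts[name] = "".join(self.buffer).strip()
            self.buffer = None
        elif name == "name" and self.parts is not None:
            self.names.append(
                ", ".join(self.parts[p] for p in self.NAME_PARTS if p in self.parts)
            )
            self.parts = None


def extract_author_names(jats_xml: str) -> list:
    """Parses a JATS fragment and returns the author names in document order."""
    handler = AuthorNameHandler()
    xml.sax.parseString(jats_xml.encode("utf-8"), handler)
    return handler.names


# Shortened, hypothetical input: an institution precedes the first nested author.
NAME_SAMPLE = """<contrib contrib-type="author" id="IAu1">
  <collab>
    <institution>COVID-19 Host Genetics Initiative</institution>
    <contrib-group>
      <contrib contrib-type="author" id="Au1">
        <name><surname>Niemi</surname><given-names>Mari E. K.</given-names></name>
      </contrib>
      <contrib contrib-type="author" id="Au2">
        <name><surname>Karjalainen</surname><given-names>Juha</given-names></name>
      </contrib>
    </contrib-group>
  </collab>
</contrib>"""

print(extract_author_names(NAME_SAMPLE))
```

Because the buffer exists only between the start and end of a surname or given-names element, the institution text of the enclosing collab is never appended to an author name.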

marekhorst commented 2 months ago

As a follow-up to this task, the PMC cache should be updated: drop the most recent cache update, which includes the results of parsing Springer records, and rerun the PMC ingestion with a cache update (e.g. as part of the IIS primary job).

marekhorst commented 3 weeks ago

The fix for #1464 became part of the fix for #1466 and was introduced with this commit: https://github.com/openaire/iis/commit/a8f5a302877d9241e94d3a6f67c364e19cddc7cf