kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Affiliation extraction Error Ex1 #521

Closed adouib closed 2 years ago

adouib commented 4 years ago

Dear all,

I have a PDF document with many authors and associated affiliations (see the attached document), and two kinds of affiliation extraction issues appear:

  1. Two affiliations (16 and 17, for example) are merged (see the extraction result below), which produces two countries in the same XML tag ("China, Hungary"):
<author>
    <persName xmlns="http://www.tei-c.org/ns/1.0"><forename type="first">Z</forename><surname>Li</surname></persName>
    <affiliation key="aff16">
        <orgName type="laboratory">School of Physics and State Key Laboratory of Nuclear Physics and Technology</orgName>
        <orgName type="institution">Peking University</orgName>
        <address>
            <addrLine>17 MTA Atomki, Debrecen H-4001</addrLine>
            <postBox>P.O. Box 51</postBox>
            <postCode>100871</postCode>
            <settlement>Beijing</settlement>
            <country>China, Hungary</country>
        </address>
    </affiliation>
</author>
  2. Some affiliation information (the institute) is not associated with the correct affiliation, as with affiliations 10 and 11 (see the extraction result below):
<author>
    <persName xmlns="http://www.tei-c.org/ns/1.0"><forename type="first">Z</forename><forename type="middle">Y</forename><surname>Xu</surname></persName>
    <affiliation key="aff10">
        <orgName type="department" key="dep1">Department of Physics</orgName>
        <orgName type="department" key="dep2">School of Computing, Engineering and Mathematics</orgName>
        <orgName type="institution">University of Tokyo</orgName>
        <address>
            <addrLine>Hongo 7-3-1, Bunkyo-ku</addrLine>
            <postCode>113-0033</postCode>
            <settlement>Tokyo</settlement>
            <country key="JP">Japan</country>
        </address>
    </affiliation>
    <affiliation key="aff11">
        <orgName type="institution">University of Brighton</orgName>
        <address>
            <postCode>BN2 4JG</postCode>
            <settlement>Brighton</settlement>
            <country key="GB">United Kingdom</country>
        </address>
    </affiliation>
</author>

This may be caused by the training process (training data), but if there are other technical solutions, that would be better. test.pdf
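
One way to spot the first kind of error in Grobid's TEI output is to look for `<country>` elements whose text lists more than one country. The following is only a minimal sketch and not part of Grobid: the names `local_name` and `find_merged_countries` are hypothetical, and the comma test is an assumed heuristic that would also flag a legitimate country name containing a comma.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Drop a '{namespace}' prefix so namespaced and plain tags compare equal."""
    return tag.rsplit("}", 1)[-1]


def find_merged_countries(tei_fragment: str):
    """Return (affiliation key, country text) pairs whose country text contains a comma."""
    root = ET.fromstring(tei_fragment)
    suspicious = []
    for elem in root.iter():
        if local_name(elem.tag) != "affiliation":
            continue
        for child in elem.iter():
            text = (child.text or "").strip()
            # Assumed heuristic: a comma suggests two countries were concatenated.
            if local_name(child.tag) == "country" and "," in text:
                suspicious.append((elem.get("key"), text))
    return suspicious


# Abbreviated version of the output reported above.
sample = """<author>
  <affiliation key="aff16">
    <address>
      <settlement>Beijing</settlement>
      <country>China, Hungary</country>
    </address>
  </affiliation>
  <affiliation key="aff11">
    <address><country key="GB">United Kingdom</country></address>
  </affiliation>
</author>"""

print(find_merged_countries(sample))  # [('aff16', 'China, Hungary')]
```

A check like this only detects the symptom after extraction; it does not correct the segmentation itself.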

kermitt2 commented 4 years ago

Hello @adouib and thank you for the error case!

  1. These affiliation merging errors are unfortunately relatively frequent at present. The problem is the segmentation of affiliations: the two affiliations are viewed as one, and the fields of the two are concatenated. The source of the error is that the index number of the Hungarian affiliation has been labelled as an address line instead of as a key index (see below).

Actually, I haven't looked at the affiliation/address parser for 4-5 years, and it would require some basic fixes to avoid these stupid errors (without more training data, as you point out). We can now use the superscript attribute as a feature to identify the index markers of authors and affiliations more precisely; there is an error in the line begin/end feature (see below); and the affiliations could be better segmented by taking the layout into account (segmentation by spacing criteria).

I would really welcome a developer contribution on this. There are other developments in progress to be done in the next months, so I won't look at it myself, and this is a narrow model, easier to address without interactions with other models.

16  16  1   16  16  16  6   16  16  16  LINEEND NOCAPS  ALLDIGIT    0   0   0   0   0   0   NOPUNCT dd  I-<affiliation> I-<marker>
School  school  S   Sc  Sch Scho    l   ol  ool hool    LINEEND INITCAP NODIGIT 0   0   1   0   0   0   NOPUNCT Xxxx    <affiliation>   I-<department>
of  of  o   of  of  of  f   of  of  of  LINEEND NOCAPS  NODIGIT 0   0   1   0   0   0   NOPUNCT xx  <affiliation>   <department>
Physics physics P   Ph  Phy Phys    s   cs  ics sics    LINEEND INITCAP NODIGIT 0   0   1   0   0   0   NOPUNCT Xxxx    <affiliation>   <department>
and and a   an  and and d   nd  and and LINEEND NOCAPS  NODIGIT 0   0   1   0   0   0   NOPUNCT xxx <affiliation>   <department>
State   state   S   St  Sta Stat    e   te  ate tate    LINEEND INITCAP NODIGIT 0   0   1   0   0   0   NOPUNCT Xxxx    <affiliation>   <department>
Key key K   Ke  Key Key y   ey  Key Key LINEEND INITCAP NODIGIT 0   1   1   0   0   0   NOPUNCT Xxx <affiliation>   <department>
Laboratory  laboratory  L   La  Lab Labo    y   ry  ory tory    LINEEND INITCAP NODIGIT 0   0   1   0   0   0   NOPUNCT Xxxx    <affiliation>   <department>
of  of  o   of  of  of  f   of  of  of  LINEEND NOCAPS  NODIGIT 0   0   1   0   0   0   NOPUNCT xx  <affiliation>   <department>
Nuclear nuclear N   Nu  Nuc Nucl    r   ar  ear lear    LINEEND INITCAP NODIGIT 0   0   1   0   0   0   NOPUNCT Xxxx    <affiliation>   <department>
Physics physics P   Ph  Phy Phys    s   cs  ics sics    LINEEND INITCAP NODIGIT 0   0   1   0   0   0   NOPUNCT Xxxx    <affiliation>   <department>
and and a   an  and and d   nd  and and LINEEND NOCAPS  NODIGIT 0   0   1   0   0   0   NOPUNCT xxx <affiliation>   <department>
Technology  technology  T   Te  Tec Tech    y   gy  ogy logy    LINEEND INITCAP NODIGIT 0   0   1   0   0   0   NOPUNCT Xxxx    <affiliation>   <department>
,   ,   ,   ,   ,   ,   ,   ,   ,   ,   LINEEND ALLCAPS NODIGIT 1   0   0   0   0   0   COMMA   ,   <affiliation>   I-<other>
Peking  peking  P   Pe  Pek Peki    g   ng  ing king    LINEEND INITCAP NODIGIT 0   0   0   0   1   0   NOPUNCT Xxxx    <affiliation>   I-<institution>
University  university  U   Un  Uni Univ    y   ty  ity sity    LINEEND INITCAP NODIGIT 0   0   1   0   0   0   NOPUNCT Xxxx    <affiliation>   <institution>
,   ,   ,   ,   ,   ,   ,   ,   ,   ,   LINEEND ALLCAPS NODIGIT 1   0   0   0   0   0   COMMA   ,   <affiliation>   I-<other>
Beijing beijing B   Be  Bei Beij    g   ng  ing jing    LINEEND INITCAP NODIGIT 0   0   0   0   1   0   NOPUNCT Xxxx    I-<address> I-<settlement>
100871  100871  1   10  100 1008    1   71  871 0871    LINEEND NOCAPS  ALLDIGIT    0   0   0   0   0   0   NOPUNCT dddd    <address>   I-<postCode>
,   ,   ,   ,   ,   ,   ,   ,   ,   ,   LINEEND ALLCAPS NODIGIT 1   0   0   0   0   0   COMMA   ,   <address>   I-<other>
China   china   C   Ch  Chi Chin    a   na  ina hina    LINEEND INITCAP NODIGIT 0   1   1   0   0   1   NOPUNCT Xxxx    <address>   I-<country>
17  17  1   17  17  17  7   17  17  17  LINEEND NOCAPS  ALLDIGIT    0   0   0   0   0   0   NOPUNCT dd  <address>   I-<addrLine>
MTA mta M   MT  MTA MTA A   TA  MTA MTA LINEEND ALLCAPS NODIGIT 0   0   0   0   0   0   NOPUNCT XXX <address>   <addrLine>
Atomki  atomki  A   At  Ato Atom    i   ki  mki omki    LINEEND INITCAP NODIGIT 0   0   0   0   0   0   NOPUNCT Xxxx    <address>   <addrLine>
,   ,   ,   ,   ,   ,   ,   ,   ,   ,   LINEEND ALLCAPS NODIGIT 1   0   0   0   0   0   COMMA   ,   <address>   I-<other>
P   p   P   P   P   P   P   P   P   P   LINEEND ALLCAPS NODIGIT 1   0   0   0   0   0   NOPUNCT X   <address>   I-<postBox>
.   .   .   .   .   .   .   .   .   .   LINEEND ALLCAPS NODIGIT 1   0   0   0   0   0   DOT .   <address>   <postBox>
O   o   O   O   O   O   O   O   O   O   LINEEND ALLCAPS NODIGIT 1   0   0   0   1   0   NOPUNCT X   <address>   <postBox>
.   .   .   .   .   .   .   .   .   .   LINEEND ALLCAPS NODIGIT 1   0   0   0   0   0   DOT .   <address>   <postBox>
Box box B   Bo  Box Box x   ox  Box Box LINEEND INITCAP NODIGIT 0   1   1   0   0   0   NOPUNCT Xxx <address>   <postBox>
51  51  5   51  51  51  1   51  51  51  LINEEND NOCAPS  ALLDIGIT    0   0   0   0   0   0   NOPUNCT dd  <address>   <postBox>
,   ,   ,   ,   ,   ,   ,   ,   ,   ,   LINEEND ALLCAPS NODIGIT 1   0   0   0   0   0   COMMA   ,   <address>   I-<other>
Debrecen    debrecen    D   De  Deb Debr    n   en  cen ecen    LINEEND INITCAP NODIGIT 0   0   0   0   1   0   NOPUNCT Xxxx    <address>   I-<addrLine>
H   h   H   H   H   H   H   H   H   H   LINEEND ALLCAPS NODIGIT 1   0   0   0   0   0   NOPUNCT X   <address>   <addrLine>
-   -   -   -   -   -   -   -   -   -   LINEEND ALLCAPS NODIGIT 1   0   0   0   0   0   HYPHEN  -   <address>   <addrLine>
4001    4001    4   40  400 4001    1   01  001 4001    LINEEND NOCAPS  ALLDIGIT    0   0   0   0   0   0   NOPUNCT dddd    <address>   <addrLine>
,   ,   ,   ,   ,   ,   ,   ,   ,   ,   LINEEND ALLCAPS NODIGIT 1   0   0   0   0   0   COMMA   ,   <address>   I-<other>
Hungary hungary H   Hu  Hun Hung    y   ry  ary gary    LINEEND INITCAP NODIGIT 0   0   0   0   0   1   NOPUNCT Xxxx    <address>   I-<country>
  2. For this one I get something a bit different with the current Grobid master version, 0.6.0-SNAPSHOT:
                        <author>
                            <persName
                                xmlns="http://www.tei-c.org/ns/1.0">
                                <forename type="first">Z</forename>
                                <forename type="middle">Y</forename>
                                <surname>Xu</surname>
                            </persName>
                            <affiliation key="aff10">
                                <orgName type="department">Department of Physics</orgName>
                                <orgName type="institution">University of Tokyo</orgName>
                                <address>
                                    <addrLine>Hongo 7-3-1, Bunkyo-ku</addrLine>
                                    <postCode>113-0033</postCode>
                                    <settlement>Tokyo</settlement>
                                    <country key="JP">Japan</country>
                                </address>
                            </affiliation>
                        </author>

But then affiliation 11 is not attached to its author, because the index marker token introducing the affiliation is missing. Again, something is going wrong in the segmentation process.
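
To illustrate why a single mislabelled index marker is enough to merge two affiliations, here is a minimal sketch that groups labelled tokens into affiliations by starting a new group at every `I-<marker>` label. It is not Grobid's actual segmentation code: the function `split_on_markers` is hypothetical, and the `(token, label)` pairs are an abbreviated selection taken from the first and last columns of the dump above.

```python
def split_on_markers(labelled_tokens):
    """Group (token, label) pairs into affiliations, opening a new group at each I-<marker>."""
    groups = []
    for token, label in labelled_tokens:
        if label == "I-<marker>" or not groups:
            groups.append([])
        groups[-1].append(token)
    return groups


# Abbreviated labels as they appear in the dump: "17" is tagged as an address line.
predicted = [
    ("16", "I-<marker>"),
    ("Peking", "I-<institution>"),
    ("University", "<institution>"),
    ("China", "I-<country>"),
    ("17", "I-<addrLine>"),  # mislabelled: this is the index marker of the next affiliation
    ("MTA", "<addrLine>"),
    ("Atomki", "<addrLine>"),
    ("Hungary", "I-<country>"),
]

# The same sequence with "17" relabelled as a marker (the assumed correct label).
corrected = [(tok, "I-<marker>") if tok == "17" else (tok, lab) for tok, lab in predicted]

print(len(split_on_markers(predicted)))   # 1
print(len(split_on_markers(corrected)))   # 2
```

With the predicted labels, both countries end up in the same group, which matches the "China, Hungary" output reported for affiliation 16.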

manuelguzmandao commented 4 years ago

Hello @kermitt2, we will start allocating some time internally for contributions to Grobid, starting with this one. As it will be our first attempt with this code, is there any guidance you can give us to understand the code's structure, mainly for this change? Any hint is welcome!

kermitt2 commented 4 years ago

As the header model and the affiliation part have been updated, there has been some progress. Regarding 1, this kind of merging should no longer occur:

                       <author>
                            <persName
                                xmlns="http://www.tei-c.org/ns/1.0">
                                <forename type="first">Z</forename>
                                <surname>Li</surname>
                            </persName>
                            <affiliation key="aff16">
                                <orgName type="department">School of Physics and State Key Laboratory of Nuclear Physics and Technology</orgName>
                                <orgName type="institution">Peking University</orgName>
                                <address>
                                    <postCode>100871</postCode>
                                    <settlement>Beijing</settlement>
                                    <country key="CN">China</country>
                                </address>
                            </affiliation>
                        </author>

Affiliation 10 was already correctly associated before the last updates, and affiliation 11 is now correctly recognized and assigned too:

                       <author>
                            <persName
                                xmlns="http://www.tei-c.org/ns/1.0">
                                <forename type="first">F</forename>
                                <surname>Browne</surname>
                            </persName>
                            <affiliation key="aff4">
                                <orgName type="institution" key="instit1">RIKEN Nishina Center</orgName>
                                <orgName type="institution" key="instit2">RIKEN</orgName>
                                <address>
                                    <addrLine>2-1 Hirosawa, Wako-shi</addrLine>
                                    <postCode>351-0198</postCode>
                                    <settlement>Saitama</settlement>
                                    <country key="JP">Japan</country>
                                </address>
                            </affiliation>
                            <affiliation key="aff11">
                                <orgName type="department">School of Computing, Engineering and Mathematics</orgName>
                                <orgName type="institution">University of Brighton</orgName>
                                <address>
                                    <postCode>BN2 4JG</postCode>
                                    <settlement>Brighton</settlement>
                                    <country key="GB">United Kingdom</country>
                                </address>
                            </affiliation>
                        </author>