Closed by adouib 2 years ago
Hello @adouib and thank you for the error case!
Actually, I haven't looked at the affiliation/address parser in 4-5 years, and it needs some basic fixes to avoid those stupid errors (without more training data, as you point out). We can now use the superscript attribute as a feature to identify the index markers of authors and affiliations more precisely; there is an error in the line begin/end feature (see below); and the affiliations could be segmented better by taking the layout into account (segmentation by spacing criteria).
I would really welcome a developer contribution on this. There are other developments in progress to be done in the next months, so I won't look at that myself - and this is a narrow model, easier to address without interactions with other models.
16 16 1 16 16 16 6 16 16 16 LINEEND NOCAPS ALLDIGIT 0 0 0 0 0 0 NOPUNCT dd I-<affiliation> I-<marker>
School school S Sc Sch Scho l ol ool hool LINEEND INITCAP NODIGIT 0 0 1 0 0 0 NOPUNCT Xxxx <affiliation> I-<department>
of of o of of of f of of of LINEEND NOCAPS NODIGIT 0 0 1 0 0 0 NOPUNCT xx <affiliation> <department>
Physics physics P Ph Phy Phys s cs ics sics LINEEND INITCAP NODIGIT 0 0 1 0 0 0 NOPUNCT Xxxx <affiliation> <department>
and and a an and and d nd and and LINEEND NOCAPS NODIGIT 0 0 1 0 0 0 NOPUNCT xxx <affiliation> <department>
State state S St Sta Stat e te ate tate LINEEND INITCAP NODIGIT 0 0 1 0 0 0 NOPUNCT Xxxx <affiliation> <department>
Key key K Ke Key Key y ey Key Key LINEEND INITCAP NODIGIT 0 1 1 0 0 0 NOPUNCT Xxx <affiliation> <department>
Laboratory laboratory L La Lab Labo y ry ory tory LINEEND INITCAP NODIGIT 0 0 1 0 0 0 NOPUNCT Xxxx <affiliation> <department>
of of o of of of f of of of LINEEND NOCAPS NODIGIT 0 0 1 0 0 0 NOPUNCT xx <affiliation> <department>
Nuclear nuclear N Nu Nuc Nucl r ar ear lear LINEEND INITCAP NODIGIT 0 0 1 0 0 0 NOPUNCT Xxxx <affiliation> <department>
Physics physics P Ph Phy Phys s cs ics sics LINEEND INITCAP NODIGIT 0 0 1 0 0 0 NOPUNCT Xxxx <affiliation> <department>
and and a an and and d nd and and LINEEND NOCAPS NODIGIT 0 0 1 0 0 0 NOPUNCT xxx <affiliation> <department>
Technology technology T Te Tec Tech y gy ogy logy LINEEND INITCAP NODIGIT 0 0 1 0 0 0 NOPUNCT Xxxx <affiliation> <department>
, , , , , , , , , , LINEEND ALLCAPS NODIGIT 1 0 0 0 0 0 COMMA , <affiliation> I-<other>
Peking peking P Pe Pek Peki g ng ing king LINEEND INITCAP NODIGIT 0 0 0 0 1 0 NOPUNCT Xxxx <affiliation> I-<institution>
University university U Un Uni Univ y ty ity sity LINEEND INITCAP NODIGIT 0 0 1 0 0 0 NOPUNCT Xxxx <affiliation> <institution>
, , , , , , , , , , LINEEND ALLCAPS NODIGIT 1 0 0 0 0 0 COMMA , <affiliation> I-<other>
Beijing beijing B Be Bei Beij g ng ing jing LINEEND INITCAP NODIGIT 0 0 0 0 1 0 NOPUNCT Xxxx I-<address> I-<settlement>
100871 100871 1 10 100 1008 1 71 871 0871 LINEEND NOCAPS ALLDIGIT 0 0 0 0 0 0 NOPUNCT dddd <address> I-<postCode>
, , , , , , , , , , LINEEND ALLCAPS NODIGIT 1 0 0 0 0 0 COMMA , <address> I-<other>
China china C Ch Chi Chin a na ina hina LINEEND INITCAP NODIGIT 0 1 1 0 0 1 NOPUNCT Xxxx <address> I-<country>
17 17 1 17 17 17 7 17 17 17 LINEEND NOCAPS ALLDIGIT 0 0 0 0 0 0 NOPUNCT dd <address> I-<addrLine>
MTA mta M MT MTA MTA A TA MTA MTA LINEEND ALLCAPS NODIGIT 0 0 0 0 0 0 NOPUNCT XXX <address> <addrLine>
Atomki atomki A At Ato Atom i ki mki omki LINEEND INITCAP NODIGIT 0 0 0 0 0 0 NOPUNCT Xxxx <address> <addrLine>
, , , , , , , , , , LINEEND ALLCAPS NODIGIT 1 0 0 0 0 0 COMMA , <address> I-<other>
P p P P P P P P P P LINEEND ALLCAPS NODIGIT 1 0 0 0 0 0 NOPUNCT X <address> I-<postBox>
. . . . . . . . . . LINEEND ALLCAPS NODIGIT 1 0 0 0 0 0 DOT . <address> <postBox>
O o O O O O O O O O LINEEND ALLCAPS NODIGIT 1 0 0 0 1 0 NOPUNCT X <address> <postBox>
. . . . . . . . . . LINEEND ALLCAPS NODIGIT 1 0 0 0 0 0 DOT . <address> <postBox>
Box box B Bo Box Box x ox Box Box LINEEND INITCAP NODIGIT 0 1 1 0 0 0 NOPUNCT Xxx <address> <postBox>
51 51 5 51 51 51 1 51 51 51 LINEEND NOCAPS ALLDIGIT 0 0 0 0 0 0 NOPUNCT dd <address> <postBox>
, , , , , , , , , , LINEEND ALLCAPS NODIGIT 1 0 0 0 0 0 COMMA , <address> I-<other>
Debrecen debrecen D De Deb Debr n en cen ecen LINEEND INITCAP NODIGIT 0 0 0 0 1 0 NOPUNCT Xxxx <address> I-<addrLine>
H h H H H H H H H H LINEEND ALLCAPS NODIGIT 1 0 0 0 0 0 NOPUNCT X <address> <addrLine>
- - - - - - - - - - LINEEND ALLCAPS NODIGIT 1 0 0 0 0 0 HYPHEN - <address> <addrLine>
4001 4001 4 40 400 4001 1 01 001 4001 LINEEND NOCAPS ALLDIGIT 0 0 0 0 0 0 NOPUNCT dddd <address> <addrLine>
, , , , , , , , , , LINEEND ALLCAPS NODIGIT 1 0 0 0 0 0 COMMA , <address> I-<other>
Hungary hungary H Hu Hun Hung y ry ary gary LINEEND INITCAP NODIGIT 0 0 0 0 0 1 NOPUNCT Xxxx <address> I-<country>
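The line begin/end problem mentioned above is visible in this dump: every row carries LINEEND in the line-position column, so the feature gives the model no information. Below is a minimal Python sketch for checking this on such a dump. It assumes whitespace-separated rows with the line-position feature at index 10, as read off the rows above; the helper names are hypothetical and not part of Grobid.

```python
from collections import Counter

# 0-based column of the line-position feature, read off the dump above.
# This index is an assumption about this dump, not taken from Grobid's source.
LINE_POS_COL = 10

# Three rows copied from the dump above.
SAMPLE = """\
16 16 1 16 16 16 6 16 16 16 LINEEND NOCAPS ALLDIGIT 0 0 0 0 0 0 NOPUNCT dd I-<affiliation> I-<marker>
School school S Sc Sch Scho l ol ool hool LINEEND INITCAP NODIGIT 0 0 1 0 0 0 NOPUNCT Xxxx <affiliation> I-<department>
of of o of of of f of of of LINEEND NOCAPS NODIGIT 0 0 1 0 0 0 NOPUNCT xx <affiliation> <department>
"""


def line_position_counts(dump: str) -> Counter:
    """Count how often each line-position value occurs in a feature dump."""
    counts = Counter()
    for row in dump.splitlines():
        cols = row.split()
        if len(cols) > LINE_POS_COL:  # skip blank or truncated rows
            counts[cols[LINE_POS_COL]] += 1
    return counts


def is_degenerate(counts: Counter) -> bool:
    """True when several tokens all share one line-position value."""
    return len(counts) == 1 and sum(counts.values()) > 1


counts = line_position_counts(SAMPLE)
print(counts)
print(is_degenerate(counts))
```

If the feature were computed correctly, a multi-token affiliation line would be expected to show more than one distinct value in that column.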
<author>
<persName
xmlns="http://www.tei-c.org/ns/1.0">
<forename type="first">Z</forename>
<forename type="middle">Y</forename>
<surname>Xu</surname>
</persName>
<affiliation key="aff10">
<orgName type="department">Department of Physics</orgName>
<orgName type="institution">University of Tokyo</orgName>
<address>
<addrLine>Hongo 7-3-1, Bunkyo-ku</addrLine>
<postCode>113-0033</postCode>
<settlement>Tokyo</settlement>
<country key="JP">Japan</country>
</address>
</affiliation>
</author>
But then affiliation 11 is not attached to its author, because the index marker token introducing the affiliation is missing. Again something weird in the segmentation process.
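To illustrate why a missing marker token breaks the attachment, here is a minimal Python sketch of marker-based linking. The data structures and the function name are hypothetical and only model the idea; Grobid's actual attachment logic may differ.

```python
def attach_affiliations(authors, affiliations):
    """Link authors to affiliations through shared index markers.

    authors:      list of (name, [marker, ...])
    affiliations: list of (marker or None, text); None models a lost marker
    Returns (linked, unattached).
    """
    by_marker = {m: text for m, text in affiliations if m is not None}
    # Affiliations whose marker token was lost cannot be looked up at all.
    unattached = [text for m, text in affiliations if m is None]
    linked = {
        name: [by_marker[m] for m in markers if m in by_marker]
        for name, markers in authors
    }
    return linked, unattached


# Hypothetical input: the marker "11" was dropped during segmentation.
authors = [("F Browne", ["4", "11"])]
affiliations = [("4", "RIKEN Nishina Center"), (None, "University of Brighton")]

linked, unattached = attach_affiliations(authors, affiliations)
print(linked)
print(unattached)
```

In this sketch the author still references marker 11, but no affiliation carries it, so the second affiliation ends up in the unattached list.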
Hello @kermitt2, we will start assigning some time internally for contributions to Grobid, starting with this one. As it will be our first try with this code, is there any guide you can give us to understand the code's structure, mainly for this change? Any hint is welcome!
As the header model and the affiliation part have been updated, there is some progress. Regarding 1, there should no longer be this kind of merging:
<author>
<persName
xmlns="http://www.tei-c.org/ns/1.0">
<forename type="first">Z</forename>
<surname>Li</surname>
</persName>
<affiliation key="aff16">
<orgName type="department">School of Physics and State Key Laboratory of Nuclear Physics and Technology</orgName>
<orgName type="institution">Peking University</orgName>
<address>
<postCode>100871</postCode>
<settlement>Beijing</settlement>
<country key="CN">China</country>
</address>
</affiliation>
</author>
Affiliation 10 was already correctly associated before the last updates, and affiliation 11 is now correctly recognized and assigned too:
<author>
<persName
xmlns="http://www.tei-c.org/ns/1.0">
<forename type="first">F</forename>
<surname>Browne</surname>
</persName>
<affiliation key="aff4">
<orgName type="institution" key="instit1">RIKEN Nishina Center</orgName>
<orgName type="institution" key="instit2">RIKEN</orgName>
<address>
<addrLine>2-1 Hirosawa, Wako-shi</addrLine>
<postCode>351-0198</postCode>
<settlement>Saitama</settlement>
<country key="JP">Japan</country>
</address>
</affiliation>
<affiliation key="aff11">
<orgName type="department">School of Computing, Engineering and Mathematics</orgName>
<orgName type="institution">University of Brighton</orgName>
<address>
<postCode>BN2 4JG</postCode>
<settlement>Brighton</settlement>
<country key="GB">United Kingdom</country>
</address>
</affiliation>
</author>
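A quick way to check such associations is to list the affiliation keys attached to each author in the output. The following is a minimal Python sketch using only the standard library; the fragment is abridged from the output above, and the helper names are hypothetical. Because only `persName` declares the TEI namespace in these fragments, the sketch compares local tag names rather than qualified ones.

```python
import xml.etree.ElementTree as ET

# Abridged from the <author> element shown above.
FRAGMENT = """<author>
  <persName xmlns="http://www.tei-c.org/ns/1.0">
    <forename type="first">F</forename>
    <surname>Browne</surname>
  </persName>
  <affiliation key="aff4">
    <orgName type="institution" key="instit1">RIKEN Nishina Center</orgName>
  </affiliation>
  <affiliation key="aff11">
    <orgName type="institution">University of Brighton</orgName>
  </affiliation>
</author>"""


def local(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def author_affiliations(xml_text: str):
    """Return (surname, [affiliation keys]) for one <author> element."""
    root = ET.fromstring(xml_text)
    surname = next(
        (el.text for el in root.iter() if local(el.tag) == "surname"), None
    )
    keys = [el.get("key") for el in root.iter() if local(el.tag) == "affiliation"]
    return surname, keys


print(author_affiliations(FRAGMENT))
```

For a full TEI document, the same function could be applied to each `<author>` element in turn after serializing it, or adapted to take an element directly.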
Dear all,
I have a PDF document with many authors and associated affiliations (see attached document), and two kinds of affiliation extraction issues appear:
Maybe this is caused by the training process (training data), but if there are other technical solutions, that would be better. test.pdf