kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Error case DOI with processHeaderDocument versus processFulltextDocument #916

Open kermitt2 opened 2 years ago

kermitt2 commented 2 years ago

Regarding header/metadata extraction, the following PLOS article is processed correctly by the processFulltextDocument service (correct DOI, journal, etc.).

However, with processHeaderDocument the wrong DOI is selected (the one for the data at Zenodo), despite a correct title and first author.

journal.pone.0263302.pdf

Thanks @Aazhar for the error case.

lfoppiano commented 4 days ago

With version 0.8.1 this no longer happens; however, with the CRF models the DOI is prefixed with the string e0263302.
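As a client-side stopgap, the prefix could be stripped after extraction. The following is only a sketch and not GROBID code: the function name `strip_doi_prefix` and the regular expression are assumptions, and the pattern assumes a DOI starts with `10.` followed by a 4 to 9 digit registrant code and a slash.

```python
import re

# Assumption: a DOI begins with "10.", a 4-9 digit registrant code, then "/".
DOI_START = re.compile(r"10\.\d{4,9}/")


def strip_doi_prefix(raw: str) -> str:
    """Drop anything that precedes the first DOI-like "10.NNNN/" pattern.

    The input is returned unchanged when no such pattern is found.
    """
    match = DOI_START.search(raw)
    return raw[match.start():] if match else raw


# The DOI value reported with the CRF models in this issue.
print(strip_doi_prefix("e0263302.10.1371/journal.pone.0263302"))
# prints: 10.1371/journal.pone.0263302
```

This only hides the symptom on the consumer side; the labeling by the CRF header model is unchanged.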

Header only

With CRF only, we've got:

                    </monogr>
                    <idno type="MD5">C1A860C14E064D1C9E586BCFC5463C92</idno>
                    <idno type="DOI">e0263302.10.1371/journal.pone.0263302</idno>
                    <note type="submission">Received: July 27, 2021 Accepted: January 16, 2022</note>
                </biblStruct>

With DL:

                    <monogr>
                        <imprint>
                            <date type="published" when="2022-01-28">January 28, 2022</date>
                        </imprint>
                    </monogr>
                    <idno type="MD5">C1A860C14E064D1C9E586BCFC5463C92</idno>
                    <idno type="DOI">10.1371/journal.pone.0263302</idno>
                    <note type="submission">Received: July 27, 2021 Accepted: January 16, 2022</note>
                </biblStruct>

Fulltext:

With CRF:

                        <imprint>
                            <date type="published" when="2022-01-28">January 28, 2022</date>
                        </imprint>
                    </monogr>
                    <idno type="MD5">C1A860C14E064D1C9E586BCFC5463C92</idno>
                    <idno type="DOI">e0263302.10.1371/journal.pone.0263302</idno>
                    <note type="submission">Received: July 27, 2021 Accepted: January 16, 2022</note>
                </biblStruct>

With DL:

                        <imprint>
                            <date type="published" when="2022-01-28">January 28, 2022</date>
                        </imprint>
                    </monogr>
                    <idno type="MD5">C1A860C14E064D1C9E586BCFC5463C92</idno>
                    <idno type="DOI">10.1371/journal.pone.0263302</idno>
                    <note type="submission">Received: July 27, 2021 Accepted: January 16, 2022</note>
                </biblStruct>
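
To compare the outputs above without reading the XML by eye, the DOI can be pulled out of each TEI fragment. This is a minimal sketch using only the Python standard library. The helper `doi_from_tei` is hypothetical, and the two fragments are shortened copies of the CRF and DL results shown above.

```python
import xml.etree.ElementTree as ET


def doi_from_tei(tei_fragment: str):
    """Return the text of the first <idno type="DOI"> element, or None."""
    root = ET.fromstring(tei_fragment)
    for element in root.iter():
        # Compare local names so this works with or without the TEI namespace.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "idno" and element.get("type") == "DOI":
            return (element.text or "").strip()
    return None


# Shortened copies of the fragments in this comment.
CRF_OUTPUT = """<biblStruct>
  <idno type="MD5">C1A860C14E064D1C9E586BCFC5463C92</idno>
  <idno type="DOI">e0263302.10.1371/journal.pone.0263302</idno>
</biblStruct>"""

DL_OUTPUT = """<biblStruct>
  <idno type="MD5">C1A860C14E064D1C9E586BCFC5463C92</idno>
  <idno type="DOI">10.1371/journal.pone.0263302</idno>
</biblStruct>"""

for name, fragment in (("CRF", CRF_OUTPUT), ("DL", DL_OUTPUT)):
    doi = doi_from_tei(fragment)
    status = "ok" if doi.startswith("10.") else "unexpected prefix"
    print(f"{name}: {doi} ({status})")
```

The `startswith("10.")` check is a rough heuristic for spotting the prefixed value, not a full DOI validation.
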
lfoppiano commented 4 days ago

The reason is that the CRF header model considers the prefix 03.... as part of the publication number:

PLoS    plos    P   PL  PLo PLoS    S   oS  LoS PLoS    BLOCKIN LINEIN  ALIGNEDLEFT SAMEFONT    SAMEFONTSIZE    0   0   INITCAP NODIGIT 0   0   0   0   0   0   0   0   NOPUNCT 0   0   1   0   <reference>
ONE one O   ON  ONE ONE E   NE  ONE ONE BLOCKIN LINEIN  ALIGNEDLEFT SAMEFONT    SAMEFONTSIZE    0   0   ALLCAP  NODIGIT 0   0   1   0   0   0   0   0   NOPUNCT 0   0   1   0   <reference>
17  17  1   17  17  17  7   17  17  17  BLOCKIN LINEIN  ALIGNEDLEFT SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  ALLDIGIT    0   0   0   0   0   0   0   0   NOPUNCT 0   0   1   0   <reference>
(   (   (   (   (   (   (   (   (   (   BLOCKIN LINEIN  ALIGNEDLEFT SAMEFONT    SAMEFONTSIZE    0   0   ALLCAP  NODIGIT 1   0   0   0   0   0   0   0   OPENBRACKET 0   0   1   0   <reference>
1   1   1   1   1   1   1   1   1   1   BLOCKIN LINEIN  ALIGNEDLEFT SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  ALLDIGIT    1   0   0   0   0   1   0   0   NOPUNCT 0   0   1   0   <reference>
)   )   )   )   )   )   )   )   )   )   BLOCKIN LINEIN  ALIGNEDLEFT SAMEFONT    SAMEFONTSIZE    0   0   ALLCAP  NODIGIT 1   0   0   0   0   0   0   0   ENDBRACKET  0   0   1   0   <reference>
:   :   :   :   :   :   :   :   :   :   BLOCKIN LINEEND ALIGNEDLEFT SAMEFONT    SAMEFONTSIZE    0   0   ALLCAP  NODIGIT 1   0   0   0   0   0   0   0   PUNCT   0   0   1   0   <reference>
e0263302    e0263302    e   e0  e02 e026    2   02  302 3302    BLOCKIN LINESTART   ALIGNEDLEFT SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  CONTAINSDIGITS  0   0   0   1   0   0   0   0   NOPUNCT 0   0   1   0   I-<pubnum>
.   .   .   .   .   .   .   .   .   .   BLOCKIN LINEIN  ALIGNEDLEFT SAMEFONT    SAMEFONTSIZE    0   0   ALLCAP  NODIGIT 1   0   0   0   0   0   0   0   DOT 0   0   1   0   <pubnum>
https   https   h   ht  htt http    s   ps  tps ttps    BLOCKIN LINEIN  ALIGNEDLEFT SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   0   0   0   0   0   1   NOPUNCT 0   0   1   0   <pubnum>
:   :   :   :   :   :   :   :   :   :   BLOCKIN LINEIN  ALIGNEDLEFT SAMEFONT    SAMEFONTSIZE    0   0   ALLCAP  NODIGIT 1   0   0   0   0   0   0   1   PUNCT   0   0   1   0   <pubnum>
/   /   /   /   /   /   /   /   /   /   BLOCKIN LINEIN  ALIGNEDLEFT SAMEFONT    SAMEFONTSIZE    0   0   ALLCAP  NODIGIT 1   0   0   0   0   0   0   1   NOPUNCT 0   0   1   0   <pubnum>
/   /   /   /   /   /   /   /   /   /   BLOCKIN LINEIN  ALIGNEDLEFT SAMEFONT    SAMEFONTSIZE    0   0   ALLCAP  NODIGIT 1   0   0   0   0   0   0   1   NOPUNCT 0   0   1   0   <pubnum>
doi doi d   do  doi doi i   oi  doi doi BLOCKIN LINEIN  ALIGNEDLEFT SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   1   0   0   0   0   0   1   NOPUNCT 0   0   1   0   <pubnum>
.   .   .   .   .   .   .   .   .   .   BLOCKIN LINEIN  ALIGNEDLEFT SAMEFONT    SAMEFONTSIZE    0   0   ALLCAP  NODIGIT 1   0   0   0   0   0   0   1   DOT 0   0   1   0   <pubnum>
org org o   or  org org g   rg  org org BLOCKIN LINEIN  ALIGNEDLEFT SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   0   0   0   1   0   1   NOPUNCT 0   0   1   0   <pubnum>
/   /   /   /   /   /   /   /   /   /   BLOCKIN LINEIN  ALIGNEDLEFT SAMEFONT    SAMEFONTSIZE    0   0   ALLCAP  NODIGIT 1   0   0   0   0   0   0   1   NOPUNCT 0   0   1   0   <pubnum>
10  10  1   10  10  10  0   10  10  10  BLOCKIN LINEIN  ALIGNEDLEFT SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  ALLDIGIT    0   0   0   0   0   1   0   1   NOPUNCT 0   0   1   0   <pubnum>
.   .   .   .   .   .   .   .   .   .   BLOCKIN LINEIN  ALIGNEDLEFT SAMEFONT    SAMEFONTSIZE    0   0   ALLCAP  NODIGIT 1   0   0   0   0   0   0   1   DOT 0   0   1   0   <pubnum>
1371    1371    1   13  137 1371    1   71  371 1371    BLOCKIN LINEIN  ALIGNEDLEFT SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  ALLDIGIT    0   0   0   1   0   0   0   1   NOPUNCT 0   0   1   0   <pubnum>
/   /   /   /   /   /   /   /   /   /   BLOCKIN LINEIN  ALIGNEDLEFT SAMEFONT    SAMEFONTSIZE    0   0   ALLCAP  NODIGIT 1   0   0   0   0   0   0   1   NOPUNCT 0   0   1   0   <pubnum>
journal journal j   jo  jou jour    l   al  nal rnal    BLOCKIN LINEIN  ALIGNEDLEFT SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   1   0   0   0   0   1   NOPUNCT 0   0   1   0   <pubnum>
.   .   .   .   .   .   .   .   .   .   BLOCKIN LINEEND ALIGNEDLEFT SAMEFONT    SAMEFONTSIZE    0   0   ALLCAP  NODIGIT 1   0   0   0   0   0   0   0   DOT 0   0   1   0   <pubnum>
pone    pone    p   po  pon pone    e   ne  one pone    BLOCKIN LINESTART   ALIGNEDLEFT SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   0   0   0   0   0   0   NOPUNCT 0   0   1   0   <pubnum>
.   .   .   .   .   .   .   .   .   .   BLOCKIN LINEIN  ALIGNEDLEFT SAMEFONT    SAMEFONTSIZE    0   0   ALLCAP  NODIGIT 1   0   0   0   0   0   0   0   DOT 0   0   1   0   <pubnum>
0263302 0263302 0   02  026 0263    2   02  302 3302    BLOCKEND    LINEEND ALIGNEDLEFT SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  ALLDIGIT    0   0   0   1   0   0   0   0   NOPUNCT 0   0   1   0   <pubnum>
Editor  editor  E   Ed  Edi Edit    r   or  tor itor    BLOCKSTART  LINESTART   ALIGNEDLEFT NEWFONT SAMEFONTSIZE    1   0   INITCAP NODIGIT 0   0   1   0   0   0   0   0   NOPUNCT 0   0   1   0   I-<other>
:   :   :   :   :   :   :   :   :   :   BLOCKIN LINEIN  ALIGNEDLEFT SAMEFONT    SAMEFONTSIZE    1   0   ALLCAP  NODIGIT 1   0   0   0   0   0   0   0   PUNCT   0   0   1   0   <other>
Primo   primo   P   Pr  Pri Prim    o   mo  imo rimo    BLOCKIN LINEIN  ALIGNEDLEFT NEWFONT SAMEFONTSIZE    0   0   INITCAP NODIGIT 0   0   0   0   0   0   0   0   NOPUNCT 0   0   1   0   I-<editor>
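
To make the effect of the labels above easier to see, here is a small sketch that groups labeled tokens into fields. It is not GROBID's implementation: `group_fields` is a hypothetical helper. It assumes the first column is the token, the last column is the label, and an `I-` prefix marks the start of a new field. The rows are copied from the listing with the feature columns omitted, and joining tokens without spaces is also an assumption.

```python
def group_fields(rows):
    """Group (token, label) pairs into a list of (label, [tokens]) fields."""
    fields = []
    for token, label in rows:
        starts_field = label.startswith("I-")
        name = label[2:] if starts_field else label
        # Open a new field on an "I-" label or whenever the label changes.
        if starts_field or not fields or fields[-1][0] != name:
            fields.append((name, []))
        fields[-1][1].append(token)
    return fields


# Token and label columns from the listing (features in between omitted).
ROWS = [
    (":", "<reference>"),
    ("e0263302", "I-<pubnum>"), (".", "<pubnum>"),
    ("https", "<pubnum>"), (":", "<pubnum>"), ("/", "<pubnum>"), ("/", "<pubnum>"),
    ("doi", "<pubnum>"), (".", "<pubnum>"), ("org", "<pubnum>"), ("/", "<pubnum>"),
    ("10", "<pubnum>"), (".", "<pubnum>"), ("1371", "<pubnum>"), ("/", "<pubnum>"),
    ("journal", "<pubnum>"), (".", "<pubnum>"), ("pone", "<pubnum>"),
    (".", "<pubnum>"), ("0263302", "<pubnum>"),
    ("Editor", "I-<other>"), (":", "<other>"),
]

for label, tokens in group_fields(ROWS):
    print(label, "".join(tokens))
```

Because `e0263302` carries the `I-<pubnum>` label, the article number and the DOI URL end up in the same field, which matches the prefixed DOI seen in the CRF output.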