kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Data availability tokens misclassified #1197

Open lfoppiano opened 2 weeks ago


This issue occurs with the DeLFT models: the final part of the availability statement is misclassified as <abstract>. With the CRF model, the availability statement is instead truncated at the end of the page. In principle, adding this document to the training data should therefore benefit both architectures.


PDF (CC-BY): 11_10.1371_journal.pone.0215651.pdf
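To locate this kind of error in labelled output like the rows below, one option is to scan for label changes that happen in the middle of a layout block. The following is only a minimal sketch, not part of GROBID: the helper names (`parse_row`, `mid_block_transitions`) are hypothetical. It assumes each row is whitespace-separated, with the token first, the label last, and one of `BLOCKSTART`, `BLOCKIN` or `BLOCKEND` among the features.

```python
from typing import List, NamedTuple, Tuple

# Layout flags that appear in the feature rows pasted in this issue.
BLOCK_FLAGS = {"BLOCKSTART", "BLOCKIN", "BLOCKEND"}


class Transition(NamedTuple):
    index: int           # position of the row in the input list
    token: str           # token at which the label changes
    previous_label: str  # label of the preceding row
    label: str           # label of this row
    block_status: str    # BLOCKIN / BLOCKEND ("" if no flag was found)


def parse_row(line: str) -> Tuple[str, str, str]:
    """Split one feature row into (token, label, block flag).

    Assumption: the token is the first field and the label is the last one.
    The "I-" prefix marking the start of a labelled segment is dropped.
    """
    fields = line.split()
    token, label = fields[0], fields[-1]
    if label.startswith("I-"):
        label = label[2:]
    block = next((f for f in fields[1:-1] if f in BLOCK_FLAGS), "")
    return token, label, block


def mid_block_transitions(lines: List[str]) -> List[Transition]:
    """Return label changes that do not coincide with the start of a block.

    This is a heuristic: it produces candidates to review, not confirmed errors.
    """
    found = []
    previous = None
    for i, line in enumerate(lines):
        if not line.strip():
            previous = None  # a blank line separates documents
            continue
        token, label, block = parse_row(line)
        if previous is not None and label != previous and block != "BLOCKSTART":
            found.append(Transition(i, token, previous, label, block))
        previous = label
    return found
```

Run on the rows below, this should report the change from <availability> to <abstract> at `of`. It would also report the change from <other> to <funding> at `The`, which looks legitimate, so the output is a list of candidates to check by hand rather than a list of errors.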

sufficient  sufficient  s   su  suf suff    t   nt  ent ient    BLOCKIN LINEIN  ALIGNEDLEFT SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   1   0   0   0   0   0   NOPUNCT 0   0   1   0   <availability>
for for f   fo  for for r   or  for for BLOCKIN LINEEND ALIGNEDLEFT SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   1   0   0   0   0   0   NOPUNCT 0   0   1   0   <availability>
calibration calibration c   ca  cal cali    n   on  ion tion    BLOCKIN LINESTART   ALIGNEDLEFT SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   1   0   0   0   0   0   NOPUNCT 0   0   1   0   <availability>
and and a   an  and and d   nd  and and BLOCKIN LINEIN  ALIGNEDLEFT SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   1   0   0   0   0   0   NOPUNCT 0   0   1   0   <availability>
validation  validation  v   va  val vali    n   on  ion tion    BLOCKIN LINEIN  ALIGNEDLEFT SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   1   0   0   0   0   0   NOPUNCT 0   0   1   0   <availability>
of  of  o   of  of  of  f   of  of  of  BLOCKIN LINEIN  ALIGNEDLEFT SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   1   0   0   1   0   0   NOPUNCT 0   0   1   0   <abstract>
the the t   th  the the e   he  the the BLOCKIN LINEIN  ALIGNEDLEFT SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   1   0   0   0   0   0   NOPUNCT 0   0   1   0   <abstract>
model   model   m   mo  mod mode    l   el  del odel    BLOCKIN LINEIN  ALIGNEDLEFT SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   1   0   0   1   0   0   NOPUNCT 0   0   1   0   <abstract>
.   .   .   .   .   .   .   .   .   .   BLOCKEND    LINEEND ALIGNEDLEFT SAMEFONT    SAMEFONTSIZE    0   0   ALLCAP  NODIGIT 1   0   0   0   0   0   0   0   DOT 0   0   1   0   <abstract>
Funding funding F   Fu  Fun Fund    g   ng  ing ding    BLOCKSTART  LINESTART   ALIGNEDLEFT NEWFONT SAMEFONTSIZE    1   0   INITCAP NODIGIT 0   0   1   0   0   0   0   0   NOPUNCT 0   0   1   0   <other>
:   :   :   :   :   :   :   :   :   :   BLOCKIN LINEIN  ALIGNEDLEFT SAMEFONT    SAMEFONTSIZE    1   0   ALLCAP  NODIGIT 1   0   0   0   0   0   0   0   PUNCT   0   0   1   0   <other>
The the T   Th  The The e   he  The The BLOCKIN LINEIN  ALIGNEDLEFT NEWFONT SAMEFONTSIZE    0   0   INITCAP NODIGIT 0   0   1   0   0   0   0   0   NOPUNCT 0   0   1   0   I-<funding>
authors authors a   au  aut auth    s   rs  ors hors    BLOCKIN LINEIN  ALIGNEDLEFT SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   1   0   0   0   0   0   NOPUNCT 0   0   1   0   <funding>
received    received    r   re  rec rece    d   ed  ved ived    BLOCKIN LINEIN  ALIGNEDLEFT SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   1   0   0   0   0   0   NOPUNCT 0   0   1   0   <funding>
no  no  n   no  no  no  o   no  no  no  BLOCKIN LINEIN  ALIGNEDLEFT SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   1   0   0   0   0   0   NOPUNCT 0   0   1   0   <funding>