kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Annex and body misclassification #1198

Open lfoppiano opened 2 weeks ago

lfoppiano commented 2 weeks ago

In this article, the segmentation model correctly identifies the annex (starting at STAR+METHODS), but then misclassifies a table as header (starting at REAGENT) and the following page as body (starting at foveated).

tivity  in  tivity  t   ti  tiv tivi    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   0   0   0   0   0   0   9   11  .....,  6   8   0   0   0   0   1   <references>
E3516-E3525.    E3516-E3525.    e3516-e3525.    E   E3  E35 E351    BLOCKEND    PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   ALLCAP  CONTAINSDIGITS  0   0   0   0   0   0   0   0   9   11  -.  2   1   0   0   0   0   1   <references>
990 Current 990 9   99  990 990 BLOCKSTART  PAGEEND NEWFONT HIGHERFONT  0   0   NOCAPS  ALLDIGIT    0   0   0   0   0   0   0   0   9   11  ,-,,    4   10  0   0   1   0   0   I-<page>
STAR+METHODS    STAR+METHODS    star+methods    S   ST  STA STAR    BLOCKSTART  PAGESTART   NEWFONT HIGHERFONT  0   0   ALLCAP  NODIGIT 0   0   0   0   0   0   0   0   9   0   no  0   10  0   0   0   0   1   I-<annex>
KEY RESOURCE    key K   KE  KEY KEY BLOCKSTART  PAGEIN  SAMEFONT    LOWERFONT   0   0   ALLCAP  NODIGIT 0   1   1   0   0   0   0   0   9   0   no  0   10  0   1   0   0   1   <annex>
CONTACT FOR contact C   CO  CON CONT    BLOCKSTART  PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   ALLCAP  NODIGIT 0   0   1   0   0   0   0   0   9   0   no  0   10  0   1   0   0   1   <annex>
Further information further F   Fu  Fur Furt    BLOCKSTART  PAGEIN  NEWFONT SAMEFONTSIZE    0   0   INITCAP NODIGIT 0   0   1   0   0   0   0   0   9   0   /,(.@   5   10  0   1   0   0   1   <annex>
gmail.com). gmail.com). gmail.com). g   gm  gma gmai    BLOCKEND    PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   0   0   0   0   0   0   9   0   .). 3   0   0   1   0   0   1   <annex>
EXPERIMENTAL    MODEL   experimental    E   EX  EXP EXPE    BLOCKSTART  PAGEIN  NEWFONT SAMEFONTSIZE    0   0   ALLCAP  NODIGIT 0   0   1   0   0   0   0   0   9   0   no  0   10  0   1   0   0   1   <annex>
Ten human   ten T   Te  Ten Ten BLOCKSTART  PAGEIN  NEWFONT SAMEFONTSIZE    0   0   INITCAP NODIGIT 0   0   1   0   0   0   0   0   9   0   ,(;-)   5   10  0   0   0   0   1   <annex>
in  the in  i   in  in  in  BLOCKEND    PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   1   1   0   0   0   0   0   9   0   ()..    4   8   0   0   0   0   1   <annex>
METHOD  DETAILS method  M   ME  MET METH    BLOCKSTART  PAGEIN  NEWFONT SAMEFONTSIZE    0   0   ALLCAP  NODIGIT 0   0   1   0   0   0   0   0   9   1   no  0   10  0   0   0   0   1   <annex>
Model   overview    model   M   Mo  Mod Mode    BLOCKSTART  PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   INITCAP NODIGIT 0   0   1   0   0   0   0   0   9   1   no  0   1   0   0   0   0   1   <annex>
All non-grid    all A   Al  All All BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   INITCAP NODIGIT 0   0   1   0   0   0   0   0   9   1   -.  2   9   0   0   0   0   1   <annex>
rate    maps    rate    r   ra  rat rate    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   1   0   0   0   0   0   9   1   (),(    4   10  0   0   0   0   1   <annex>
dynamics    see dynamics    d   dy  dyn dyna    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   1   0   0   0   0   0   9   1   ..,[,]).    8   2   0   0   0   0   1   <annex>
Grayscale   images  grayscale   G   Gr  Gra Gray    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   INITCAP NODIGIT 0   0   0   0   0   0   0   0   9   1   ()-.    4   9   0   0   0   0   1   <annex>
components, the components, c   co  com comp    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   1   0   0   0   0   0   9   1   ,:,-(   5   9   0   0   0   0   1   <annex>
cells/pixels)   with    cells/pixels)   c   ce  cel cell    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   0   0   0   0   0   0   9   1   /)();''(    8   9   0   0   0   0   1   <annex>
within  a   within  w   wi  wit with    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   1   0   0   0   0   0   9   1   ,);(    4   9   0   0   0   0   1   <annex>
stimulus,   see stimulus,   s   st  sti stim    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   1   0   0   0   0   0   9   1   ,..,[,,]).  10  2   0   0   0   0   1   <annex>
Banks   of  banks   B   Ba  Ban Bank    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   INITCAP NODIGIT 0   1   1   0   0   0   0   0   9   1   (,) 3   9   0   0   0   0   1   <annex>
cells   with    cells   c   ce  cel cell    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   1   0   0   0   0   0   9   1   ().()   5   9   0   0   0   0   1   <annex>
of  approximately   of  o   of  of  of  BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   1   0   0   0   0   0   9   1   [,].,,- 7   9   0   0   0   0   1   <annex>
rons    express rons    r   ro  ron rons    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   0   0   0   0   0   0   9   1   ,,().   5   8   0   0   0   0   1   <annex>
Stimuli (the    stimuli S   St  Sti Stim    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   INITCAP NODIGIT 0   0   1   0   0   0   0   0   9   1   ()().   5   9   0   0   0   0   1   <annex>
between the between b   be  bet betw    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   1   0   0   0   0   0   9   1   .,(..,,,.)- 11  9   0   0   0   0   1   <annex>
ulus    is  ulus    u   ul  ulu ulus    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   0   0   0   0   0   0   9   1   .   1   9   0   0   0   0   1   <annex>
feature detectors   feature f   fe  fea feat    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   1   0   0   0   0   0   9   1   ()()    4   9   0   0   0   0   1   <annex>
is  then    is  i   is  is  is  BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   1   0   0   0   0   0   9   1   .,( 3   9   0   0   0   0   1   <annex>
by  grid    by  b   by  by  by  BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   1   0   0   0   0   0   9   1   ).,---  6   10  0   0   0   0   1   <annex>
stimulus    identity    stimulus    s   st  sti stim    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   1   0   0   0   0   0   9   1   ,., 3   9   0   0   0   0   1   <annex>
small   number  small   s   sm  sma smal    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   1   1   0   0   0   0   0   9   1   ,   1   9   0   0   0   0   1   <annex>
feature label   feature f   fe  fea feat    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   1   0   0   0   0   0   9   1   ().,    4   9   0   0   0   0   1   <annex>
stimuli from    stimuli s   st  sti stim    BLOCKEND    PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   1   0   0   0   0   0   9   1   .   1   2   0   0   0   0   1   <annex>
Action/Perception   cycles  action/perception   A   Ac  Act Acti    BLOCKSTART  PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   INITCAP NODIGIT 0   0   0   0   0   0   0   0   9   8   /   1   1   0   0   0   0   1   <annex>
An  action-perception   an  A   An  An  An  BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   INITCAP NODIGIT 0   1   1   0   0   0   0   0   9   8   -:( 3   9   0   0   0   0   1   <annex>
feature is  feature f   fe  fea feat    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   1   0   0   0   0   0   9   8   -,).()  6   8   0   0   0   0   1   <annex>
feature label   feature f   fe  fea feat    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   1   0   0   0   0   0   9   8   ()..    4   9   0   0   0   0   1   <annex>
to  the to  t   to  to  to  BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   1   0   0   0   0   0   9   8   .   1   10  0   0   0   0   1   <annex>
ensure  a   ensure  e   en  ens ensu    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   1   0   0   0   0   0   9   8   .,  2   9   0   0   0   0   1   <annex>
competing   hypotheses  competing   c   co  com comp    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   1   0   0   0   0   0   9   8   .   1   9   0   0   0   0   1   <annex>
the next    the t   th  the the BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   1   0   0   0   0   0   9   8   (). 3   9   0   0   0   0   1   <annex>
activity,   which   activity,   a   ac  act acti    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   1   0   0   0   0   0   9   8   ,-- 3   8   0   0   0   0   1   <annex>
[26,    76].    [26,    [   [2  [26 [26,    BLOCKEND    PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   ALLCAP  CONTAINSDIGITS  0   0   0   0   0   0   0   0   9   8   [,]..(  6   9   0   0   0   0   1   <annex>
REAGENT or  reagent R   RE  REA REAG    BLOCKSTART  PAGEIN  NEWFONT LOWERFONT   0   0   ALLCAP  NODIGIT 0   0   0   0   0   0   0   0   9   11  no  0   10  0   1   0   0   1   I-<header>
SOURCE  SOURCE  source  S   SO  SOU SOUR    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   ALLCAP  NODIGIT 0   0   1   0   0   0   0   0   9   11  no  0   3   0   1   0   0   1   <header>
IDENTIFYIER IDENTIFYIER identifyier I   ID  IDE IDEN    BLOCKEND    PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   ALLCAP  NODIGIT 0   0   0   0   0   0   0   0   9   11  no  0   6   0   1   0   0   1   <header>
Software    and software    S   So  Sof Soft    BLOCKSTART  PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   INITCAP NODIGIT 0   0   1   0   0   0   0   0   9   11  no  0   10  0   1   0   0   1   <header>
MATLAB  R2017b  matlab  M   MA  MAT MATL    BLOCKSTART  PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   ALLCAP  NODIGIT 0   0   0   0   0   0   0   0   9   11  no  0   5   0   1   0   0   1   <header>
https://www.mathworks.com/  https://www.mathworks.com/  https://www.mathworks.com/  h   ht  htt http    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   0   0   0   0   0   1   9   11  ://../  6   10  0   1   0   0   1   <header>
https://www.mathworks.com/  https://www.mathworks.com/  https://www.mathworks.com/  h   ht  htt http    BLOCKEND    PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   0   0   0   0   0   1   9   11  ://../  6   10  0   1   0   0   1   <header>
Custom  MATLAB  custom  C   Cu  Cus Cust    BLOCKSTART  PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   INITCAP NODIGIT 0   0   1   0   0   0   0   0   10  11  no  0   6   0   1   0   0   1   <header>
This    Article this    T   Th  Thi This    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   INITCAP NODIGIT 0   0   1   0   0   0   0   0   10  11  no  0   4   0   1   0   0   1   <header>
https://github.com/bicanski https://github.com/bicanski https://github.com/bicanski h   ht  htt http    BLOCKEND    PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   0   0   0   0   0   1   10  11  ://./   5   10  0   1   0   0   1   <header>
Current Biology current C   Cu  Cur Curr    BLOCKSTART  PAGEEND SAMEFONT    SAMEFONTSIZE    0   0   INITCAP NODIGIT 0   0   1   0   0   0   0   0   10  11  ,-.-,,  6   10  0   0   1   1   0   <header>
foveated    feature)    foveated    f   fo  fov fove    BLOCKSTART  PAGESTART   SAMEFONT    HIGHERFONT  0   0   NOCAPS  NODIGIT 0   0   0   0   0   0   0   0   10  0   )() 3   9   0   0   0   0   1   I-<body>
the next    the t   th  the the BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   1   0   0   0   0   0   10  0   -().    4   9   0   0   0   0   1   <body>
Randomness  is  randomness  R   Ra  Ran Rand    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   INITCAP NODIGIT 0   0   0   0   0   0   0   0   10  0   ---,    4   9   0   0   0   0   1   <body>
associated  with    associated  a   as  ass asso    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   1   0   0   0   0   0   10  0   (). 3   9   0   0   0   0   1   <body>
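
To make the label changes in the dump above easier to spot, here is a minimal Python sketch that lists every point where the predicted label changes. It assumes only what the dump shows: fields are whitespace-separated, the last field is the label, and an `I-` prefix marks the start of a segment. The function name `label_transitions` and the shortened sample rows are hypothetical and for illustration only; they are not part of GROBID.

```python
def label_transitions(lines):
    """Return (row_index, first_token, previous_label, new_label) at each label change."""
    transitions = []
    previous = None
    for index, line in enumerate(lines):
        fields = line.split()
        if not fields:
            continue  # skip blank rows
        label = fields[-1]
        if label.startswith("I-"):
            label = label[2:]  # drop the segment-start prefix
        if label != previous:
            transitions.append((index, fields[0], previous, label))
            previous = label
    return transitions


# Rows shortened from the dump above: first token, two layout features, label.
sample = [
    "E3516-E3525. BLOCKEND PAGEIN <references>",
    "990 BLOCKSTART PAGEEND I-<page>",
    "STAR+METHODS BLOCKSTART PAGESTART I-<annex>",
    "KEY BLOCKSTART PAGEIN <annex>",
    "REAGENT BLOCKSTART PAGEIN I-<header>",
    "SOURCE BLOCKIN PAGEIN <header>",
    "foveated BLOCKSTART PAGESTART I-<body>",
]

transitions = label_transitions(sample)
for index, token, old, new in transitions:
    print(f"row {index}: {token!r} {old} -> {new}")
```

Run on the full dump, this should report the two suspect transitions described above: `<annex>` to `<header>` at REAGENT and `<header>` to `<body>` at foveated.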

The PDF (CC-BY): 9_10.1016_j.cub.2019.01.077.pdf