kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0
3.6k stars, 461 forks

Paragraph detection issue and missing text #174

Open de-code opened 7 years ago

de-code commented 7 years ago

Hi,

Grobid appears to be quite powerful.

eLife Sciences may have some kryptonite in the form of a test PDF file with various formatting options: https://github.com/elifesciences/XML-mapping/blob/master/elife-00666.pdf

When I let Grobid loose on it, I ran into a paragraph detection issue and, probably as a consequence, missing text. If you search for, say, 'Example of a small', you will see it on page 12. The text from page 10 correctly skips page 11, which only has an image, but then continues with the header of the box, i.e. "...eLife content is delivered to more repositories and it can be Box 1. Example of a small box..." It should have continued with the right column. It then continues with the following page, 'We need to allow authors...' (omitting the subheader). The content of Box 1 and the text of the right column don't seem to appear anywhere.

This may be a particularly difficult PDF and there may be other issues. Is this likely to get fixed as part of Grobid, or should it be addressed outside as a pre-processing step? If so, what would be the best way to do the pre-processing, say to tell Grobid what the paragraphs are and let it annotate the content?

Thank you

de-code commented 7 years ago

I noticed the same or something similar happens with the first PMC manuscript, for example. However, the extracted XML looks a bit different depending on the parameters. The PMC evaluation doesn't include assets, whereas the default for fulltext extraction is to include them (it took me a while to track that down).

So with assets enabled, it starts to lose paragraphs on page 237, which happens to be after a figure: "Using simulations from the combined..." is not in the TEI XML if assets are enabled (it is when assets are disabled using the -ignoreAssets parameter).

The three immediately following paragraphs are affected as well: "This analysis provides..." "In the development of new drugs..." "It is currently unknown how different tumor types..."

Also, "Open Access This article is distributed under the terms..." is not included in that case, though it is not desired as a paragraph either. It is, however, included when assets are ignored (I guess it should be included as a license; it is tagged as back/ack in the training XML).

(I checked the pdftoxml output - it includes the missing text)
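
A quick way to confirm which paragraphs disappear is to check the opening words of each paragraph against the text content of the two TEI outputs. The following is only a minimal sketch: the function names and the inline TEI strings are hypothetical stand-ins, not real Grobid output, and in practice the strings would be read from the files produced with and without -ignoreAssets.

```python
import re
import xml.etree.ElementTree as ET


def normalize(text):
    """Collapse whitespace runs so line breaks in the XML do not hide a match."""
    return re.sub(r"\s+", " ", text).strip()


def missing_snippets(tei_xml, snippets):
    """Return the snippets that occur nowhere in the text content of the TEI."""
    root = ET.fromstring(tei_xml)
    full_text = normalize("".join(root.itertext()))
    return [s for s in snippets if normalize(s) not in full_text]


# Hypothetical stand-ins for the two outputs (NOT real Grobid results).
TEI_WITH_ASSETS = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<p>Some earlier paragraph.</p>
</body></text></TEI>"""

TEI_WITHOUT_ASSETS = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<p>Some earlier paragraph.</p>
<p>Using simulations from the combined (rest elided)</p>
<p>This analysis provides (rest elided)</p>
</body></text></TEI>"""

# Paragraph openings quoted in this comment.
SNIPPETS = [
    "Using simulations from the combined",
    "This analysis provides",
]

if __name__ == "__main__":
    print("missing with assets:   ", missing_snippets(TEI_WITH_ASSETS, SNIPPETS))
    print("missing without assets:", missing_snippets(TEI_WITHOUT_ASSETS, SNIPPETS))
```

The check matches on plain text only, so it says whether a paragraph is present somewhere in the output, not whether it was tagged correctly.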

dominic-sps commented 7 years ago

Removal of content is a major issue. Appreciate any fix for this.

kermitt2 commented 7 years ago

Hello,

The reasons why some text might be missing are:

Both factors can play in combination... The reason why using the "asset" option has an impact is that when the so-called assets are explicitly extracted (assets means the bitmaps and SVG embedded in the PDF), they are exploited to detect zones and figures, which can lead to normal paragraphs being misclassified, for instance as figure captions.

What I have planned so far is to remove the "assets" option and the extraction of bitmaps directly from the PDF, because it's not reliable: for instance, we can have PDF files with tens of thousands of bitmaps in them (in general one embedded bitmap file per image line...). It will be replaced by the extraction of figures and formulas based on coordinates after recognition of these structures.

The second axis of improvement is better PDF parsing and reading order for the PDF elements. This is work in progress with the pdf2xml fork.

A third way to improve this is to have better segmentation, which is mainly an issue of training data (there's a lack of training data for the segmentation model) and features (but we need more training data to introduce more features productively).

For instance @de-code, we could add a few additional training examples for the segmentation model to capture the eLife PDF layout, and this could make the processing of your PDF much more reliable.

de-code commented 7 years ago

Thank you for that.

I think in that case the text was missing altogether and didn't appear as a figure or table description either. I guess it would be interesting to debug that to find out what was actually causing it.

Using the assets / images to aid segmentation seems to make sense. At least I'd be interested in that.

The eLife PDF document itself is not so important. The XML will have already been created by then.

But of course it would be good to add training data. Maybe we can generate some. Should I raise another issue for that?

lfoppiano commented 6 years ago

After debugging with this document, the extraction from the PDF seems correct, and the missing parts (the boxes) end up in the <annex> when passing through the segmenter:

CME:    5'-CTAGAAATTTGTACGTGCCACAGA cme:    C   CM  CME CME:    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   ALLCAP  NODIGIT 0   0   0   0   0   0   0   0   7   0   :'--    4   6   0   0   0   0   1   <body>
3'  3'  3'  3   3'  3'  3'  BLOCKEND    PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   ALLCAP  CONTAINSDIGITS  0   0   0   0   0   0   0   0   7   0   '   1   0   0   0   0   0   1   <body>
Acknowledgements    Acknowledgements    acknowledgements    A   Ac  Ack Ackn    BLOCKSTART  PAGEIN  NEWFONT HIGHERFONT  0   0   INITCAP NODIGIT 0   0   1   0   0   0   0   0   7   1   no  0   10  0   0   0   0   1   I-<acknowledgement>
Main    thanks  main    M   Ma  Mai Main    BLOCKSTART  PAGEIN  SAMEFONT    LOWERFONT   0   0   INITCAP NODIGIT 0   1   1   0   0   0   0   0   7   1   no  0   10  0   0   0   0   1   <acknowledgement>
We  thank   we  W   We  We  We  BLOCKSTART  PAGEIN  NEWFONT HIGHERFONT  0   0   INITCAP NODIGIT 0   0   1   0   0   0   0   0   7   1   ,   1   10  0   0   0   0   1   <acknowledgement>
their   contributions.  their   t   th  the thei    BLOCKEND    PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   1   0   0   0   0   0   7   1   .   1   4   0   0   0   0   1   <acknowledgement>
Box 1.  box B   Bo  Box Box BLOCKSTART  PAGEIN  NEWFONT HIGHERFONT  0   0   INITCAP NODIGIT 0   1   1   0   0   0   0   0   7   2   .   1   10  0   0   0   0   1   I-<annex>
box box box b   bo  box box BLOCKEND    PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   1   1   0   0   0   0   0   7   2   no  0   1   0   0   0   0   1   <annex>
Donec   rhoncus donec   D   Do  Don Done    BLOCKSTART  PAGEIN  NEWFONT LOWERFONT   0   0   INITCAP NODIGIT 0   0   0   0   0   0   0   0   7   2   .   1   8   0   0   0   0   1   <annex>
vitae   enim    vitae   v   vi  vit vita    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   0   0   0   0   0   0   7   2   no  0   9   0   0   0   0   1   <annex>
arcu.   Pellentesque    arcu.   a   ar  arc arcu    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   0   0   0   0   0   0   7   2   .   1   9   0   0   0   0   1   <annex>
senectus    et  senectus    s   se  sen sene    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   0   0   0   0   0   0   7   2   -   1   9   0   0   0   0   1   <annex>
pis egestas.    pis p   pi  pis pis BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   0   0   0   0   0   0   7   2   .   1   9   0   0   0   0   1   <annex>
rutrum. Praesent    rutrum. r   ru  rut rutr    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   0   0   0   0   0   0   7   2   .,- 3   10  0   0   0   0   1   <annex>
tudin   purus   tudin   t   tu  tud tudi    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   0   0   0   0   0   0   7   2   .   1   8   0   0   0   0   1   <annex>
arcu.   Fusce   arcu.   a   ar  arc arcu    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   0   0   0   0   0   0   7   2   ... 3   9   0   0   0   0   1   <annex>
Suspendisse eu  suspendisse S   Su  Sus Susp    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   INITCAP NODIGIT 0   0   0   0   0   0   0   0   7   2   no  0   9   0   0   0   0   1   <annex>
imperdiet.  (see    imperdiet.  i   im  imp impe    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   0   0   0   0   0   0   7   2   .(,;.,  6   9   0   0   0   0   1   <annex>
2014;   and 2014;   2   20  201 2014    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   ALLCAP  CONTAINSDIGITS  0   0   0   0   1   0   0   0   7   2   ;.,)    4   8   0   0   0   0   1   <annex>
ultrices    vehicula    ultrices    u   ul  ult ultr    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   0   0   0   0   0   0   7   2   ,-  2   9   0   0   0   0   1   <annex>
per suscipit.   per p   pe  per per BLOCKEND    PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   1   0   0   0   0   0   7   2   ..  2   7   0   0   0   0   1   <annex>
DOI:    https://doi.org/10.7554/eLife.00666.002 doi:    D   DO  DOI DOI:    BLOCKSTART  PAGEIN  NEWFONT LOWERFONT   0   0   ALLCAP  NODIGIT 0   0   0   0   0   0   0   0   7   4   :://././..  10  10  0   0   0   0   1   <annex>
Box 2.  box B   Bo  Box Box BLOCKSTART  PAGEIN  NEWFONT HIGHERFONT  0   0   INITCAP NODIGIT 0   1   1   0   0   0   0   0   7   4   .   1   10  0   0   0   0   1   <annex>
This    box this    T   Th  Thi This    BLOCKSTART  PAGEIN  NEWFONT LOWERFONT   0   0   INITCAP NODIGIT 0   0   1   0   0   0   0   0   7   5   -.,.    4   9   0   0   0   0   1   <annex>
vel rhoncus vel v   ve  vel vel BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   0   0   0   0   0   0   7   5   ..  2   9   0   0   0   0   1   <annex>
faucibus.   Vivamus faucibus.   f   fa  fau fauc    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   0   0   0   0   0   0   7   5   ..,,    4   10  0   0   0   0   1   <annex>
odio    purus   odio    o   od  odi odio    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   0   0   0   0   0   0   7   5   ,.  2   9   0   0   0   0   1   <annex>
hendrerit.  Praesent    hendrerit.  h   he  hen hend    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   0   0   0   0   0   0   7   5   ...-    4   10  0   0   0   0   1   <annex>
quam    lobortis    quam    q   qu  qua quam    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   0   0   0   0   0   0   7   5   ,., 3   9   0   0   0   0   1   <annex>
commodo,    eros    commodo,    c   co  com comm    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   0   0   0   0   0   0   7   5   ,,. 3   9   0   0   0   0   1   <annex>
efficitur   tincidunt.  efficitur   e   ef  eff effi    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   0   0   0   0   0   0   7   5   .,,,-   5   9   0   0   0   0   1   <annex>
pat velit   pat p   pa  pat pat BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   1   1   0   0   0   0   0   7   5   ..- 3   9   0   0   0   0   1   <annex>
lentesque,  ipsum   lentesque,  l   le  len lent    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   0   0   0   0   0   0   7   5   ,,,.    4   9   0   0   0   0   1   <annex>
Vestibulum  sit vestibulum  V   Ve  Ves Vest    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   INITCAP NODIGIT 0   0   0   0   0   0   0   0   7   5   .   1   10  0   0   0   0   1   <annex>
urna    lobortis,   urna    u   ur  urn urna    BLOCKEND    PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   0   0   0   0   0   0   7   5   ,;.,.,. 7   8   0   0   0   0   1   <annex>
Box 2-Figure    box B   Bo  Box Box BLOCKSTART  PAGEIN  NEWFONT LOWERFONT   0   0   INITCAP NODIGIT 0   1   1   0   0   0   0   0   7   9   -.  2   10  0   0   0   0   1   <annex>
DOI:    DOI:    doi:    D   DO  DOI DOI:    BLOCKEND    PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   ALLCAP  NODIGIT 0   0   0   0   0   0   0   0   7   9   :   1   1   0   0   0   0   1   <annex>
Donec   rhoncus donec   D   Do  Don Done    BLOCKSTART  PAGEIN  NEWFONT HIGHERFONT  0   0   INITCAP NODIGIT 0   0   0   0   0   0   0   0   7   9   ..- 3   10  0   0   0   0   1   <annex>
tesque  habitant    tesque  t   te  tes tesq    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   0   0   0   0   0   0   7   9   .   1   9   0   0   0   0   1   <annex>
nunc    id  nunc    n   nu  nun nunc    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   0   0   0   0   0   0   7   9   .,. 3   9   0   0   0   0   1   <annex>
fermentum   arcu.   fermentum   f   fe  fer ferm    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   0   0   0   0   0   0   7   9   ... 3   9   0   0   0   0   1   <annex>
imperdiet.  Vestibulum  imperdiet.  i   im  imp impe    BLOCKEND    PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   0   0   0   0   0   0   7   9   .,..    4   9   0   0   0   0   1   <annex>
DOI:    https://doi.org/10.7554/eLife.00666.002 doi:    D   DO  DOI DOI:    BLOCKSTART  PAGEIN  NEWFONT LOWERFONT   0   0   ALLCAP  NODIGIT 0   0   0   0   0   0   0   0   7   11  :://././..  10  10  0   0   0   0   1   I-<header>
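
To see at a glance where the segmenter switches labels in a dump like the one above, the lines can be grouped into runs of consecutive identical labels. This is a rough sketch under assumptions read off the dump, not a documented format: the first whitespace-separated column is taken as the first token of the line, the last column as the predicted label, and an "I-" prefix as the start of a new segment. The function names are hypothetical and the sample lines are abbreviated to a few columns.

```python
def parse_feature_line(line):
    """Split one feature line into (first_token, label, begins_segment).

    Assumes the first column is the token and the last column is the label;
    an "I-" prefix on the label is treated as the start of a new segment.
    """
    cols = line.split()
    if len(cols) < 2:
        return None
    raw_label = cols[-1]
    begins = raw_label.startswith("I-")
    label = raw_label[2:] if begins else raw_label
    return cols[0], label, begins


def label_runs(lines):
    """Group consecutive lines with the same label into (label, first_token, n_lines)."""
    runs = []
    for line in lines:
        parsed = parse_feature_line(line)
        if parsed is None:
            continue
        token, label, begins = parsed
        if runs and not begins and runs[-1][0] == label:
            runs[-1][2] += 1  # same segment continues
        else:
            runs.append([label, token, 1])  # new segment starts here
    return [tuple(run) for run in runs]


# Abbreviated sample: only a few columns of the original lines are kept.
SAMPLE = """\
3' 3' BLOCKEND <body>
Acknowledgements Acknowledgements BLOCKSTART I-<acknowledgement>
We thank BLOCKSTART <acknowledgement>
Box 1. BLOCKSTART I-<annex>
Donec rhoncus BLOCKSTART <annex>
DOI: https://doi.org/10.7554/eLife.00666.002 BLOCKSTART I-<header>
"""

if __name__ == "__main__":
    for label, token, count in label_runs(SAMPLE.splitlines()):
        print(f"{label:20} starts at {token!r:20} ({count} lines)")
```

On the sample this yields four runs, with the "Box" line opening the <annex> run, which matches the observation that the boxes are labelled as annex rather than body.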

Darrshan-Sankar commented 2 months ago

I am experiencing the same issue in 2024 too.