Open de-code opened 7 years ago
I noticed the same or something similar happens with the first PMC manuscript, for example. However, the extracted XML looks a bit different depending on the parameters. The PMC evaluation doesn't include assets, whereas the default for fulltext extraction is to include them (it took me a while to track that down).
So with assets enabled, it starts to lose paragraphs on page 237, which happens to be after a figure:
"Using simulations from the combined..." is not in the TEI XML if assets are enabled (it is when assets are disabled using the -ignoreAssets parameter).
The three immediately following paragraphs are affected as well: "This analysis provides..." "In the development of new drugs..." "It is currently unknown how different tumor types..."
Also, "Open Access This article is distributed under the terms..." is not included in that case, although it is not wanted as a paragraph either. It is, however, included when assets are ignored (I guess it should be included as a license; it is tagged as back/ack in the training XML).
(I checked the pdftoxml output - it includes the missing text)
Removal of content is a major issue. Appreciate any fix for this.
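For anyone wanting to reproduce the comparison, here is a minimal sketch of how the two TEI outputs could be checked for the paragraph openings quoted above. It assumes you already have both TEI XML results as strings (one produced with assets, one with -ignoreAssets). The function names and the toy TEI snippets are made up for illustration and are not part of Grobid.

```python
import xml.etree.ElementTree as ET


def normalise(text):
    """Collapse all runs of whitespace so line breaks do not affect matching."""
    return " ".join(text.split())


def tei_text(tei_xml):
    """Return all text content of a TEI document as one normalised string."""
    root = ET.fromstring(tei_xml)
    return normalise(" ".join(root.itertext()))


def missing_openings(tei_xml, openings):
    """Return the paragraph openings that do not occur anywhere in the TEI text."""
    text = tei_text(tei_xml)
    return [o for o in openings if normalise(o) not in text]


# Paragraph openings quoted in this comment.
OPENINGS = [
    "Using simulations from the combined",
    "This analysis provides",
]

# Toy stand-ins for the two real outputs (hypothetical, heavily shortened).
TEI_WITH_ASSETS = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    "<p>Some earlier paragraph.</p>"
    "</body></text></TEI>"
)
TEI_WITHOUT_ASSETS = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    "<p>Some earlier paragraph.</p>"
    "<p>Using simulations from the\n combined ...</p>"
    "<p>This analysis provides ...</p>"
    "</body></text></TEI>"
)

if __name__ == "__main__":
    print("missing with assets:   ", missing_openings(TEI_WITH_ASSETS, OPENINGS))
    print("missing without assets:", missing_openings(TEI_WITHOUT_ASSETS, OPENINGS))
```

The check looks at the whole document text rather than only `<p>` elements, so it also tells you when a paragraph has ended up somewhere else (for example in a figure description) instead of being dropped.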
Hello,
The reasons why some text might be missing are:
PDF parsing issues, in particular problems with capturing the reading order,
misclassification of zones at the segmentation level (for instance, a paragraph is wrongly identified as annex) and at the full text level (for instance, a paragraph is wrongly recognized as a figure caption).
Both factors can combine. The reason why using the "asset" option has an impact is that when the so-called assets are explicitly extracted (assets are the bitmaps and SVG embedded in the PDF), they are exploited to detect zones and figures, which can cause normal paragraphs to be misclassified, for instance as figure captions.
What I have planned so far is to remove the "assets" option and the extraction of bitmaps directly from the PDF, because it's not reliable: for instance, we can have PDF files with tens of thousands of bitmaps in them (in general one embedded bitmap file per image line...). It will be replaced by the extraction of figures and formulas based on coordinates after recognition of these structures.
The second axis of improvement is better PDF parsing and reading order for the PDF elements. This is work in progress with the pdf2xml fork.
The third way to improve this is to have better segmentation, which is mainly an issue of training data (there's a lack of training data for the segmentation model) and features (but we need more training data to introduce more features productively).
For instance, @de-code, we could add a few additional training examples for the segmentation model to capture the eLife PDF layout, and this could make the processing of your PDF much more reliable.
Thank you for that.
I think in that case the text was missing altogether and didn't appear as a figure or table description either. I guess it would be interesting to debug that to find out what was actually causing it.
Using the assets / images to aid segmentation seems to make sense. At least I'd be interested in that.
The eLife PDF document itself is not so important. The XML will have already been created by then.
But of course it would be good to add training data. Maybe we can generate some. Should I raise another issue for that?
After debugging with this document, the extraction from the PDF seems correct, and the missing parts (the boxes) end up in the <annex> when passing through the segmenter:
CME: 5'-CTAGAAATTTGTACGTGCCACAGA cme: C CM CME CME: BLOCKIN PAGEIN SAMEFONT SAMEFONTSIZE 0 0 ALLCAP NODIGIT 0 0 0 0 0 0 0 0 7 0 :'-- 4 6 0 0 0 0 1 <body>
3' 3' 3' 3 3' 3' 3' BLOCKEND PAGEIN SAMEFONT SAMEFONTSIZE 0 0 ALLCAP CONTAINSDIGITS 0 0 0 0 0 0 0 0 7 0 ' 1 0 0 0 0 0 1 <body>
Acknowledgements Acknowledgements acknowledgements A Ac Ack Ackn BLOCKSTART PAGEIN NEWFONT HIGHERFONT 0 0 INITCAP NODIGIT 0 0 1 0 0 0 0 0 7 1 no 0 10 0 0 0 0 1 I-<acknowledgement>
Main thanks main M Ma Mai Main BLOCKSTART PAGEIN SAMEFONT LOWERFONT 0 0 INITCAP NODIGIT 0 1 1 0 0 0 0 0 7 1 no 0 10 0 0 0 0 1 <acknowledgement>
We thank we W We We We BLOCKSTART PAGEIN NEWFONT HIGHERFONT 0 0 INITCAP NODIGIT 0 0 1 0 0 0 0 0 7 1 , 1 10 0 0 0 0 1 <acknowledgement>
their contributions. their t th the thei BLOCKEND PAGEIN SAMEFONT SAMEFONTSIZE 0 0 NOCAPS NODIGIT 0 0 1 0 0 0 0 0 7 1 . 1 4 0 0 0 0 1 <acknowledgement>
Box 1. box B Bo Box Box BLOCKSTART PAGEIN NEWFONT HIGHERFONT 0 0 INITCAP NODIGIT 0 1 1 0 0 0 0 0 7 2 . 1 10 0 0 0 0 1 I-<annex>
box box box b bo box box BLOCKEND PAGEIN SAMEFONT SAMEFONTSIZE 0 0 NOCAPS NODIGIT 0 1 1 0 0 0 0 0 7 2 no 0 1 0 0 0 0 1 <annex>
Donec rhoncus donec D Do Don Done BLOCKSTART PAGEIN NEWFONT LOWERFONT 0 0 INITCAP NODIGIT 0 0 0 0 0 0 0 0 7 2 . 1 8 0 0 0 0 1 <annex>
vitae enim vitae v vi vit vita BLOCKIN PAGEIN SAMEFONT SAMEFONTSIZE 0 0 NOCAPS NODIGIT 0 0 0 0 0 0 0 0 7 2 no 0 9 0 0 0 0 1 <annex>
arcu. Pellentesque arcu. a ar arc arcu BLOCKIN PAGEIN SAMEFONT SAMEFONTSIZE 0 0 NOCAPS NODIGIT 0 0 0 0 0 0 0 0 7 2 . 1 9 0 0 0 0 1 <annex>
senectus et senectus s se sen sene BLOCKIN PAGEIN SAMEFONT SAMEFONTSIZE 0 0 NOCAPS NODIGIT 0 0 0 0 0 0 0 0 7 2 - 1 9 0 0 0 0 1 <annex>
pis egestas. pis p pi pis pis BLOCKIN PAGEIN SAMEFONT SAMEFONTSIZE 0 0 NOCAPS NODIGIT 0 0 0 0 0 0 0 0 7 2 . 1 9 0 0 0 0 1 <annex>
rutrum. Praesent rutrum. r ru rut rutr BLOCKIN PAGEIN SAMEFONT SAMEFONTSIZE 0 0 NOCAPS NODIGIT 0 0 0 0 0 0 0 0 7 2 .,- 3 10 0 0 0 0 1 <annex>
tudin purus tudin t tu tud tudi BLOCKIN PAGEIN SAMEFONT SAMEFONTSIZE 0 0 NOCAPS NODIGIT 0 0 0 0 0 0 0 0 7 2 . 1 8 0 0 0 0 1 <annex>
arcu. Fusce arcu. a ar arc arcu BLOCKIN PAGEIN SAMEFONT SAMEFONTSIZE 0 0 NOCAPS NODIGIT 0 0 0 0 0 0 0 0 7 2 ... 3 9 0 0 0 0 1 <annex>
Suspendisse eu suspendisse S Su Sus Susp BLOCKIN PAGEIN SAMEFONT SAMEFONTSIZE 0 0 INITCAP NODIGIT 0 0 0 0 0 0 0 0 7 2 no 0 9 0 0 0 0 1 <annex>
imperdiet. (see imperdiet. i im imp impe BLOCKIN PAGEIN SAMEFONT SAMEFONTSIZE 0 0 NOCAPS NODIGIT 0 0 0 0 0 0 0 0 7 2 .(,;., 6 9 0 0 0 0 1 <annex>
2014; and 2014; 2 20 201 2014 BLOCKIN PAGEIN SAMEFONT SAMEFONTSIZE 0 0 ALLCAP CONTAINSDIGITS 0 0 0 0 1 0 0 0 7 2 ;.,) 4 8 0 0 0 0 1 <annex>
ultrices vehicula ultrices u ul ult ultr BLOCKIN PAGEIN SAMEFONT SAMEFONTSIZE 0 0 NOCAPS NODIGIT 0 0 0 0 0 0 0 0 7 2 ,- 2 9 0 0 0 0 1 <annex>
per suscipit. per p pe per per BLOCKEND PAGEIN SAMEFONT SAMEFONTSIZE 0 0 NOCAPS NODIGIT 0 0 1 0 0 0 0 0 7 2 .. 2 7 0 0 0 0 1 <annex>
DOI: https://doi.org/10.7554/eLife.00666.002 doi: D DO DOI DOI: BLOCKSTART PAGEIN NEWFONT LOWERFONT 0 0 ALLCAP NODIGIT 0 0 0 0 0 0 0 0 7 4 :://././.. 10 10 0 0 0 0 1 <annex>
Box 2. box B Bo Box Box BLOCKSTART PAGEIN NEWFONT HIGHERFONT 0 0 INITCAP NODIGIT 0 1 1 0 0 0 0 0 7 4 . 1 10 0 0 0 0 1 <annex>
This box this T Th Thi This BLOCKSTART PAGEIN NEWFONT LOWERFONT 0 0 INITCAP NODIGIT 0 0 1 0 0 0 0 0 7 5 -.,. 4 9 0 0 0 0 1 <annex>
vel rhoncus vel v ve vel vel BLOCKIN PAGEIN SAMEFONT SAMEFONTSIZE 0 0 NOCAPS NODIGIT 0 0 0 0 0 0 0 0 7 5 .. 2 9 0 0 0 0 1 <annex>
faucibus. Vivamus faucibus. f fa fau fauc BLOCKIN PAGEIN SAMEFONT SAMEFONTSIZE 0 0 NOCAPS NODIGIT 0 0 0 0 0 0 0 0 7 5 ..,, 4 10 0 0 0 0 1 <annex>
odio purus odio o od odi odio BLOCKIN PAGEIN SAMEFONT SAMEFONTSIZE 0 0 NOCAPS NODIGIT 0 0 0 0 0 0 0 0 7 5 ,. 2 9 0 0 0 0 1 <annex>
hendrerit. Praesent hendrerit. h he hen hend BLOCKIN PAGEIN SAMEFONT SAMEFONTSIZE 0 0 NOCAPS NODIGIT 0 0 0 0 0 0 0 0 7 5 ...- 4 10 0 0 0 0 1 <annex>
quam lobortis quam q qu qua quam BLOCKIN PAGEIN SAMEFONT SAMEFONTSIZE 0 0 NOCAPS NODIGIT 0 0 0 0 0 0 0 0 7 5 ,., 3 9 0 0 0 0 1 <annex>
commodo, eros commodo, c co com comm BLOCKIN PAGEIN SAMEFONT SAMEFONTSIZE 0 0 NOCAPS NODIGIT 0 0 0 0 0 0 0 0 7 5 ,,. 3 9 0 0 0 0 1 <annex>
efficitur tincidunt. efficitur e ef eff effi BLOCKIN PAGEIN SAMEFONT SAMEFONTSIZE 0 0 NOCAPS NODIGIT 0 0 0 0 0 0 0 0 7 5 .,,,- 5 9 0 0 0 0 1 <annex>
pat velit pat p pa pat pat BLOCKIN PAGEIN SAMEFONT SAMEFONTSIZE 0 0 NOCAPS NODIGIT 0 1 1 0 0 0 0 0 7 5 ..- 3 9 0 0 0 0 1 <annex>
lentesque, ipsum lentesque, l le len lent BLOCKIN PAGEIN SAMEFONT SAMEFONTSIZE 0 0 NOCAPS NODIGIT 0 0 0 0 0 0 0 0 7 5 ,,,. 4 9 0 0 0 0 1 <annex>
Vestibulum sit vestibulum V Ve Ves Vest BLOCKIN PAGEIN SAMEFONT SAMEFONTSIZE 0 0 INITCAP NODIGIT 0 0 0 0 0 0 0 0 7 5 . 1 10 0 0 0 0 1 <annex>
urna lobortis, urna u ur urn urna BLOCKEND PAGEIN SAMEFONT SAMEFONTSIZE 0 0 NOCAPS NODIGIT 0 0 0 0 0 0 0 0 7 5 ,;.,.,. 7 8 0 0 0 0 1 <annex>
Box 2-Figure box B Bo Box Box BLOCKSTART PAGEIN NEWFONT LOWERFONT 0 0 INITCAP NODIGIT 0 1 1 0 0 0 0 0 7 9 -. 2 10 0 0 0 0 1 <annex>
DOI: DOI: doi: D DO DOI DOI: BLOCKEND PAGEIN SAMEFONT SAMEFONTSIZE 0 0 ALLCAP NODIGIT 0 0 0 0 0 0 0 0 7 9 : 1 1 0 0 0 0 1 <annex>
Donec rhoncus donec D Do Don Done BLOCKSTART PAGEIN NEWFONT HIGHERFONT 0 0 INITCAP NODIGIT 0 0 0 0 0 0 0 0 7 9 ..- 3 10 0 0 0 0 1 <annex>
tesque habitant tesque t te tes tesq BLOCKIN PAGEIN SAMEFONT SAMEFONTSIZE 0 0 NOCAPS NODIGIT 0 0 0 0 0 0 0 0 7 9 . 1 9 0 0 0 0 1 <annex>
nunc id nunc n nu nun nunc BLOCKIN PAGEIN SAMEFONT SAMEFONTSIZE 0 0 NOCAPS NODIGIT 0 0 0 0 0 0 0 0 7 9 .,. 3 9 0 0 0 0 1 <annex>
fermentum arcu. fermentum f fe fer ferm BLOCKIN PAGEIN SAMEFONT SAMEFONTSIZE 0 0 NOCAPS NODIGIT 0 0 0 0 0 0 0 0 7 9 ... 3 9 0 0 0 0 1 <annex>
imperdiet. Vestibulum imperdiet. i im imp impe BLOCKEND PAGEIN SAMEFONT SAMEFONTSIZE 0 0 NOCAPS NODIGIT 0 0 0 0 0 0 0 0 7 9 .,.. 4 9 0 0 0 0 1 <annex>
DOI: https://doi.org/10.7554/eLife.00666.002 doi: D DO DOI DOI: BLOCKSTART PAGEIN NEWFONT LOWERFONT 0 0 ALLCAP NODIGIT 0 0 0 0 0 0 0 0 7 11 :://././.. 10 10 0 0 0 0 1 I-<header>
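A dump like the one above is easier to read once it is reduced to its labels. Below is a minimal sketch that does this, assuming only what is visible in the lines themselves: fields are whitespace separated, the first field is the first token of the line, the last field is the predicted label, and an "I-" prefix marks the start of a new segment. The feature columns in between are not interpreted, and the function names are mine, not Grobid's.

```python
from collections import Counter


def parse_label(line):
    """Return (first_token, label, starts_segment) for one feature line."""
    tokens = line.split()
    label = tokens[-1]
    starts_segment = label.startswith("I-")
    if starts_segment:
        label = label[2:]
    return tokens[0], label.strip("<>"), starts_segment


def label_counts(lines):
    """Count how many lines were assigned to each label."""
    return Counter(parse_label(l)[1] for l in lines if l.strip())


def label_spans(lines):
    """Group consecutive lines into (label, first_token, n_lines) spans."""
    spans = []
    for line in lines:
        if not line.strip():
            continue
        first, label, starts_segment = parse_label(line)
        if spans and not starts_segment and spans[-1][0] == label:
            prev = spans[-1]
            spans[-1] = (label, prev[1], prev[2] + 1)
        else:
            spans.append((label, first, 1))
    return spans


# Four lines copied from the dump above.
SAMPLE = [
    "their contributions. their t th the thei BLOCKEND PAGEIN SAMEFONT SAMEFONTSIZE 0 0 NOCAPS NODIGIT 0 0 1 0 0 0 0 0 7 1 . 1 4 0 0 0 0 1 <acknowledgement>",
    "Box 1. box B Bo Box Box BLOCKSTART PAGEIN NEWFONT HIGHERFONT 0 0 INITCAP NODIGIT 0 1 1 0 0 0 0 0 7 2 . 1 10 0 0 0 0 1 I-<annex>",
    "box box box b bo box box BLOCKEND PAGEIN SAMEFONT SAMEFONTSIZE 0 0 NOCAPS NODIGIT 0 1 1 0 0 0 0 0 7 2 no 0 1 0 0 0 0 1 <annex>",
    "DOI: https://doi.org/10.7554/eLife.00666.002 doi: D DO DOI DOI: BLOCKSTART PAGEIN NEWFONT LOWERFONT 0 0 ALLCAP NODIGIT 0 0 0 0 0 0 0 0 7 11 :://././.. 10 10 0 0 0 0 1 I-<header>",
]

if __name__ == "__main__":
    for label, first, n in label_spans(SAMPLE):
        print(f"{label:16} starts at {first!r:8} ({n} lines)")
    print(label_counts(SAMPLE))
```

Run over the full dump, this makes it obvious where the annex span begins ("Box 1.") and how many body-like lines it swallows before the next label starts.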
I am seeing the same issue in 2024 too.
Hi,
Grobid appears to be quite powerful.
eLife Sciences may have some kryptonite in the form of a test PDF file with various formatting options: https://github.com/elifesciences/XML-mapping/blob/master/elife-00666.pdf
When I let Grobid loose on it, I ran into a paragraph detection issue and, probably as a consequence, missing text. If you search for, say, 'Example of a small', you will see it on page 12. The text from page 10 correctly skips page 11, which only has an image, but then continues with the header of the box, i.e. "...eLife content is delivered to more repositories and it can be Box 1. Example of a small box..." It should have continued with the right column. It then continues with the following page, 'We need to allow authors...' (omitting the sub header). The content of Box 1 and the text of the right column don't seem to appear anywhere.
This may be a particularly difficult PDF and there may be other issues. Is that likely something that would get fixed as part of Grobid, or should it be addressed outside as a pre-processing step? And if so, what would be the best way to do some pre-processing to, say, tell Grobid what the paragraphs are and let it do the annotation of the content?
Thank you