pkp / ots

PKP XML Parsing Service

GNU General Public License v3.0

32 stars 19 forks

Fix PDF image extraction + predictable image paths #85

Closed axfelix closed 6 years ago

axfelix commented 8 years ago

PDF image extraction doesn't work at the moment -- I'm not sure if Grobid or Cermine implement it at all, but PDF images aren't extracted from most documents in our corpus. We need to add a call to pdfimages (from xpdf/poppler) for documents that are uploaded as PDF and not passed through meTypeset. We should make sure the output location for these matches that of meTypeset (currently /var/documents/$user/$job/metypeset/media/image#.png) so that these image URLs are predictable for Substance.
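A minimal sketch of what "predictable" could mean here, assuming the `#` in `image#.png` is a plain integer index. The helper names `media_dir` and `image_path` are hypothetical and not part of the existing code base:

```python
from pathlib import PurePosixPath

# Root taken from the path quoted above; everything else is illustrative.
DOCUMENT_ROOT = PurePosixPath("/var/documents")


def media_dir(user, job):
    """Return the meTypeset-style media directory for one user's job."""
    return DOCUMENT_ROOT / str(user) / str(job) / "metypeset" / "media"


def image_path(user, job, index):
    """Return the expected location of the index-th image.

    Assumption: '#' in 'image#.png' stands for an integer index.
    """
    return media_dir(user, job) / f"image{index}.png"
```

If every converter wrote into the directory returned by `media_dir`, a client such as Substance could derive image URLs from the user and job identifiers alone.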

axfelix commented 8 years ago

@kaschioudi - no hurry on this unless you're blocked on other issues.

axfelix commented 7 years ago

OK, so, pdfimages generally works for this on some test documents, using the syntax pdfimages -j file.pdf image, where "image" is the output file name prefix and file.pdf is the input. Some of the output is in PPM rather than JPEG format, but ImageMagick can fix that easily, e.g.: for x in *.ppm; do magick "$x" "${x%.ppm}.jpg"; done; rm *.ppm
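The extract-then-convert steps described here could be wrapped roughly as follows. This is only a sketch: it assumes `pdfimages` and ImageMagick's `magick` are on the PATH, and the function names are hypothetical rather than taken from the pdfimages branch:

```python
import subprocess
from pathlib import Path


def pdfimages_command(pdf_path, prefix):
    """Build the pdfimages call; -j writes embedded JPEG streams as .jpg files."""
    return ["pdfimages", "-j", str(pdf_path), str(prefix)]


def jpeg_target(ppm_path):
    """Map image-000.ppm to image-000.jpg in the same directory."""
    return Path(ppm_path).with_suffix(".jpg")


def extract_images(pdf_path, out_dir, prefix="image"):
    """Extract images from a PDF and convert any PPM output to JPEG.

    Assumption: both external tools are installed; check=True raises
    CalledProcessError if either of them fails.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdfimages_command(pdf_path, out_dir / prefix), check=True)

    for ppm in sorted(out_dir.glob("*.ppm")):
        subprocess.run(["magick", str(ppm), str(jpeg_target(ppm))], check=True)
        ppm.unlink()  # drop the PPM only after a successful conversion

    return sorted(out_dir.glob(f"{prefix}*.jpg"))
```

Unlike the shell one-liner, this deletes each PPM file only after its conversion succeeded, so a failed `magick` call does not lose the extracted image.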

Shouldn't be too hard to implement this in its own module, then add a call to the merge module that adds image elements to the end of the Body text for any PDF that's missing them?
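A rough sketch of that merge step, assuming the merged document is JATS-like XML with a `<body>` element and that images are referenced through `<graphic xlink:href="...">`. Both the element names and the function `append_missing_graphics` are assumptions for illustration, not the existing Merge module's API:

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"
ET.register_namespace("xlink", XLINK)
HREF = f"{{{XLINK}}}href"


def append_missing_graphics(xml_text, image_hrefs):
    """Append a <graphic> to the end of <body> for each unreferenced image."""
    root = ET.fromstring(xml_text)
    body = root.find(".//body")
    if body is None:
        raise ValueError("document has no <body> element")

    # hrefs that an existing <graphic> already points at
    already = {g.get(HREF) for g in root.iter("graphic")}

    for href in image_hrefs:
        if href in already:
            continue  # image is already present, do not duplicate it
        ET.SubElement(body, "graphic", {HREF: href})
        already.add(href)

    return ET.tostring(root, encoding="unicode")
```

Because already-referenced images are skipped, running the step twice on the same document should leave it unchanged the second time.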

axfelix commented 7 years ago

I've mostly implemented this on the pdfimages branch. I still need a) a general cleanup and sanity check of the code, and b) a good way of passing the list of extracted images from the Extraction module to the Merge module that re-adds them.

axfelix commented 7 years ago

Removed the branch because this was implemented upstream in Cermine https://github.com/CeON/CERMINE/issues/34#issuecomment-270446925 -- now we just need to make sure that Cermine output images are moved to the same path as meTypeset's. Currently Cermine output images are in a folder called documentname.images and meTypeset's are in metypeset/media.
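One way the move could look, as a sketch only. It assumes the `documentname.images` folder sits directly inside the job directory next to `metypeset/`, which this thread does not confirm, and `harmonize_cermine_images` is a hypothetical name:

```python
import shutil
from pathlib import Path


def harmonize_cermine_images(job_dir, document_name):
    """Move Cermine's images into metypeset/media and report what moved.

    Returns a mapping of old relative path -> new relative path so that
    links in the XML output can be rewritten afterwards.
    """
    job_dir = Path(job_dir)
    source = job_dir / f"{document_name}.images"  # assumed location
    target = job_dir / "metypeset" / "media"
    moved = {}

    if not source.is_dir():
        return moved  # nothing was extracted for this document

    target.mkdir(parents=True, exist_ok=True)
    for image in sorted(source.iterdir()):
        if not image.is_file():
            continue
        destination = target / image.name
        if destination.exists():
            # Never silently clobber an image meTypeset already wrote.
            raise FileExistsError(f"refusing to overwrite {destination}")
        shutil.move(str(image), str(destination))
        moved[f"{source.name}/{image.name}"] = f"metypeset/media/{image.name}"

    return moved
```

File names are kept as Cermine wrote them; renaming them to the `image#.png` pattern would also require converting any non-PNG output, which this sketch does not attempt.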

axfelix commented 7 years ago

Looking into this...

We need to switch from using the PdfNLMContentExtractor class in Cermine to the ContentExtractor class (in https://github.com/pkp/xmlps/blob/master/module/Cermine/src/Cermine/Model/Converter/Cermine.php) to benefit from image extraction support.

However, trying to do this with an upstream Cermine build causes the converter to fail in our stack. Upstream Cermine builds still work fine when called with the legacy(?) PdfNLMContentExtractor class. From looking at our code, I initially thought this was because the old method produces only one output file and was therefore designed to be pipe-able, which is how we appear to be using it: https://github.com/pkp/xmlps/blob/master/module/Cermine/src/Cermine/Model/Converter/Cermine.php#L103. However, I can't reproduce this piping behaviour when running upstream builds of Cermine locally, which really confuses me, as I'm not sure how it continues to work in our pipeline...

@kaschioudi , if you have any ideas...

axfelix commented 7 years ago

Thanks! Going to review this. We should probably make sure meTypeset and Cermine images are output to the same path moving forward so that we don't need to think about different relative links in XML output.

axfelix commented 7 years ago

This is working and merged into master. Leaving this open until we've harmonized the image output paths, though.

axfelix commented 6 years ago

I believe this was fixed in the most recent round of Texture work through the meta file wrapper.