Parse PDF w/o images? - Githubissues

mikey21211 commented 6 years ago

Is it possible to use the TEI and parse the full text but ignore the figures/images within the PDF? Thanks

kermitt2 commented 6 years ago

Hello @mikey21211 Do you have a use case that would help me understand what exactly you want to achieve or avoid?
I probably did not understand your goal, but you can always select the content of interest in the resulting XML/TEI and ignore some elements during the XML parsing... When processing a PDF, it makes sense to consider the possibility of figures, in order to distinguish those elements/blocks from the rest of the content body.
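
A minimal sketch of the "ignore some elements during the XML parsing" idea, assuming the TEI output is already available as a string and that figures appear as `<figure>` elements in the TEI namespace. The function name `body_text_without_figures` is hypothetical and not part of GROBID.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"  # standard TEI namespace


def body_text_without_figures(tei_xml: str) -> str:
    """Return the body paragraphs of a TEI document, skipping <figure> blocks.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(tei_xml)
    body = root.find(f".//{{{TEI}}}body")
    if body is None:
        return ""

    # Drop every <figure> element (and everything nested inside it).
    # Note: Element.remove() also discards any tail text after the figure.
    for parent in list(body.iter()):
        for child in list(parent):
            if child.tag == f"{{{TEI}}}figure":
                parent.remove(child)

    # Collect the remaining paragraphs with whitespace normalized.
    paragraphs = []
    for p in body.iter(f"{{{TEI}}}p"):
        text = " ".join("".join(p.itertext()).split())
        if text:
            paragraphs.append(text)
    return "\n\n".join(paragraphs)


if __name__ == "__main__":
    sample = (
        f'<TEI xmlns="{TEI}"><text><body><div>'
        "<head>Introduction</head>"
        "<p>First paragraph.</p>"
        "<figure><head>Figure 1</head><figDesc>A caption.</figDesc></figure>"
        "<p>Second paragraph.</p>"
        "</div></body></text></TEI>"
    )
    print(body_text_without_figures(sample))
```

This filters after the fact: the PDF is still processed with figures taken into account, and only the extracted text leaves them out.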

mikey21211 commented 6 years ago

What you're saying makes sense, and I think the figures might be throwing off my results. Is there a way to ignore the figures? Or rather, to strip the images out of my PDF before parsing?

nemobis commented 6 years ago

Be careful what you wish for. :) You could first of all use pdfimages to see what "all images" means for your PDF. It might be more than you think.
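
A small sketch of that check, assuming the Poppler version of `pdfimages` is installed and on the PATH; its `-list` option prints one row per embedded image instead of extracting them. The function names and the file name `paper.pdf` are placeholders.

```python
import shutil
import subprocess
from typing import List


def pdfimages_list_command(pdf_path: str) -> List[str]:
    """Build the command line that lists the images embedded in a PDF."""
    return ["pdfimages", "-list", pdf_path]


def list_pdf_images(pdf_path: str) -> str:
    """Run `pdfimages -list` and return its raw table output."""
    if shutil.which("pdfimages") is None:
        raise RuntimeError("pdfimages not found on PATH (it ships with Poppler)")
    result = subprocess.run(
        pdfimages_list_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pdfimages fails
    )
    return result.stdout


# Example (requires pdfimages and a real file):
# print(list_pdf_images("paper.pdf"))
```

The listing often includes masks, logos, and small inline bitmaps in addition to the figures a reader would recognize as images.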

kermitt2 / grobid

Parse PDF w/o images? #288