izderadicka / pdfparser

Python binding to libpoppler with focus on text extraction

Uncatched process warnings and output #15

Closed jurrian closed 6 years ago

jurrian commented 6 years ago

Some PDFs seem to be (partly) unprocessable due to unknown symbols etc. Often this yields a lot of errors and warnings like the following: Syntax Warning: Invalid Font Weight

It seems this comes from poppler, but I am not entirely sure. I tried catching it as an exception, but it is not an exception; the warning is probably written to stderr or stdout. Somehow it ends up in my logging, with no apparent way to filter it out.

I also tried wrapping the function call with the following:

import StringIO
import sys
s = StringIO.StringIO()  # in-memory buffer meant to swallow the output
sys.stderr = s
sys.stdout = s

To me it seems that somewhere in this library the warnings are either generated or not caught. It would be good to keep these things out of logging.

izderadicka commented 6 years ago

Do you have some sample documents with the issue? What about the text output: is it OK, or is something missing or garbled? The message looks like it comes from poppler.

Re: "warnings are either generated or not caught": it is not clear what you mean. If it were an uncaught exception, it would bring down the program; otherwise it is some logging. As there is no logging or printing in the Python wrapper, it must come from the poppler library.

jurrian commented 6 years ago

When I use pdftotext this file gives the warning nine times, so it definitely comes from Poppler: example.pdf output.txt

The output seems to be relatively okay, given that the quality of the pdf is not too good. I found out that if I select the image on top some large characters appear (Uv Q. < < ^m^), it might be related to the warning.

I think the Python wrapper is supposed to handle stdout and stderr output from the Poppler library, since there seems to be no way to catch it in my program. My Python logger configuration catches all print statements and exceptions, except for these warnings. I am not sure how catching these warnings in the wrapper should work, or if that is even possible, but if it is, it would be really appreciated if they were passed on properly to the Python logger.
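For illustration, one way to pass such output on to the Python logger without changing the wrapper is to capture file descriptor 2 itself, since a C library writes there directly and never goes through `sys.stderr`. The following is a minimal sketch, assuming a POSIX-like system and Python 3; `stderr_to_logger` is a hypothetical helper name, not part of pdfparser or poppler.

```python
import contextlib
import logging
import os
import sys
import tempfile


@contextlib.contextmanager
def stderr_to_logger(logger, level=logging.WARNING):
    """Capture everything written to file descriptor 2 and log it line by line."""
    sys.stderr.flush()
    saved_fd = os.dup(2)  # keep a handle on the real stderr
    with tempfile.TemporaryFile(mode="w+b") as capture:
        os.dup2(capture.fileno(), 2)  # fd 2 now writes into the temp file
        try:
            yield
        finally:
            sys.stderr.flush()
            os.dup2(saved_fd, 2)  # restore the real stderr first
            os.close(saved_fd)
            capture.seek(0)
            text = capture.read().decode("utf-8", "replace")
            for line in text.splitlines():
                if line.strip():
                    logger.log(level, line)
```

The code that opens and processes the PDF would go inside `with stderr_to_logger(logging.getLogger("poppler")):`. The messages are only logged once the block exits, so they arrive in a batch rather than as they are produced.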

izderadicka commented 6 years ago

The messages are from libpoppler and they are written to stderr. From my perspective it's not an issue: you can redirect stderr to /dev/null to ignore them. I guess libpoppler might have some options to suppress them, if you want to dig into it?
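Redirecting stderr to /dev/null can also be done from inside the Python process. Replacing `sys.stderr` has no effect on a C library, so the sketch below redirects file descriptor 2 at the OS level instead. It assumes a POSIX-like system; `suppress_fd_stderr` is a hypothetical helper name, not something provided by pdfparser.

```python
import contextlib
import os
import sys


@contextlib.contextmanager
def suppress_fd_stderr():
    """Temporarily point file descriptor 2 at os.devnull."""
    sys.stderr.flush()
    saved_fd = os.dup(2)  # remember where stderr pointed
    devnull_fd = os.open(os.devnull, os.O_WRONLY)
    try:
        os.dup2(devnull_fd, 2)  # C-level writes to stderr are now discarded
        yield
    finally:
        os.dup2(saved_fd, 2)  # restore the original stderr
        os.close(devnull_fd)
        os.close(saved_fd)
```

Anything written to stderr inside `with suppress_fd_stderr():` is dropped, including Python's own tracebacks and warnings, so the block is best kept as small as possible.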

jurrian commented 6 years ago

I found out that the -q (quiet) flag can be set through setErrQuiet(GBool errQuietA) in GlobalParams.h. Do you think it would be possible to add an extra argument to Document() that sets this flag? It should be a small addition, right?