Closed jurrian closed 6 years ago
Do you have some sample documents with the issue? What about the text output - is it OK, or is something missing or garbled? The message looks like it comes from Poppler.
Re: warnings being either generated or not caught - it's not clear what you mean. If it were an uncaught exception, it would bring down the program; otherwise it's some logging. As there is no logging or printing in the Python wrapper, it must come from the Poppler library.
When I use pdftotext, this file gives the warning nine times, so it definitely comes from Poppler:
example.pdf output.txt
The output seems to be relatively okay, given that the quality of the PDF is not too good. I found out that if I select the image at the top, some large characters appear (Uv Q. < < ^m^); this might be related to the warning.
I think the Python wrapper is supposed to handle stdout and stderr output from the Poppler library, since there seems to be no way to catch it in my program. My Python logger configuration catches all print statements and exceptions, except for these warnings. I am not sure how catching these warnings in the wrapper should work, or whether that's even possible, but if it is, it would be really appreciated if they were passed on properly to the Python logger.
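For reference, a rough sketch of how native stderr output could be forwarded to a Python logger. Poppler writes to file descriptor 2 directly, so replacing sys.stderr (or using contextlib.redirect_stderr) would not see it; the descriptor itself has to be swapped. The name capture_native_stderr is made up for illustration and is not part of the wrapper or of Poppler. It assumes a single-threaded caller, since the redirect affects the whole process.

```python
import logging
import os
import sys
import tempfile
from contextlib import contextmanager


@contextmanager
def capture_native_stderr(logger, level=logging.WARNING):
    """Hypothetical helper: forward everything written to fd 2 to `logger`."""
    sys.stderr.flush()                 # do not lose text buffered by Python
    saved_fd = os.dup(2)               # keep a handle on the real stderr
    with tempfile.TemporaryFile(mode="w+b") as tmp:
        os.dup2(tmp.fileno(), 2)       # fd 2 now points at the temp file
        try:
            yield
        finally:
            sys.stderr.flush()
            os.dup2(saved_fd, 2)       # restore the real stderr
            os.close(saved_fd)
            tmp.seek(0)
            text = tmp.read().decode("utf-8", errors="replace")
            for line in text.splitlines():
                if line.strip():
                    logger.log(level, line)
```

The call that triggers the Poppler warnings would go inside the with block, for example `with capture_native_stderr(logging.getLogger("poppler")): ...`. The messages are only logged once the block exits, not as they are produced.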
Messages are from libpoppler and they are on stderr - from my perspective it's not an issue - you can redirect stderr to /dev/null to ignore them. I guess libpoppler might have some options to suppress them - do you want to dig into it?
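If redirecting stderr for the whole process on the command line is too coarse, the same thing can be done from inside Python for just one call. This is only a sketch: silence_native_stderr is a made-up name, and it works at the file-descriptor level because output from a C/C++ library bypasses sys.stderr. Note that it also hides genuine errors printed during the block, and it is not thread-safe.

```python
import os
import sys
from contextlib import contextmanager


@contextmanager
def silence_native_stderr():
    """Hypothetical helper: send fd 2 to os.devnull for the duration of the block."""
    sys.stderr.flush()                       # flush anything Python has buffered
    saved_fd = os.dup(2)                     # remember the real stderr
    devnull_fd = os.open(os.devnull, os.O_WRONLY)
    try:
        os.dup2(devnull_fd, 2)               # fd 2 now discards all writes
        yield
    finally:
        sys.stderr.flush()
        os.dup2(saved_fd, 2)                 # put the real stderr back
        os.close(devnull_fd)
        os.close(saved_fd)
```

Wrapping only the parsing call in `with silence_native_stderr(): ...` keeps the rest of the program's stderr output intact.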
I found out that the -q or quiet flag can be set by setErrQuiet(GBool errQuietA) in GlobalParams.h. Do you think it would be possible to add an extra argument to Document() that sets this flag? Should be a small addition, right?
Some PDFs seem to be (partly) unprocessable due to unknown symbols and the like. Often this yields a lot of errors and warnings like the following:
Syntax Warning: Invalid Font Weight
It seems this comes from Poppler, but I am not entirely sure. I tried catching it as an exception, but it is not an exception; the warning is probably written to stderr or stdout. Somehow it ends up in my logging, with no apparent way to filter it out.
Also, I tried the following to wrap around the function call:
To me it seems that somewhere in this library the warnings are either generated or not caught. It would be good to keep these things out of logging.