Closed quikssb closed 10 months ago
Ok got it now...
The green marking (which was added after the .pdf was created) caused this issue. You never stop learning. Sorry for opening this issue.
No problem. Was the file with the green marking corrupted, or is pdfium behaving incorrectly?
We read the .pdf in binary mode (not doing that leads to other exceptions):
Why don't you just pass the input filepath directly to the PdfDocument constructor, or does that not work?
Note that you cannot open a PDF from a string that is not a filepath. In-memory input must be binary (PDF is not a text format, so it cannot be validly decoded).
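The two input modes mentioned here can be sketched roughly as follows. This is a minimal illustration, assuming pypdfium2 is installed; the helper names read_pdf_bytes and open_both_ways are hypothetical, and the sample bytes only imitate a typical PDF header rather than coming from the file in this issue.

```python
def read_pdf_bytes(path):
    """Read a PDF as raw bytes; 'rb' mode avoids any text decoding."""
    with open(path, "rb") as fh:
        return fh.read()


def open_both_ways(path):
    """Open the same file via a filepath string and via in-memory bytes.

    Assumes pypdfium2 is installed. The import is local so that the
    helper above stays usable without it.
    """
    import pypdfium2 as pdfium

    from_path = pdfium.PdfDocument(str(path))              # filepath string
    from_bytes = pdfium.PdfDocument(read_pdf_bytes(path))  # binary data
    return from_path, from_bytes


# PDF files usually carry non-ASCII bytes right after the header line,
# which is why decoding the content as text tends to fail.
sample = b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n"
try:
    sample.decode("utf-8")
    decodes_as_text = True
except UnicodeDecodeError:
    decodes_as_text = False
```

If the path variant raises while the bytes variant works (or the other way round), that difference is the kind of detail worth reporting.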
It seems the green marking makes special characters such as ° or ± hard for the library to detect. At this point, though, I am not even sure whether the library is to blame.
When I tried to read the PDF file (stored locally on the hard drive) with a plain input path string rather than in binary mode, I got some exceptions while reading the file. I no longer remember the exact exceptions. Anyway, thanks for replying, it's all good now.
Yet I'd like to know what is going on here. All input modes are supposed to work, and FWIW they seem to in our test suite. If one input mode works but another doesn't, a traceback plus steps to reproduce would be valuable information for us. Are you able to change back to a string path (which has to be passed directly to PdfDocument(), without open()) and reproduce the exception with the file in question?
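One way to gather such a report is to try each input mode in turn and keep the traceback of whichever one fails. The sketch below is only an illustration: collect_tracebacks is a hypothetical helper, and "file.pdf" in the usage comment is a placeholder path.

```python
import traceback


def collect_tracebacks(opener, inputs):
    """Call opener(value) for each labelled input and record the outcome.

    Returns a dict mapping each label to None on success, or to the
    formatted traceback string if the call raised.
    """
    results = {}
    for label, value in inputs.items():
        try:
            opener(value)
            results[label] = None
        except Exception:
            results[label] = traceback.format_exc()
    return results


# Possible usage, assuming pypdfium2 is installed and "file.pdf" exists:
#
#   import pypdfium2 as pdfium
#   with open("file.pdf", "rb") as fh:
#       data = fh.read()
#   report = collect_tracebacks(
#       pdfium.PdfDocument,
#       {"str path": "file.pdf", "bytes": data},
#   )
#   for label, tb in report.items():
#       print(label, "ok" if tb is None else tb)
```

Any traceback text collected this way can be pasted into the issue together with the file that triggers it.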
As an aside, the original issue title reminds me much of https://github.com/pypdfium2-team/pypdfium2/discussions/288 (which incidentally appeared at almost the same time).
Inconsistent rendering or text extraction across different machines can hint at a missing system font, if the pdf in question is not embedding it.
Will try to get back to you tomorrow at work.
Package origin
pypdfium2 from PyPI or GitHub/pypdfium2-team.
Description
I am parsing text out of a pdf file, see:
It contains a special character, °. When parsing, however, I get the value 20~25ÅC instead of 20-25°C, so ° gets parsed as Å. Strangely enough, my colleague is using exactly the same code, and he gets the value parsed as intended.
We read the .pdf in binary mode (not doing that leads to other exceptions):
and extract the text in a usual way, like this:
self._pdf[0].get_textpage().get_text_range()
It has to be an environment-related issue. First I want to understand the problem, but the goal is to use code that is platform independent (independent of Python version, IDE, etc.).
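To compare what two machines actually extract, it can help to list the Unicode code points of the returned text rather than rely on how a terminal or IDE displays it. The following is a small sketch using only the standard library; describe_chars is a hypothetical helper, and the two sample characters are the ones reported above.

```python
import unicodedata


def describe_chars(text):
    """Return one 'U+XXXX NAME' entry per character in text."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in text
    ]


expected = describe_chars("°")  # the character in the document
observed = describe_chars("Å")  # the character that was extracted
```

Running describe_chars on the extracted string on both machines shows whether the extracted characters themselves differ or only their display.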
Do you have any ideas?
Install Info
Validity