pypdfium2-team / pypdfium2

Python bindings to PDFium
https://pypdfium2.readthedocs.io/

Parsing special characters leading to inconsistency among different machines #289

Closed quikssb closed 10 months ago

quikssb commented 10 months ago

Description

I am parsing text out of a PDF file, see: pdftemp

It contains the special character °. When parsing, however, I get the value 20~25ÅC instead of 20-25°C, so ° is parsed as Å. Strangely enough, my colleague uses exactly the same code and gets the value parsed as intended.

We read the .pdf in binary mode (not doing that leads to other exceptions):


    with open(
        "/my/path/mypdf.pdf",
        "rb"
    ) as f:
        b = f.read()

and extract the text in the usual way, like this: self._pdf[0].get_textpage().get_text_range()

It has to be an environment-related issue. First I want to understand the problem, but the goal is to use code that is platform independent (independent of Python version, IDE, etc.).

Do you have any ideas?

Install Info

pypdfium2 4.25.0
pdfium 121.0.6164.0 at /home/marcel/.local/lib/python3.10/site-packages/pypdfium2_raw/libpdfium.so
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
Linux-6.1.0-1027-oem-x86_64-with-glibc2.35
Name: pypdfium2
Version: 4.25.0
Summary: Python bindings to PDFium
Home-page: https://github.com/pypdfium2-team/pypdfium2
Author: pypdfium2-team
Author-email: geisserml@gmail.com
License: (Apache-2.0 OR BSD-3-Clause) AND LicenseRef-PdfiumThirdParty
Location: /home/marcel/.local/lib/python3.10/site-packages
Requires: 
Required-by:

quikssb commented 10 months ago

Ok got it now...

The green marking (which was added after the .pdf was created) caused this issue. You never stop learning. Sorry for opening this issue.

mara004 commented 10 months ago

No problem. Was the file with the green marking corrupted, or is pdfium behaving incorrectly?

> We read the .pdf in binary mode (not doing that leads to other exceptions):

Why don't you just pass the input filepath directly to the PdfDocument constructor -- or does that not work? Note, you cannot open a PDF from a non-filepath string. If inputting in-memory data, it must be binary (PDF programmatically is not a text format, so it cannot be validly decoded).
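For reference, the two input modes mentioned here can be sketched as follows. This is a minimal sketch, assuming pypdfium2 4.x is installed; `extract_first_page_text` is a hypothetical helper name and not part of the library.

```python
def extract_first_page_text(source):
    """Return the text of the first page.

    `source` is assumed to be either a file path (str or pathlib.Path)
    or the raw bytes of a PDF. Both are handed to PdfDocument unchanged.
    """
    # Imported inside the helper (an illustrative choice) so that the
    # function can be defined even where pypdfium2 is not installed.
    import pypdfium2 as pdfium

    pdf = pdfium.PdfDocument(source)
    page = pdf[0]
    textpage = page.get_textpage()
    try:
        return textpage.get_text_range()
    finally:
        # Close child objects before their parents.
        textpage.close()
        page.close()
        pdf.close()
```

With the placeholder path from the report, `extract_first_page_text("/my/path/mypdf.pdf")` passes the path straight through, while `extract_first_page_text(open("/my/path/mypdf.pdf", "rb").read())` passes in-memory binary data. If both calls return the same text, the input mode is unlikely to be the cause of a character mismatch.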

quikssb commented 10 months ago

> No problem. Was the file with the green marking corrupted, or is pdfium behaving incorrectly?
>
> > We read the .pdf in binary mode (not doing that leads to other exceptions):
>
> Why don't you just pass the input filepath directly to the PdfDocument constructor -- or does that not work? Note, you cannot open a PDF from a non-filepath string. If inputting in-memory data, it must be binary (PDF programmatically is not a text format, so it cannot be validly decoded).

It seems the green marking makes special characters such as ° or ± hard for the library to detect. At this point I am not even sure whether the library is to blame, though.

When trying to read the pdf file (which is locally stored on hard drive) with a usual input path string, but not in binary mode, I got some exceptions when reading out the pdf-file. I don't remember anymore the exact exceptions. Anyways thanks for replying, it's all good now.
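The exact exceptions are not known, but one plausible way a non-binary read can fail is easy to show with the standard library alone. The byte string below is a hypothetical stand-in for a real PDF (a header line followed by the kind of high-byte comment line PDF writers commonly emit), so this is an illustration under that assumption, not a reproduction of the original error.

```python
import os
import tempfile

# Hypothetical stand-in for a PDF file: the header plus a comment line
# containing bytes >= 0x80, which do not form valid UTF-8.
fake_pdf = b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.pdf")
    with open(path, "wb") as f:
        f.write(fake_pdf)

    # Binary mode returns the bytes unchanged.
    with open(path, "rb") as f:
        data = f.read()

    # Text mode tries to decode the bytes and fails on the binary marker.
    try:
        with open(path, "r", encoding="utf-8") as f:
            f.read()
        text_mode_error = None
    except UnicodeDecodeError as exc:
        text_mode_error = exc

print(type(text_mode_error).__name__)
```

A failure of this kind would come from decoding the file as text and would be unrelated to the character mismatch in the extracted text.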

mara004 commented 10 months ago

> When trying to read the pdf file (which is locally stored on hard drive) with a usual input path string, but not in binary mode, I got some exceptions when reading out the pdf-file. I don't remember anymore the exact exceptions. Anyways thanks for replying, it's all good now.

Yet I'd like to know what is at stake here. All input modes are supposed to work, and FWIW they seem to do in our test suite. If one input mode works but another doesn't, a traceback + steps to reproduce might be valuable information for us. Are you able to change back to a string path (that has to be passed directly to PdfDocument() without open()) and restore the exception with the file in question?

mara004 commented 10 months ago

As an aside, the original issue title reminds me much of https://github.com/pypdfium2-team/pypdfium2/discussions/288 (which incidentally appeared at almost the same time).

Inconsistent rendering or text extraction across different machines can hint at a missing system font, if the PDF in question does not embed it.
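When comparing output between two machines, it can help to look at code points rather than glyphs. A small standard-library sketch, using the two strings from the original report as sample data (`describe_char` and `diff_chars` are hypothetical helper names):

```python
import unicodedata


def describe_char(ch):
    """Return the character, its code point, and its Unicode name."""
    return (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))


def diff_chars(expected, observed):
    """List the positions where two strings differ, with both characters described.

    Positions beyond the shorter string are ignored.
    """
    return [
        (i, describe_char(a), describe_char(b))
        for i, (a, b) in enumerate(zip(expected, observed))
        if a != b
    ]


# Sample data from the report, written with escapes to avoid encoding ambiguity.
expected = "20-25\u00b0C"  # 20-25°C
observed = "20~25\u00c5C"  # 20~25ÅC

diffs = diff_chars(expected, observed)
for position, want, got in diffs:
    print(position, want, got)
```

Running the same comparison on each machine's extracted text shows exactly which code points differ, which is more useful in a bug report than a screenshot.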

quikssb commented 10 months ago

> > When trying to read the pdf file (which is locally stored on hard drive) with a usual input path string, but not in binary mode, I got some exceptions when reading out the pdf-file. I don't remember anymore the exact exceptions. Anyways thanks for replying, it's all good now.
>
> Yet I'd like to know what is at stake here. All input modes are supposed to work, and FWIW they seem to do in our test suite. If one input mode works but another doesn't, a traceback + steps to reproduce might be valuable information for us. Are you able to change back to a string path (that is passed directly to PdfDocument() without open()) and restore the exception with the file in question?

Will try to get back to you tomorrow at work.