pypdfium2-team / pypdfium2

Python bindings to PDFium
https://pypdfium2.readthedocs.io/

Extracting text with special characters #306

Closed svaaraniemi closed 5 months ago

svaaraniemi commented 5 months ago

Description

I have a PDF containing an alpha character that is extracted as an ASCII "a" when I call get_text_bounded or get_text_range. The Chrome PDF renderer shows it as an alpha character.
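To check which character an extraction call actually returned, it can help to print code points instead of comparing glyphs by eye. The following is a minimal sketch using only the standard library. `describe_chars` is a hypothetical helper, and the sample string assumes the expected character is U+03B1 (GREEK SMALL LETTER ALPHA), which the report does not state explicitly.

```python
import unicodedata


def describe_chars(text):
    """Return (character, code point, Unicode name) for each character in text."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# Assumed sample: the extracted "a" next to the expected alpha.
for ch, code_point, name in describe_chars("a\u03b1"):
    print(ch, code_point, name)
```

In practice the argument would be the string returned by get_text_bounded or get_text_range.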

I believe this has to do with embedded fonts, because my locally built pdfium_test also shows it as an ASCII "a" by default. But if I run it with --no-system-fonts like so:

./out/Debug/pdfium_test --no-system-fonts --png ./page-with-alpha.pdf

then the character is rendered correctly as an alpha in the PNG image.

I'm posting this issue here in the pypdfium2 project because I wonder whether there is a way to expose --no-system-fonts in pypdfium2, or whether there is another way to get special characters extracted correctly with pypdfium2. I attached the PDF page and two PNG files generated with pdfium_test, one with --no-system-fonts and the other without the argument.

Attachments: page-with-alpha.pdf, pdfium-render default, pdfium-render no-system-fonts

Thanks, Sami

Install Info

pypdfium2 4.26.0                                                                                                    
pdfium 122.0.6233.0 at /home/sami/miniconda3/envs/python3.10/lib/python3.10/site-packages/pypdfium2_raw/libpdfium.so
/home/sami/miniconda3/envs/python3.10/bin/python                                                                    
Python 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]                                                           
Linux-6.5.0-25-generic-x86_64-with-glibc2.35                                                                        
WARNING: Package(s) not found: pypdfium2_raw                                                                        
WARNING: Package(s) not found: pypdfium2_helpers                                                                    
# packages in environment at /home/sami/miniconda3/envs/python3.10:                                                 
#                                                                                                                   
# Name                    Version                   Build  Channel                                                  
pypdfium2                 4.26.0                   pypi_0    pypi                                                   
==> /home/sami/.condarc <==                                                                                         
channels:                                                                                                           
  - microsoft                                                                                                       
  - conda-forge                                                                                                     
  - defaults                                                                                                        

==> cmd_line <==                                                                                                    
debug: False                                                                                                        
json: False

mara004 commented 5 months ago

Edit: My understanding was probably incorrect, and this is a different issue.

Thanks for the report, particularly the investigation with pdfium_test. There have been various similar issues in the past, e.g. https://github.com/pypdfium2-team/pypdfium2/issues/304 or https://github.com/pypdfium2-team/pypdfium2/discussions/288.

The problem, I think, is missing system fonts. If you pass --no-system-fonts to pdfium_test, I believe it uses its own (and probably more comprehensive) font corpus and disregards system fonts. So you'd have to identify which font is missing on the system and install it.

I have wondered before whether we should ship some fonts and feed them to pdfium by calling the relevant APIs. But then the question is: which fonts, and what impact would that have on package size, not to mention licensing? Since you say it works in Chrome (Chromium?), I'd be curious which fonts they bundle.

Another question is whether this is a task for pypdfium2 itself, or maybe rather a complementary package to be provided by a third party. Personally, I'm -1 on including fonts in pypdfium2.

mara004 commented 5 months ago

Hmm, I just browsed the pdfium_test sources, and maybe I was on the wrong track here. Missing system fonts are a common problem, but this might be a different issue.

The option --no-system-fonts seems to affect m_pUserFontPaths differently than I thought. IIUC, --no-system-fonts leads to m_pUserFontPaths = [None, None] in the pdfium config, whereas the default is m_pUserFontPaths = None. So it seems it really does just disable system fonts.
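The difference between the two settings can be illustrated with plain ctypes: None becomes a NULL pointer for the field itself, while [None, None] becomes a valid pointer to an array whose entries are NULL. The sketch below uses a hypothetical stand-in struct, `LibraryConfig`, that only mirrors the fields named in this thread. It is not the real binding and does not call pdfium. My reading of pdfium's public header is that the array is NULL-terminated and that a NULL array selects the default paths, but treat that as an assumption.

```python
from ctypes import POINTER, Structure, c_char, c_int, c_uint, c_void_p


class LibraryConfig(Structure):
    # Hypothetical stand-in for FPDF_LIBRARY_CONFIG (version 2 fields only).
    _fields_ = [
        ("version", c_int),
        ("m_pUserFontPaths", POINTER(POINTER(c_char))),
        ("m_pIsolate", c_void_p),
        ("m_v8EmbedderSlot", c_uint),
    ]


# Default: the field itself is a NULL pointer.
default_cfg = LibraryConfig(version=2, m_pUserFontPaths=None)

# --no-system-fonts equivalent: a non-NULL array that contains only NULL entries.
empty_paths = (POINTER(c_char) * 2)(None, None)
patched_cfg = LibraryConfig(version=2, m_pUserFontPaths=empty_paths)

# ctypes pointers are falsy when NULL.
print("default is NULL:", not default_cfg.m_pUserFontPaths)
print("patched is NULL:", not patched_cfg.m_pUserFontPaths)
print("patched first entry is NULL:", not patched_cfg.m_pUserFontPaths[0])
```

Note that `empty_paths` has to stay referenced for as long as the config is in use, since the struct only stores its address.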

Then the above behavior seems weird to me.

mara004 commented 5 months ago

As to the pdfium config, you can import pypdfium2_raw, which has no auto-init, and set up the config on your own, but then you're limited to raw APIs. The main pypdfium2 always auto-inits and there isn't currently a way to customize that, unfortunately.

But maybe you can call FPDF_DestroyLibrary() and re-initialize with your own config.
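An untested sketch of that destroy-and-reinitialize idea follows. `reinit_without_system_fonts` is a hypothetical helper that takes the raw bindings module as an argument. It assumes that no pdfium objects (documents, pages) exist when it is called and that the bindings expose FPDF_DestroyLibrary, FPDF_LIBRARY_CONFIG, and FPDF_InitLibraryWithConfig under those names. Whether re-initializing this way is safe alongside pypdfium2's own init and exit handling has not been verified here.

```python
from ctypes import POINTER, c_char

# Kept at module level so the array outlives the call;
# the library may hold on to the pointer after initialization.
NO_FONT_PATHS = (POINTER(c_char) * 2)(None, None)


def reinit_without_system_fonts(pdfium_c):
    """Destroy the library and initialize it again with an empty font path list.

    pdfium_c is expected to be the raw bindings module.
    Assumption: nothing created by pdfium is still alive at this point.
    """
    pdfium_c.FPDF_DestroyLibrary()
    config = pdfium_c.FPDF_LIBRARY_CONFIG(
        version=2,
        m_pUserFontPaths=NO_FONT_PATHS,
        m_pIsolate=None,
        m_v8EmbedderSlot=0,
    )
    pdfium_c.FPDF_InitLibraryWithConfig(config)
    return config
```

A call would look like `reinit_without_system_fonts(pdfium_c)` after `import pypdfium2.raw as pdfium_c`, before any document is opened.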

mara004 commented 5 months ago

OK, confirmed: I can reproduce your pdfium_test findings with pypdfium2. By default, pypdfium2 renders an "a". But when patching library init as follows, it renders an alpha character:

diff --git a/src/pypdfium2/_library_scope.py b/src/pypdfium2/_library_scope.py
index d66daf21..15e61ea9 100644
--- a/src/pypdfium2/_library_scope.py
+++ b/src/pypdfium2/_library_scope.py
@@ -3,9 +3,11 @@

 import atexit
 import os, sys
+from ctypes import POINTER, c_char
 import pypdfium2.raw as pdfium_c
 import pypdfium2.internal as pdfium_i

+FONTS = (POINTER(c_char) * 2)(*[None, None])

 def init_lib():
     assert not pdfium_i.LIBRARY_AVAILABLE
@@ -16,7 +18,7 @@ def init_lib():
     # NOTE Technically, FPDF_InitLibrary() would be sufficient for our purposes, but since it's formally marked for deprecation, don't use it to be on the safe side. Also, avoid experimental config versions that might not be promoted to stable.
     config = pdfium_c.FPDF_LIBRARY_CONFIG(
         version = 2,
-        m_pUserFontPaths = None,
+        m_pUserFontPaths = FONTS,
         m_pIsolate = None,
         m_v8EmbedderSlot = 0,
         # m_pPlatform = None,  # v3

mara004 commented 5 months ago

However, disabling system font search across the board does not sound like a solution. I'd suggest filing a pdfium bug for this.

svaaraniemi commented 5 months ago

Thanks - I was thinking of the same "solution" of disabling system fonts, and I'm coming to the same conclusion that it's probably not a good idea.

I'll dig into it a bit more but I'll go ahead and file this with pdfium unless I discover something new.

I'm closing this ticket as I think your investigation is conclusive.

Thanks again!

mara004 commented 5 months ago

Thank you, and apologies for having misunderstood the issue initially. It just astonished me that system fonts can actually harm the output. I thought pdfium would always prioritize embedded fonts, but never mind.

If you do the bug report, feel free to post the pdfium ticket ID here, then I'll keep an eye on it. Thanks!