Closed svaaraniemi closed 7 months ago
Edit: Probably my understanding was incorrect and this is a different issue
Thanks for the report, particularly the investigation with pdfium_test.
There have been various similar issues in the past, e.g. https://github.com/pypdfium2-team/pypdfium2/issues/304 or https://github.com/pypdfium2-team/pypdfium2/discussions/288.
The problem, I think, is missing system fonts. If you pass --no-system-fonts to pdfium_test, I believe it will use its own (and probably more comprehensive) font corpus, disregarding system fonts.
So you'd have to identify which font is missing in the system, and install it.
I already wondered if we should ship with some fonts, and feed them to pdfium by calling the relevant APIs. But then the question is: Which fonts, and what impact will that have on package size, not to mention licensing? As you say it works in Chrome (Chromium?), I'd be intrigued what fonts they are bundling?
Another question is whether this is a task for pypdfium2 itself, or maybe rather a complementary package to be provided by a third party. Personally, I'm -1 on including fonts in pypdfium2.
Hmm, I just browsed the pdfium_test sources, and maybe I was on the wrong track here. Missing system fonts is a common problem, but this might be a different issue.
The option --no-system-fonts seems to affect m_pUserFontPaths in a different way than I thought.
IIUC, --no-system-fonts would lead to m_pUserFontPaths = [None, None] in the pdfium config, whereas the default is m_pUserFontPaths = None. So it seems like it really just disables system fonts.
Then the above behavior seems weird to me.
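To make the difference between the two values concrete, here is a minimal ctypes sketch. It assumes, as pdfium's public header describes the field, that m_pUserFontPaths is a NULL-terminated array of C strings and that a NULL array pointer means "use the default paths". The helper count_paths is hypothetical and only exists for illustration; it is not part of pypdfium2 or pdfium.

```python
from ctypes import POINTER, c_char

# Case 1 (pypdfium2's default): a NULL array pointer.
# Assumption: pdfium then scans its default (system) font directories.
default_paths = None

# Case 2 (what --no-system-fonts appears to produce): a valid array whose
# first entry is already the NULL terminator, i.e. an empty list of paths.
no_system_fonts = (POINTER(c_char) * 2)(None, None)


def count_paths(array):
    """Count the entries that precede the NULL terminator (hypothetical helper)."""
    n = 0
    while n < len(array) and array[n]:
        n += 1
    return n
```

Under this reading, count_paths(no_system_fonts) is zero: the array exists, but it lists no directories, so there is nowhere to look for system fonts.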
As for the pdfium config, you can import pypdfium2_raw, which has no auto-init, and set up the config on your own, but then you're limited to raw APIs. The main pypdfium2 always auto-inits, and there currently isn't a way to customize that, unfortunately. But maybe you can call FPDF_DestroyLibrary() and re-initialize with your own config.
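A rough, untested sketch of that re-initialization idea might look like the following. The function name reinit_without_system_fonts is hypothetical, and the sketch assumes that no pdfium objects (documents, pages, text pages) are alive when it runs and that pypdfium2's own exit handling tolerates the swap. Neither assumption is confirmed here. The raw bindings module is passed in as a parameter, so the same code works with pypdfium2.raw or pypdfium2_raw.

```python
from ctypes import POINTER, c_char


def reinit_without_system_fonts(pdfium_c):
    """Destroy the pdfium library and initialize it again with an empty font path list.

    pdfium_c is expected to be the raw bindings module. Hypothetical helper,
    not a pypdfium2 API.
    """
    # Empty, NULL-terminated list of font directories.
    font_paths = (POINTER(c_char) * 2)(None, None)
    config = pdfium_c.FPDF_LIBRARY_CONFIG(
        version = 2,
        m_pUserFontPaths = font_paths,
        m_pIsolate = None,
        m_v8EmbedderSlot = 0,
    )
    pdfium_c.FPDF_DestroyLibrary()
    pdfium_c.FPDF_InitLibraryWithConfig(config)
    # Return both objects so the caller can keep them referenced, in case
    # pdfium retains the pointer to the font path array after init.
    return config, font_paths


# Possible usage (commented out, requires pypdfium2 to be installed):
# import pypdfium2.raw as pdfium_c
# keepalive = reinit_without_system_fonts(pdfium_c)
```

Keeping the returned objects in a long-lived variable guards against the array being garbage-collected while pdfium may still refer to it.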
OK, confirmed I can reproduce your pdfium_test findings with pypdfium2.
By default, pypdfium2 renders an "a". But when patching library init as follows, it renders an alpha char:
diff --git a/src/pypdfium2/_library_scope.py b/src/pypdfium2/_library_scope.py
index d66daf21..15e61ea9 100644
--- a/src/pypdfium2/_library_scope.py
+++ b/src/pypdfium2/_library_scope.py
@@ -3,9 +3,11 @@
 import atexit
 import os, sys
+from ctypes import POINTER, c_char
 import pypdfium2.raw as pdfium_c
 import pypdfium2.internal as pdfium_i
+FONTS = (POINTER(c_char) * 2)(*[None, None])
 def init_lib():
     assert not pdfium_i.LIBRARY_AVAILABLE
@@ -16,7 +18,7 @@ def init_lib():
     # NOTE Technically, FPDF_InitLibrary() would be sufficient for our purposes, but since it's formally marked for deprecation, don't use it to be on the safe side. Also, avoid experimental config versions that might not be promoted to stable.
     config = pdfium_c.FPDF_LIBRARY_CONFIG(
         version = 2,
-        m_pUserFontPaths = None,
+        m_pUserFontPaths = FONTS,
         m_pIsolate = None,
         m_v8EmbedderSlot = 0,
         # m_pPlatform = None, # v3
However, generally disabling system font search does not sound like a solution. I'd suggest filing a pdfium bug for this?
Thanks - I was thinking of the same "solution" of disabling system fonts, and I'm coming to the same conclusion that it's probably not a good idea.
I'll dig into it a bit more but I'll go ahead and file this with pdfium unless I discover something new.
I'm closing this ticket as I think your investigation is conclusive.
Thanks again!
Thank you, and apologies for having misunderstood the issue initially. It just astonished me that system fonts can actually harm the output. I thought pdfium would always prioritize embedded fonts, but never mind.
If you do the bug report, feel free to post the pdfium ticket ID here, then I'll keep an eye on it. Thanks!
Description
I have a PDF containing an alpha character that is extracted as an ASCII a character when I call get_text_bounded or get_text_range. The Chrome PDF renderer shows it as an alpha character.
I believe this has to do with embedded fonts because my locally built pdfium_test also shows it as ascii a by default. But if I run it with --no-system-fonts like so:
./out/Debug/pdfium_test --no-system-fonts --png ./page-with-alpha.pdf
then the character is rendered correctly as alpha into the png image.
I'm posting this issue here in the pypdfium2 project as I wonder if there is a way to expose the --no-system-fonts option in pypdfium2? Or maybe there is another way to get the special characters extracted correctly with pypdfium2? I attached the PDF page and two png files which were generated with pdfium_test, one with --no-system-fonts and the other without the argument. page-with-alpha.pdf
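For anyone wanting to check the extraction side, a small sketch along these lines could show which code points actually come back from the text page. The helpers describe_chars and extract_page_text are hypothetical names. The pypdfium2 import is done lazily inside the function, so the Unicode helper can be used on its own, and get_text_bounded() is called without arguments on the assumption that it then covers the whole page.

```python
import unicodedata


def describe_chars(text):
    """Return (char, code point, Unicode name) for each character in text."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


def extract_page_text(pdf_path, page_index=0):
    """Extract the text of one page with pypdfium2 (hypothetical helper)."""
    import pypdfium2 as pdfium  # imported lazily; requires pypdfium2

    pdf = pdfium.PdfDocument(pdf_path)
    page = pdf[page_index]
    textpage = page.get_textpage()
    return textpage.get_text_bounded()


# Possible usage (commented out, needs the attached PDF):
# for row in describe_chars(extract_page_text("page-with-alpha.pdf")):
#     print(row)
```

A Greek alpha would show up as U+03B1 (GREEK SMALL LETTER ALPHA), whereas the wrongly extracted character would be U+0061 (LATIN SMALL LETTER A).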
Thanks, Sami