Open zqkou opened 3 months ago
Does this only occur for `resolution=10`? Or also for higher/standard resolutions?
Thanks for looking into this issue. I set the resolution to 10 to keep the base64 string at a reasonable size, which makes it easier to compare.
We get the same issue at other resolutions.
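For comparing the two base64 strings without reading them by eye, a minimal sketch along these lines might help. The function names `image_fingerprint` and `same_rendering` are made up for illustration, and it assumes both strings encode the raw image bytes in the same format. Note that a byte-level mismatch only shows that the outputs differ, not where or why.

```python
import base64
import hashlib


def image_fingerprint(b64_image: str) -> str:
    """Return the SHA-256 hex digest of the bytes encoded in a base64 string."""
    raw = base64.b64decode(b64_image)
    return hashlib.sha256(raw).hexdigest()


def same_rendering(b64_a: str, b64_b: str) -> bool:
    """True if both base64 strings decode to exactly the same bytes."""
    return image_fingerprint(b64_a) == image_fingerprint(b64_b)
```

Logging the fingerprint on each platform gives a short value to compare instead of the full base64 dump.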
Thanks for the extra information. `pdfplumber` delegates image generation to `pypdfium2`, so the only control we have over the rendering is the keyword arguments we pass to `pdfium_page.render`:
I'm not sure whether (or not) `pypdfium2` makes any guarantees that this method returns the same values on all platforms.
Thanks for the information. I will follow up with a `pypdfium2` issue.
An issue has been created for `pypdfium2`: https://github.com/pypdfium2-team/pypdfium2/issues/304
PDFs may depend on system fonts if not embedded. Then for the PDF to render correctly it is important that these be installed. Sometimes fonts may be missing on Linux due to different defaults, or licensing issues.
In this case, it seems like the PDF is looking for a set of fonts named `FZ*--GB1-0`. `Noto Sans CJK` might work as a possible substitute - please check that.
FWIW, the PDF seems to render correctly on my Fedora 37 device.
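To check whether a candidate substitute is present on the Linux side, something like the following sketch could be run inside the container. It assumes fontconfig's `fc-list` tool is available there, and it only reports what fontconfig knows about. pdfium may locate fonts by its own means, so treat the result as a hint rather than proof. The helper names are hypothetical.

```python
from __future__ import annotations

import shutil
import subprocess


def matching_font_families(fc_list_output: str, needle: str) -> list[str]:
    """Return sorted unique family names in `fc-list : family` output that contain `needle`."""
    needle = needle.lower()
    families = set()
    for line in fc_list_output.splitlines():
        # One output line can hold several comma-separated names for the same font.
        for name in line.split(","):
            name = name.strip()
            if name and needle in name.lower():
                families.add(name)
    return sorted(families)


def installed_families(needle: str) -> list[str]:
    """Ask fontconfig for installed families; returns [] if fc-list is not on PATH."""
    if shutil.which("fc-list") is None:
        return []
    result = subprocess.run(
        ["fc-list", ":", "family"], capture_output=True, text=True, check=False
    )
    return matching_font_families(result.stdout, needle)
```

For example, `installed_families("Noto Sans CJK")` returning an empty list on the Linux side but not on a machine that renders correctly would be consistent with the missing-font explanation.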
Thanks a lot for the reply.
You reminded me of one thing. The Linux OS in this issue is actually a pod running in a K8s cluster, which is protected by a network policy, meaning that unauthorized access to external networks is restricted.
Is it possible that the missing font or CMap info needs to be retrieved from the Internet?
If this is a possible fix, I will try to get the network policy updated, but I need to figure out which specific network address is needed.
> Is it possible that the missing FONT or CMAP info need to be retrieved from Internet?
To the best of my knowledge, pdfium never tries to establish any network connection. However, installing a system font such as Noto Sans CJK may require network access; presumably there is some way to install packages in that runner.
Describe the bug
I am trying to open a PDF file, get the first page, and convert it to a thumbnail image, but I get different results on macOS and Linux.
Have you tried repairing the PDF?
No. The Linux environment is cloud-based, and I am not authorized to install additional packages.
Code to reproduce the problem
PDF file
page1~3.pdf
Expected behavior
The image should be identical on both platforms.
Actual behavior
Some text is missing from the thumbnail image generated on the Linux platform.
Screenshots
From MacOS:![image](https://github.com/jsvine/pdfplumber/assets/20944084/7c315557-1fe3-4167-ac22-9c08f3d0eee0)
From Linux:![image](https://github.com/jsvine/pdfplumber/assets/20944084/5d167f11-b193-48c4-be77-65e2c47d9a1d)
Environment
Additional context
For those "missing text characters", I can get the text objects (as below) by calling the `extract_text(layout=False)` API on both platforms. Only the images generated on macOS and Linux differ.