jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Got different result of "page.to_image()" on MacOS and Linux #1115

Open zqkou opened 3 months ago

zqkou commented 3 months ago

Describe the bug

I am trying to open a PDF file, get the first page, and convert it to a thumbnail image, but I get different results on macOS and Linux.

Have you tried repairing the PDF?

No. The Linux environment is cloud-based, and I am not authorized to install additional packages.

Code to reproduce the problem

import io
from base64 import encodebytes
import pdfplumber

doc_path = "<PATH_STR>"  # placeholder: path to the PDF file
pdf = pdfplumber.open(doc_path)
p0 = pdf.pages[0]
img = p0.to_image(resolution=10)
img.show()

# or output image as base64 encoded string
byte_arr = io.BytesIO()
img.save(byte_arr, format="PNG")
encoded_img = encodebytes(byte_arr.getvalue()).decode("ascii")
print(encoded_img)

PDF file

page1~3.pdf

Expected behavior

The image should be identical on both platforms.

Actual behavior

Some text is missing from the thumbnail image generated on Linux.

Screenshots

From MacOS: image

From Linux: image

Environment

Additional context

I can retrieve the "missing" text (shown below) by calling extract_text(layout=False) on both platforms. Only the images generated on macOS and Linux differ.

随着经济全球化和知识经济时代的到来,无国界化企业经营
的趋势愈来愈明显,整个市场竞争呈现出明显的国际化和一体
化。与此同时,用户需求愈加突出个性化,导致不确定性不断
增加。此外,高新技术的迅猛发展提高了生产效率,缩短了产
品更新换代周期,加剧了市场竞争的激烈程度。因此,企业管
理如何适应新的竞争环境已成为企业家和理论工作者关注的焦
点。本章讨论了2 1世纪市场竞争环境的主要特征,指出传统管
理模式存在的问题并对此进行了讨论,最后介绍了供应链管理
产生的背景及发展的主要趋势。

jsvine commented 3 months ago

Does this only occur for resolution=10? Or also for higher/standard resolutions?

zqkou commented 3 months ago

Thanks for looking into this issue. I set the resolution to 10 to keep the base64 string reasonably small, which makes the outputs easier to compare.

We get the same issue at other resolutions.
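One way to compare the base64 strings from the two platforms without reading them by eye is to hash the decoded bytes. The following is a minimal sketch using only the standard library; the helper names `png_digest` and `same_image` are hypothetical and not part of pdfplumber.

```python
import hashlib
from base64 import decodebytes, encodebytes


def png_digest(encoded_img: str) -> str:
    """Return the SHA-256 hex digest of the bytes behind a base64 string."""
    raw = decodebytes(encoded_img.encode("ascii"))
    return hashlib.sha256(raw).hexdigest()


def same_image(encoded_a: str, encoded_b: str) -> bool:
    """Return True if both base64 strings decode to identical bytes."""
    return png_digest(encoded_a) == png_digest(encoded_b)


# Example with stand-in bytes; in practice, paste the string printed
# on each platform by the reproduction script.
mac_output = encodebytes(b"stand-in bytes from macOS").decode("ascii")
linux_output = encodebytes(b"stand-in bytes from Linux").decode("ascii")
print(same_image(mac_output, linux_output))
```

Note that this checks byte equality of the PNG files, not pixel equality. Two platforms could in principle produce the same pixels but different PNG bytes (for example, with different compression library versions), so a mismatch here is a hint rather than proof of a rendering difference.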

jsvine commented 3 months ago

Thanks for the extra information. pdfplumber delegates image generation to pypdfium2, so the only control we have over the rendering is the keyword arguments we pass to pdfium_page.render:

https://github.com/jsvine/pdfplumber/blob/147f2c4c07dc1191fc1d05bb589b4f6af3aaf74a/pdfplumber/display.py#L59-L67

I'm not sure whether pypdfium2 guarantees that this method returns the same output on all platforms.
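To check whether the difference comes from pypdfium2 rather than pdfplumber, the page can be rendered with pypdfium2 directly. The sketch below assumes the pypdfium2 v4 API (`PdfDocument`, `PdfPage.render`, `PdfBitmap.to_pil`) and assumes that pdfplumber maps its `resolution` argument to a scale of `resolution / 72`; both `resolution_to_scale` and `render_first_page` are hypothetical helper names.

```python
def resolution_to_scale(resolution: float) -> float:
    """Convert a DPI-style resolution to a render scale.

    Assumption: one PDF point is 1/72 inch, so scale = resolution / 72.
    """
    return resolution / 72


def render_first_page(pdf_path: str, resolution: float = 72):
    """Render page 0 of the PDF with pypdfium2 and return a PIL image."""
    # Imported lazily so the helper above works without pypdfium2 installed.
    import pypdfium2 as pdfium

    pdf = pdfium.PdfDocument(pdf_path)
    page = pdf[0]
    bitmap = page.render(scale=resolution_to_scale(resolution))
    # Copy so the returned image does not share memory with the bitmap.
    return bitmap.to_pil().copy()
```

Calling `render_first_page("<PATH_STR>", resolution=10)` on both platforms and comparing the results should show whether the text is already missing at the pypdfium2 level.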

zqkou commented 3 months ago

Thanks for the information. I will follow up with pypdfium2.

I have created an issue in the pypdfium2 repository: https://github.com/pypdfium2-team/pypdfium2/issues/304

mara004 commented 3 months ago

PDFs may depend on system fonts when the fonts are not embedded. For such a PDF to render correctly, those fonts must be installed. Fonts are sometimes missing on Linux because of different defaults or licensing issues.

In this case, it seems like the PDF is looking for a set of fonts named FZ*--GB1-0. Noto Sans CJK might work as a possible substitute - please check that.

FWIW, the PDF seems to render correctly on my Fedora 37 device.
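A rough way to investigate this is to list the font names the page's characters refer to and, where fontconfig is available, the font families installed for Chinese. This is a sketch under assumptions: `fonts_used` and `installed_font_families` are hypothetical helpers, `fonts_used` expects pdfplumber-style char dicts with a `fontname` key, and `fc-list` only reports what fontconfig knows about, which may not match exactly what pdfium finds at render time.

```python
import shutil
import subprocess


def fonts_used(chars):
    """Return the sorted, de-duplicated font names from char dicts."""
    return sorted({c["fontname"] for c in chars if c.get("fontname")})


def installed_font_families(lang="zh"):
    """List font families that fontconfig reports for a language.

    Returns None when the fc-list command is not available.
    """
    if shutil.which("fc-list") is None:
        return None
    result = subprocess.run(
        ["fc-list", f":lang={lang}", "family"],
        capture_output=True,
        text=True,
        check=False,
    )
    lines = (line.strip() for line in result.stdout.splitlines())
    return sorted({line for line in lines if line})


# Example with stand-in data; with pdfplumber, pass page.chars instead.
sample_chars = [{"fontname": "FontB"}, {"fontname": "FontA"}, {"fontname": "FontB"}]
print(fonts_used(sample_chars))
print(installed_font_families())
```

If `installed_font_families()` returns an empty list or `None` on the Linux machine but a populated list on macOS, a missing CJK font is a plausible cause of the missing text.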

zqkou commented 3 months ago

Thanks a lot for the reply.

That reminds me of one thing. The Linux environment in question is actually a pod running in a Kubernetes cluster, protected by a network policy that restricts unauthorized access to external networks.

Is it possible that the missing font or CMap information needs to be retrieved from the Internet?

If that is a possible fix, I will try to get the network policy updated, but I need to figure out which specific network addresses are required.

mara004 commented 3 months ago

Is it possible that the missing FONT or CMAP info need to be retrieved from Internet?

To the best of my knowledge, pdfium never tries to establish a network connection. Installing a system font such as Noto Sans CJK may require network access, but presumably there is some way to install packages in that runner.