Cannot extract text correctly for some CJK fonts

wwaguai commented 1 year ago

Hi there, we're trying to use this cool library to extract text for some processing, but it fails on the attached PDF. The PDF contains Traditional Chinese characters, but the output looks like random characters.

It looks like this PDF uses CFF-based fonts with the CIDFontType0C subtype. Is that currently unsupported by pypdf? Let us know if there's anything we can help with. We're not very familiar with the internals, but happy to help out.
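To check which font subtypes a page uses and whether each font carries a /ToUnicode map, something like the following sketch may help. The helper names (`_resolve`, `_name`, `describe_fonts`, `describe_pdf_fonts`) are made up for illustration and are not part of pypdf; the dictionary keys (/Font, /DescendantFonts, /FontDescriptor, /FontFile3) are the standard PDF ones. It only assumes that the objects behave like dictionaries and that indirect references expose `get_object()`.

```python
def _resolve(obj):
    """Follow an indirect reference when the object offers get_object(); other values pass through."""
    getter = getattr(obj, "get_object", None)
    return getter() if callable(getter) else obj


def _name(value):
    """Render a PDF name (or None) as a plain string for reporting."""
    return None if value is None else str(value)


def describe_fonts(resources):
    """Summarise every font found in a page's /Resources dictionary."""
    fonts = _resolve(resources.get("/Font")) or {}
    report = {}
    for name, ref in fonts.items():
        font = _resolve(ref)
        info = {
            "subtype": _name(_resolve(font.get("/Subtype"))),
            "has_to_unicode": "/ToUnicode" in font,
            "descendant_subtype": None,
            "font_file3_subtype": None,
        }
        # Composite (Type0) fonts keep the actual CID font under /DescendantFonts.
        for desc_ref in _resolve(font.get("/DescendantFonts")) or []:
            desc = _resolve(desc_ref)
            info["descendant_subtype"] = _name(_resolve(desc.get("/Subtype")))
            descriptor = _resolve(desc.get("/FontDescriptor")) or {}
            # An embedded CFF font program is stored as a /FontFile3 stream.
            font_file = _resolve(descriptor.get("/FontFile3"))
            if font_file is not None:
                info["font_file3_subtype"] = _name(_resolve(font_file.get("/Subtype")))
        report[str(name)] = info
    return report


def describe_pdf_fonts(path):
    """Run describe_fonts on each page of a PDF (requires pypdf to be installed)."""
    from pypdf import PdfReader  # imported lazily so the helpers above stay dependency-free

    reader = PdfReader(path)
    return [describe_fonts(_resolve(page.get("/Resources")) or {}) for page in reader.pages]
```

Calling `describe_pdf_fonts("caibao.pdf")` should return one dictionary per page; fonts reported with `has_to_unicode` set to `False` would be the first candidates to look at.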

Environment

Which environment were you using when you encountered the problem?

$ python3 -m platform
macOS-13.5-x86_64-i386-64bit

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.16.2, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=none

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader

reader = PdfReader("caibao.pdf")
# Print the extracted text of every page
for page in reader.pages:
    print(page.extract_text())

Output:
2ࣨ
˜
ɚཧɚɧϋ
ʬ˜ɧɤ˚ɚཧɚɚϋ
ʬ˜ɧɤ˚ Νˢᜊਗ
ৰ̮
ϗɝ 299,194 269,505 11%
ˣл 139,022 114,941 21%
л 80,729 67,284 20%
л 53,417 42,963 24%
л 52,009 42,032 24%
ɛ͏࿆ʩ
 Ñਿ͉ 5.486 4.407 24%
 Ñᛅᑛ 5.334 4.320 23%
л 98,511 73,205 35%
л 70,086 53,684 31%
ɛ͏࿆ʩ
 Ñਿ͉ 7.393 5.628 31%
 Ñᛅᑛ 7.236 5.516 31%

PDF: caibao.pdf

MartinThoma commented 1 year ago

Interesting. The first check passed: I can copy the text without issues with the Chrome PDF viewer.

pdfium2 gives:

2
未經審核
截至下列日期止六個月
二零二三年
六月三十日
二零二二年
六月三十日 同比變動
（人民幣百萬元，另有指明者除外）
收入 299,194 269,505 11%
毛利 139,022 114,941 21%
經營盈利 80,729 67,284 20%
期內盈利 53,417 42,963 24%
本公司權益持有人應佔盈利 52,009 42,032 24%
每股盈利（每股人民幣元）
－基本 5.486 4.407 24%
－攤薄 5.334 4.320 23%
非國際財務報告準則經營盈利 98,511 73,205 35%
非國際財務報告準則本公司權益持有人應佔盈利 70,086 53,684 31%
非國際財務報告準則每股盈利（每股人民幣元）
－基本 7.393 5.628 31%
－攤薄 7.236 5.516 31%

So it definitely is a shortcoming of pypdf. Thanks for sharing!
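For a regression test, one rough way to tell the two outputs above apart is to measure how much of the extracted text consists of CJK ideographs. This is only a heuristic sketch: `cjk_ratio` is a hypothetical helper, and it only counts the basic CJK Unified Ideographs block (U+4E00 to U+9FFF).

```python
def cjk_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are CJK Unified Ideographs."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    # Only the basic block U+4E00..U+9FFF is counted; extension blocks are ignored.
    cjk = sum(1 for ch in chars if "\u4e00" <= ch <= "\u9fff")
    return cjk / len(chars)
```

A test could then assert that `cjk_ratio(page.extract_text())` is greater than zero for this page, since the broken output contains no characters from that block at all.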

MartinThoma commented 1 year ago

@wwaguai Do you own the copyright on caibao.pdf? May I add it to https://github.com/py-pdf/sample-files so that we can use it for testing?

wwaguai commented 1 year ago

It's publicly available and can be downloaded from the internet, so I think it can be used there for testing.

stefan6419846 commented 1 year ago

There is a difference between publicly available files, which we already use for regular testing, and the files in the sample-files repository. The latter are subject to a Creative Commons license, which you can usually only grant if you are the owner/creator of the file.

wwaguai commented 1 year ago

https://static.www.tencent.com/uploads/2023/08/29/1d726a2226130c610975c21480cf1890.PDF You can probably reproduce it using this file (it's Tencent's financial report, the same document our sample came from). That said, I don't think it's under a Creative Commons license, and sorry, apparently I'm not its creator. The issue can be reproduced with the font MHeiHK-Bold; however, I don't hold the copyright for that font, so I'm not sure whether it can be used in this case. That said, here's a very simple example using it: caibao2.pdf

ssjkamei commented 1 month ago

I found where the font's character code is stored. The code 842 is obtained during analysis, so I think extraction should be possible once we decide where to look that code up when mapping. When the font mapping is created, everything seems to be skipped if there is no /ToUnicode entry, so that area looks like it needs to change (the code is in parse_to_unicode).
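As a rough illustration of why a missing map would produce random-looking characters, here is a minimal sketch. It is not pypdf's actual implementation: `split_codes` and `decode_codes` are hypothetical helpers, two-byte character codes are an assumption, and the mapping of 842 to a specific character is invented for the example.

```python
def split_codes(raw: bytes, width: int = 2) -> list[int]:
    """Split a string operand into fixed-width big-endian character codes (two bytes assumed)."""
    return [int.from_bytes(raw[i:i + width], "big") for i in range(0, len(raw), width)]


def decode_codes(codes, to_unicode=None):
    """Map character codes to text, falling back to chr(code) when no mapping is known."""
    to_unicode = to_unicode or {}
    # Without a mapping the raw code is treated as a Unicode code point,
    # which would yield unrelated characters for CID-keyed fonts.
    return "".join(to_unicode.get(code, chr(code)) for code in codes)


codes = split_codes(b"\x03\x4a\x00\x32")        # [842, 50]
garbled = decode_codes(codes)                   # no map: chr(842) + chr(50)
mapped = decode_codes(codes, {842: "收", 50: "2"})  # invented mapping, for illustration only
```

If this is close to what happens, the fix would be to build the code-to-Unicode table from another source when /ToUnicode is absent, rather than skipping the mapping entirely.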

pubpub-zz commented 1 month ago

@ssjkamei Can you tell me which PDF internal structure viewer you are using?

ssjkamei commented 1 month ago

I am using Adobe Acrobat.

py-pdf / pypdf

Cannot extract text correctly for some CJK fonts #2295
