Open wwaguai opened 1 year ago
Interesting. The first check passed: I can copy the text without issues with the Chrome PDF viewer.
pdfium2 gives:
2
未經審核
截至下列日期止六個月
二零二三年
六月三十日
二零二二年
六月三十日 同比變動
(人民幣百萬元,另有指明者除外)
收入 299,194 269,505 11%
毛利 139,022 114,941 21%
經營盈利 80,729 67,284 20%
期內盈利 53,417 42,963 24%
本公司權益持有人應佔盈利 52,009 42,032 24%
每股盈利(每股人民幣元)
-基本 5.486 4.407 24%
-攤薄 5.334 4.320 23%
非國際財務報告準則經營盈利 98,511 73,205 35%
非國際財務報告準則本公司權益持有人應佔盈利 70,086 53,684 31%
非國際財務報告準則每股盈利(每股人民幣元)
-基本 7.393 5.628 31%
-攤薄 7.236 5.516 31%
So it definitely is a shortcoming of pypdf. Thanks for sharing!
@wwaguai Do you own the copyright on caibao.pdf? May I add it to https://github.com/py-pdf/sample-files so that we can use it for testing?
It's publicly available and can be downloaded from the internet, so I think it can be used there for testing.
There is a difference between publicly available files, which we already use for regular testing, and the files in the sample-files repository, which are subject to a Creative Commons license that you can usually only grant if you are the owner/creator of the file.
You can probably reproduce the problem using this file: https://static.www.tencent.com/uploads/2023/08/29/1d726a2226130c610975c21480cf1890.PDF (it's Tencent's financial report, the same source the sample came from). That said, I don't think it is under a Creative Commons license, and sorry, apparently I'm not its creator. The problem can be reproduced with the font MHeiHK-Bold, but I do not hold the copyright for that font, so I'm not sure it can be used for this case. Here is a very simple example using that font: caibao2.pdf
I found where the font character code is stored. 842 is obtained during analysis, so I think the mapping should be possible once we work out where to look the code up. When creating a font mapping, the code seems to skip everything if there is no /ToUnicode, so it looks like that area needs to be changed. (Code in parse_to_unicode)
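
To illustrate the idea above, here is a minimal sketch of a mapping step that does not give up when /ToUnicode is missing. It is not pypdf's actual code: `build_char_map`, `decode`, and the dict-based font model are hypothetical, the lookup table holds made-up placeholder entries, and the sketch assumes an Identity-H style encoding in which the character code equals the CID. The fallback follows the general approach of the PDF specification, which uses the /CIDSystemInfo registry and ordering to select a CID-to-Unicode table.

```python
from typing import Dict, Iterable

# Hypothetical placeholder data: a real implementation would load the full
# CID-to-Unicode table for the character collection. These CIDs are made up.
PLACEHOLDER_CID_TO_UNICODE: Dict[str, Dict[int, str]] = {
    "Adobe-CNS1": {1: "\u4e00", 2: "\u4e8c"},
}


def build_char_map(font: dict) -> Dict[int, str]:
    """Return a code -> text map for a simplified, dict-based font model."""
    to_unicode = font.get("/ToUnicode")
    if to_unicode is not None:
        # Assumption: the /ToUnicode CMap has already been parsed into a dict.
        return dict(to_unicode)

    # No /ToUnicode: fall back to the character collection instead of skipping.
    info = font.get("/CIDSystemInfo", {})
    collection = f"{info.get('/Registry')}-{info.get('/Ordering')}"
    return dict(PLACEHOLDER_CID_TO_UNICODE.get(collection, {}))


def decode(codes: Iterable[int], font: dict) -> str:
    """Decode character codes, using U+FFFD for codes without a mapping."""
    char_map = build_char_map(font)
    return "".join(char_map.get(code, "\ufffd") for code in codes)
```

Unmapped codes become U+FFFD rather than being dropped, so a missing table entry stays visible in the output.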
@ssjkamei Can you tell me which PDF internal structure viewer you are using?
I am using Adobe Acrobat.
Hi there, we're trying to use this cool library to extract text for some processing, but it fails on the attached PDF. The PDF contains Traditional Chinese characters, but the output looks like random characters.
It looks like this PDF uses a CFF-based CIDFontType0C font subtype. Is that currently unsupported by pypdf? Let us know if there's anything we can help with. We're not super familiar with the internals, but happy to help out.
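
For anyone who wants to detect this symptom automatically, below is a rough heuristic sketch. It is not part of pypdf: `cjk_ratio`, `looks_garbled`, and the 0.2 threshold are all hypothetical choices. It measures how many of the non-whitespace characters in the extracted text fall in the CJK Unified Ideographs block (U+4E00 to U+9FFF), on the assumption that a page known to be Chinese should contain a fair share of them.

```python
def cjk_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in the CJK Unified Ideographs block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    cjk = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return cjk / len(chars)


def looks_garbled(text: str, threshold: float = 0.2) -> bool:
    """Heuristic: text expected to be Chinese but containing few ideographs."""
    return cjk_ratio(text) < threshold
```

The threshold needs tuning for number-heavy pages such as financial tables, where digits can outnumber ideographs even in a correct extraction.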
Environment
Which environment were you using when you encountered the problem?
Code + PDF
This is a minimal, complete example that shows the issue:
PDF: caibao.pdf