py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

Implementation of advanced cmap encodings #2356

Closed stefan6419846 closed 2 months ago

stefan6419846 commented 10 months ago

Currently, I am trying to extract text from PDF files, some of which emit warnings like:

/home/stefan/temp/venv/lib/python3.9/site-packages/pypdf/_cmap.py:183: PdfReadWarning: Advanced encoding /GBK2K-H not implemented yet
  warnings.warn(
/home/stefan/temp/venv/lib/python3.9/site-packages/pypdf/_cmap.py:183: PdfReadWarning: Advanced encoding /GBK2K-V not implemented yet
  warnings.warn(

I have seen this for both encodings mentioned above and for /StandardEncoding.

Digging through the available resources related to the GBK2K cmaps, I found some Adobe resources as well as the implementation from pdfminer.six, which ships some custom pickled files derived from the Adobe open source components to handle such cases.

Is there any guidance available on how to tackle this or how we would like to see this added to pypdf?

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.14.21-150400.24.100-default-x86_64-with-glibc2.31

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.17.3, crypt_provider=('pycryptodome', '3.18.0'), PIL=10.0.0

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader

reader = PdfReader('file.pdf')
page = reader.pages[0]
print(page.extract_text())

For now, I have no non-confidential file I could share here. Looking at the example file, it appears to be a scan of a document (from a Canon device?) containing Latin characters, with misconfigured or otherwise odd OCR that yields a mix of Latin and Chinese characters in the text layer.

Traceback

warnings.warn, as currently used, only prints the pypdf code line where the warning was raised, so there is not much of a traceback.
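If a full traceback is needed, one option is to escalate the warning to an exception for the duration of the call. The following is a minimal, self-contained sketch using only the standard library. DemoReadWarning and parse_encoding are hypothetical stand-ins for the pypdf warning class and the code in _cmap.py; with pypdf installed, the filter would presumably target pypdf.errors.PdfReadWarning instead.

```python
import traceback
import warnings


class DemoReadWarning(UserWarning):
    """Hypothetical stand-in for the warning class used by the library."""


def parse_encoding(name: str) -> None:
    # Stand-in for the library code that emits the warning.
    warnings.warn(f"Advanced encoding {name} not implemented yet", DemoReadWarning)


def capture_traceback(name: str) -> str:
    """Return the formatted traceback of the first DemoReadWarning raised."""
    with warnings.catch_warnings():
        # "error" turns matching warnings into exceptions inside this block only.
        warnings.simplefilter("error", DemoReadWarning)
        try:
            parse_encoding(name)
        except DemoReadWarning:
            return traceback.format_exc()
    return ""


if __name__ == "__main__":
    print(capture_traceback("/GBK2K-H"))
```

Because the filter is set inside catch_warnings(), the global warning configuration is restored afterwards.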

MartinThoma commented 10 months ago

Is there any guidance available on how to tackle this or how we would like to see this added to pypdf?

No, there is none. I guess only @pubpub-zz can help you with that.

pubpub-zz commented 10 months ago

@stefan6419846 try to modify _cmap.py with

_predefined_cmap: Dict[str, str] = {
    "/Identity-H": "utf-16-be",
    "/Identity-V": "utf-16-be",
    "/GB-EUC-H": "gbk",  # TBC
    "/GB-EUC-V": "gbk",  # TBC
    "/GBpc-EUC-H": "gb2312",  # TBC
    "/GBpc-EUC-V": "gb2312",  # TBC
    "/GBK-EUC-H": "gbk",  # TBC
    "/GBK-EUC-V": "gbk",  # TBC
    "/GBK2K-H": "gb18030",  # <- new
    "/GBK2K-V": "gb18030", # <- new
    # UCS2 in code
}
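To illustrate what such a table entry does, here is a small sketch of resolving a predefined CMap name to a Python codec and decoding raw bytes with it. PREDEFINED_CMAP and decode_with_cmap are hypothetical names for this example, not the pypdf implementation, and the sketch assumes the string bytes are plain gb18030-encoded text.

```python
from typing import Dict, Optional

# Hypothetical excerpt of the table above; not the pypdf source itself.
PREDEFINED_CMAP: Dict[str, str] = {
    "/GBK-EUC-H": "gbk",
    "/GBK2K-H": "gb18030",
    "/GBK2K-V": "gb18030",
}


def decode_with_cmap(raw: bytes, cmap_name: str) -> Optional[str]:
    """Decode raw string bytes using the codec mapped to cmap_name, if any."""
    codec = PREDEFINED_CMAP.get(cmap_name)
    if codec is None:
        return None  # unknown CMap: the caller has to fall back or warn
    # "replace" substitutes U+FFFD for undecodable bytes instead of raising.
    return raw.decode(codec, errors="replace")


sample = "中文 PDF".encode("gb18030")
print(decode_with_cmap(sample, "/GBK2K-H"))
```

gb18030 also covers characters that need four bytes, which the gbk codec cannot encode, so mapping the GBK2K names to gbk would likely lose such characters.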

stefan6419846 commented 10 months ago

@pubpub-zz Thanks for pointing this out. It seems to indeed work.

When looking at this, two questions arose for me:

actuary-chen commented 4 months ago

@stefan6419846 try to modify _cmap.py with

_predefined_cmap: Dict[str, str] = {
    "/Identity-H": "utf-16-be",
    "/Identity-V": "utf-16-be",
    "/GB-EUC-H": "gbk",  # TBC
    "/GB-EUC-V": "gbk",  # TBC
    "/GBpc-EUC-H": "gb2312",  # TBC
    "/GBpc-EUC-V": "gb2312",  # TBC
    "/GBK-EUC-H": "gbk",  # TBC
    "/GBK-EUC-V": "gbk",  # TBC
    "/GBK2K-H": "gb18030",  # <- new
    "/GBK2K-V": "gb18030", # <- new
    # UCS2 in code
}

Similar issues occur for "/UniCNS-UTF16-H", "/ETen-B5-H", "/ETen-B5-V", and "/ETenms-B5-H". How should _cmap.py be modified for these?

pubpub-zz commented 4 months ago

@actuary-chen can you please share your pdf for analysis?

actuary-chen commented 4 months ago

Hi,

Maybe these two files are relevant.

Benjamin

pubpub-zz @.***> wrote on Wednesday, June 19, 2024 at 7:26 PM:

@actuary-chen https://github.com/actuary-chen can you please share your pdf for analysis?

pubpub-zz commented 4 months ago

@actuary-chen the files are not attached. Please attach them directly in the thread

actuary-chen commented 4 months ago

FBL01-1.pdf FBL01-2.pdf

The issues may come from files such as the attached ones.

pubpub-zz commented 4 months ago

@actuary-chen this is the updated table. Your files did not contain UniCNS-UTF16-H. Can you check that it works with the new table?

_predefined_cmap: Dict[str, str] = {
    "/Identity-H": "utf-16-be",
    "/Identity-V": "utf-16-be",
    "/GB-EUC-H": "gbk",  # TBC
    "/GB-EUC-V": "gbk",  # TBC
    "/GBpc-EUC-H": "gb2312",  # TBC
    "/GBpc-EUC-V": "gb2312",  # TBC
    "/GBK-EUC-H": "gbk",  # TBC
    "/GBK-EUC-V": "gbk",  # TBC
    "/GBK2K-H": "gb18030",
    "/GBK2K-V": "gb18030",
    "/ETen-B5-H": "cp950",
    "/ETen-B5-V": "cp950",
    "/ETenms-B5-H": "cp950",
    "/ETenms-B5-V": "cp950",
    "/UniCNS-UTF16-H": "utf-16-be", # TBC
    "/UniCNS-UTF16-V": "utf-16-be", # TBC
    # UCS2 in code
}
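A quick sanity check for a table like this is to confirm that every codec name is actually known to Python. The sketch below copies the table into a hypothetical PREDEFINED_CMAP constant and uses codecs.lookup from the standard library; unknown_codecs is an illustrative helper, not part of pypdf. It only verifies that the codec names resolve, not that each codec is the right match for its CMap, so the entries marked TBC remain unconfirmed.

```python
import codecs
from typing import Dict, List

# Copy of the table above, under a hypothetical name for this example.
PREDEFINED_CMAP: Dict[str, str] = {
    "/Identity-H": "utf-16-be",
    "/Identity-V": "utf-16-be",
    "/GB-EUC-H": "gbk",  # TBC
    "/GB-EUC-V": "gbk",  # TBC
    "/GBpc-EUC-H": "gb2312",  # TBC
    "/GBpc-EUC-V": "gb2312",  # TBC
    "/GBK-EUC-H": "gbk",  # TBC
    "/GBK-EUC-V": "gbk",  # TBC
    "/GBK2K-H": "gb18030",
    "/GBK2K-V": "gb18030",
    "/ETen-B5-H": "cp950",
    "/ETen-B5-V": "cp950",
    "/ETenms-B5-H": "cp950",
    "/ETenms-B5-V": "cp950",
    "/UniCNS-UTF16-H": "utf-16-be",  # TBC
    "/UniCNS-UTF16-V": "utf-16-be",  # TBC
}


def unknown_codecs(table: Dict[str, str]) -> List[str]:
    """Return the CMap names whose codec Python cannot resolve."""
    missing = []
    for cmap_name, codec_name in table.items():
        try:
            codecs.lookup(codec_name)
        except LookupError:
            missing.append(cmap_name)
    return missing


print(unknown_codecs(PREDEFINED_CMAP))
```

An empty list means all codec names resolve on the running interpreter.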

pubpub-zz commented 2 months ago

This issue seems solved. I don't know why it has not been closed automatically.

stefan6419846 commented 2 months ago

This has not been closed before as I was looking for a generic solution for implementing all possible encodings in one step instead of opening a new issue for each one.

pubpub-zz commented 2 months ago

We need to check the encodings. I cannot see a global solution.