pdfminer / pdfminer.six

Community maintained fork of pdfminer - we fathom PDF
https://pdfminersix.readthedocs.io
MIT License

Fix #934: Create correct cidcoding name #935

Closed aoking closed 10 months ago

aoking commented 10 months ago

Pull request

In some PDF files, the cmap data could not be read correctly because the name used to look up the cmap file contained unintended whitespace. This fix strips that whitespace so that the cmap can be read correctly for those files.

The PDF file where this occurs is bs104761.pdf in #934.

The PDF contained whitespace in cid_registry and cid_ordering.

cid_registry: Adobe
cid_ordering: Japan1\n\n\n\n\n\n\n\n\n\n

Therefore, strip() was used to remove the whitespace characters.
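
As a rough illustration of the idea (not the actual diff), the cmap name can be thought of as the registry and ordering joined by a hyphen, with each part stripped first. The helper name `make_cidcoding` is hypothetical and only used here for illustration; the real change lives inside pdfminer's font-handling code.

```python
def make_cidcoding(cid_registry: str, cid_ordering: str) -> str:
    """Build a cmap name such as 'Adobe-Japan1' from its two parts.

    Hypothetical helper for illustration only. strip() removes leading
    and trailing whitespace, including the trailing newlines seen in
    bs104761.pdf.
    """
    return f"{cid_registry.strip()}-{cid_ordering.strip()}"


# Values as reported for bs104761.pdf: ten trailing newlines in the ordering.
registry = "Adobe"
ordering = "Japan1" + "\n" * 10

print(repr(f"{registry}-{ordering}"))              # unstripped name, newlines intact
print(repr(make_cidcoding(registry, ordering)))    # 'Adobe-Japan1'
```

Without the strip() calls, the name keeps its trailing newlines and the cmap lookup fails, which is consistent with the (cid:N) output shown under "Behavior before modification".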

How Has This Been Tested?

With the fix applied, text is extracted correctly from bs104761.pdf.

$ python tools/pdf2txt.py bs104761.pdf | head
WARNING:pdfminer.pdfpage:The PDF <_io.BufferedReader name='bs104761.pdf'> contains a metadata field indicating that it should not allow text extraction. Ignoring this field and proceeding. Use the check_extractable if you want to raise an error in this case
商号 アルファテックス株式会社

貸 借 対 照 表

令 和 3 年 3 月 31 日 現 在

代表者 石川 春

科     目
産

Behavior before modification:

$ python tools/pdf2txt.py bs104761.pdf | head
WARNING:pdfminer.pdfpage:The PDF <_io.BufferedReader name='bs104761.pdf'> contains a metadata field indicating that it should not allow text extraction. Ignoring this field and proceeding. Use the check_extractable if you want to raise an error in this case
(cid:2446)(cid:2040)(cid:633)(cid:926)(cid:999)(cid:977)(cid:925)(cid:962)(cid:959)(cid:939)(cid:949)(cid:1490)(cid:2268)(cid:1393)(cid:2302)

(cid:2879) (cid:2310) (cid:2864) (cid:2480) (cid:3503)

(cid:4009) (cid:4072) (cid:250) (cid:3301) (cid:250) (cid:1860) (cid:250)(cid:248) (cid:3284) (cid:1905) (cid:2127)

(cid:2885)(cid:3503)(cid:2304)(cid:231)(cid:2676)(cid:2706)(cid:633)(cid:2399)

(cid:1354)(cid:633)(cid:633)(cid:633)(cid:633)(cid:633)(cid:3816)
(cid:2184)


synceokhou commented 6 days ago

I found more unintended characters in cid_registry and cid_ordering, such as \x0b and \r, and strip() does not remove them. Is there another solution for this case?
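
One possible approach, sketched here under assumptions and not taken from the PR: Python's str.strip() with no arguments already removes \x0b and \r at the start or end of a string, so if they survive they are presumably embedded inside the value. In that case a regular expression can remove whitespace and ASCII control characters anywhere in the string. The helper name `clean_cid_name` is hypothetical, and the sketch assumes that valid registry and ordering names never contain whitespace.

```python
import re

# Whitespace plus ASCII control characters (0x00-0x1f and 0x7f).
_UNWANTED = re.compile(r"[\s\x00-\x1f\x7f]+")


def clean_cid_name(value: str) -> str:
    """Remove whitespace and control characters anywhere in the string.

    Hypothetical helper. Assumes registry/ordering names such as
    'Adobe' or 'Japan1' never legitimately contain whitespace.
    """
    return _UNWANTED.sub("", value)


print(repr("Japan1\x0b\r".strip()))            # strip() handles trailing cases
print(repr("Jap\x0ban1\r".strip()))            # embedded \x0b survives strip()
print(repr(clean_cid_name("Jap\x0ban1\r")))    # embedded and trailing both removed
```

If the values arrive as bytes rather than str, they would need to be decoded first, or the pattern rewritten as a bytes pattern.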