In some PDF files, cmap data could not be read correctly.
This was due to unintentional whitespace in the filename used to read the cmap file.
This fix removes that whitespace so that the cmap is read correctly in those PDF files.
The PDF file where this occurs is bs104761.pdf in #934.
The PDF contained whitespace in cid_registry and cid_ordering.
Therefore, strip() was used to remove the whitespace characters.
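A minimal sketch of the idea, not the actual patch. It assumes the cmap name is formed by joining the registry and ordering strings with a hyphen; build_cmap_name is a hypothetical helper used only for illustration and is not part of pdfminer.

```python
def build_cmap_name(registry: bytes, ordering: bytes) -> str:
    """Hypothetical helper: join registry and ordering into a cmap name."""
    # Decode the raw PDF strings, then strip leading/trailing whitespace
    # so that stray newlines do not end up in the cmap file name.
    cid_registry = registry.decode("latin1").strip()
    cid_ordering = ordering.decode("latin1").strip()
    return f"{cid_registry}-{cid_ordering}"


# An ordering value padded with trailing newlines, as reported for bs104761.pdf.
print(build_cmap_name(b"Adobe", b"Japan1\n\n\n\n\n\n\n\n\n\n"))  # Adobe-Japan1
```

Without the strip() calls, the trailing newlines would become part of the returned name, and a lookup keyed on that name would fail.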
How Has This Been Tested?
With the corrected version, text extraction works correctly on bs104761.pdf.
$ python tools/pdf2txt.py bs104761.pdf | head
WARNING:pdfminer.pdfpage:The PDF <_io.BufferedReader name='bs104761.pdf'> contains a metadata field indicating that it should not allow text extraction. Ignoring this field and proceeding. Use the check_extractable if you want to raise an error in this case
商号 アルファテックス株式会社
貸 借 対 照 表
令 和 3 年 3 月 31 日 現 在
代表者 石川 春
科 目
産
Behavior before modification:
$ python tools/pdf2txt.py bs104761.pdf | head
WARNING:pdfminer.pdfpage:The PDF <_io.BufferedReader name='bs104761.pdf'> contains a metadata field indicating that it should not allow text extraction. Ignoring this field and proceeding. Use the check_extractable if you want to raise an error in this case
(cid:2446)(cid:2040)(cid:633)(cid:926)(cid:999)(cid:977)(cid:925)(cid:962)(cid:959)(cid:939)(cid:949)(cid:1490)(cid:2268)(cid:1393)(cid:2302)
(cid:2879) (cid:2310) (cid:2864) (cid:2480) (cid:3503)
(cid:4009) (cid:4072) (cid:250) (cid:3301) (cid:250) (cid:1860) (cid:250)(cid:248) (cid:3284) (cid:1905) (cid:2127)
(cid:2885)(cid:3503)(cid:2304)(cid:231)(cid:2676)(cid:2706)(cid:633)(cid:2399)
(cid:1354)(cid:633)(cid:633)(cid:633)(cid:633)(cid:633)(cid:3816)
(cid:2184)
I found more unintended characters in cid_registry and cid_ordering, such as \x0b and \r, but strip() is not able to remove these characters. Is there any other solution for this situation?
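One possible approach, sketched under the assumption that the extra characters sit inside the value rather than only at its ends. Python's str.strip() already removes leading and trailing \x0b and \r, since both count as whitespace, but it leaves interior characters alone. Removing every whitespace character with a regular expression covers both cases. remove_whitespace is a hypothetical helper, and the sample values are made up for illustration.

```python
import re


def remove_whitespace(value: str) -> str:
    """Hypothetical helper: drop every whitespace character from value."""
    # \s matches space, \t, \n, \r, \x0b (vertical tab) and \x0c (form feed).
    return re.sub(r"\s+", "", value)


# strip() handles whitespace at the ends, including \x0b and \r ...
print(repr("Japan1\x0b\r\n".strip()))           # 'Japan1'
# ... but not whitespace in the middle of the value.
print(repr("Jap\x0ban1\r".strip()))             # 'Jap\x0ban1'
print(repr(remove_whitespace("Jap\x0ban1\r")))  # 'Japan1'
```

This only suits values that never legitimately contain whitespace, which is assumed here for registry and ordering names such as Adobe and Japan1.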
The values in bs104761.pdf were:
cid_registry: Adobe
cid_ordering: Japan1\n\n\n\n\n\n\n\n\n\n