Bug: pdfminer.six fails to read certain japanese fonts and returns cid value

Kurinosuke118 commented 6 months ago

Bug report

pdfminer.six fails to read certain Japanese fonts and returns cid values of the form (cid:xxx) instead of characters. I think this happens because pdfminer's CMap cannot convert the CIDs of a particular Japanese font to character codes.

To Reproduce

This bug occurs when the following Japanese PDF file is parsed with the following code.

PDF file: https://www.maff.go.jp/j/tokei/kouhyou/naisui_gyosei/attach/pdf/index-15.pdf

Code

from pdfminer.high_level import extract_text

text = extract_text("xxx.pdf")
print(text)

Output

...
(cid:1748)(cid:10492)年(cid:2233)(cid:2286)(cid:2225)

年(cid:2057)(cid:7711)
...
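
To measure how much of the extracted text is affected, the literal (cid:N) tokens can be counted. The following is a minimal sketch, not part of pdfminer.six: the helper name unresolved_cids is hypothetical, and the sample string is copied from the output above.

```python
import re
from collections import Counter

# Matches the literal "(cid:N)" tokens seen in the output above.
CID_TOKEN = re.compile(r"\(cid:(\d+)\)")


def unresolved_cids(text):
    """Count each CID number that appears as a literal (cid:N) token.

    Hypothetical helper for illustration; it only inspects the text.
    """
    return Counter(int(number) for number in CID_TOKEN.findall(text))


# Sample taken from the output in this report.
sample = "(cid:1748)(cid:10492)年(cid:2233)(cid:2286)(cid:2225)\n\n年(cid:2057)(cid:7711)"
counts = unresolved_cids(sample)
print(sorted(counts))
```

Running the same helper on the result of extract_text before and after a CMap update would show whether the number of unresolved CIDs goes down.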

Comment

I think the contents of [1] need to be updated with the contents of [2], and the CMap then needs to be regenerated with tools/conv_cmap.py in the pdfminer.six repository.

[1] https://github.com/pdfminer/pdfminer.six/blob/master/cmaprsrc/cid2code_Adobe_Japan1.txt
[2] https://github.com/adobe-type-tools/cmap-resources/blob/master/Adobe-Japan1-7/cid2code.txt

pietermarsman commented 6 months ago

Thanks for the tip! Do you happen to know if Japan1-7 is a superset of Japan1? In other words, that we are fully backwards compatible if we switch from Japan1 to Japan1-7?
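
One way to check the superset question is to compare the two tables directly. The sketch below assumes the cid2code files are tab-separated, with "#" comment lines, a header row whose first column is CID, and "*" marking an unmapped cell; that layout should be confirmed against the real files. The function names and the tiny in-memory tables are made up for illustration.

```python
import io


def load_cid2code(lines):
    """Parse cid2code-style rows into {cid: {column_name: value}}.

    Assumes tab-separated columns, '#' comment lines, a header row,
    and '*' for "no mapping" (these are assumptions, see the text above).
    """
    rows = [ln.rstrip("\n").split("\t") for ln in lines
            if ln.strip() and not ln.startswith("#")]
    header, data = rows[0], rows[1:]
    table = {}
    for row in data:
        cid = int(row[0])
        table[cid] = {name: val for name, val in zip(header[1:], row[1:])
                      if val != "*"}
    return table


def missing_or_changed(old, new):
    """List (cid, column) pairs that `old` maps but `new` lacks or maps differently."""
    return [(cid, name)
            for cid, cols in old.items()
            for name, val in cols.items()
            if new.get(cid, {}).get(name) != val]


# Made-up miniature tables; real use would pass open() file handles instead.
old_text = "# comment\nCID\tH\tUniJIS-UTF8\n1\t20\t20\n2\t21\t*\n"
new_text = "CID\tH\tUniJIS-UTF8\n1\t20\t20\n2\t21\t21\n3\t22\t22\n"
old = load_cid2code(io.StringIO(old_text))
new = load_cid2code(io.StringIO(new_text))
print(missing_or_changed(old, new))  # empty list means `new` covers `old`
```

An empty result would suggest the newer table keeps every existing mapping. Any listed pair is a CID whose mapping was dropped or changed and would need a closer look before switching.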

Kurinosuke118 commented 6 months ago

@pietermarsman Thanks for the reply!

Do you happen to know if Japan1-7 is a superset of Japan1?

I think that Adobe-Japan1-7/cid2code.txt is a superset of cid2code_Adobe_Japan1.txt. If I replace the contents of cid2code_Adobe_Japan1.txt with those of Adobe-Japan1-7/cid2code.txt, do I need to create a *.pickle.gz file? Please tell me how to create the *.pickle.gz file, and I'll try this.

pietermarsman commented 6 months ago

I'm not an expert on this myself; the files were already there when I first ran into pdfminer. But I can give you some directions:

The Makefile has a couple of commands to build the pickle files
It uses the tools/conv_cmap.py file to do the conversion
You probably want to add/change one of the cmaps in cmaprsrc.
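
For sanity-checking a rebuilt file, the following sketch shows how a *.pickle.gz file can be written and read back, assuming (as the extension suggests) that it is a gzip-compressed pickle. It does not reproduce what tools/conv_cmap.py writes; the helper names and the demo dictionary are hypothetical, and the Makefile remains the reference for the real build.

```python
import gzip
import pickle
import tempfile
from pathlib import Path


def dump_pickle_gz(obj, path):
    """Write obj as a gzip-compressed pickle (hypothetical helper)."""
    with gzip.open(path, "wb") as fh:
        pickle.dump(obj, fh)


def load_pickle_gz(path):
    """Read a gzip-compressed pickle. Only use on files you trust."""
    with gzip.open(path, "rb") as fh:
        return pickle.load(fh)


# Round trip with a made-up mapping, not the real CMap structure.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.pickle.gz"
    dump_pickle_gz({"demo_map": {1: " ", 2: "!"}}, demo_path)
    loaded = load_pickle_gz(demo_path)

print(loaded)
```

Loading one of the generated files with load_pickle_gz before and after the rebuild, and comparing the top-level keys and sizes, would be a quick way to confirm that the conversion picked up the new entries.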

pdfminer / pdfminer.six

Bug: pdfminer.six fails to read certain japanese fonts and returns cid value #927