maxpmaxp / pdfreader

Python API for PDF documents
MIT License

Decoding issue - (cid:XX) #81

Closed by maxpmaxp 2 years ago

maxpmaxp commented 3 years ago

See https://github.com/maxpmaxp/pdfreader/issues/77#issuecomment-800916666 MTB0321.pdf

I have installed the latest version, but it still does not work well. I guess that you replaced the values in (cid:) with ASCII characters. I tried that too, but it didn't work. This is the result I'm getting:

SUHVVXP\x031HHUDFKHU\x030LWWHLOXQJVEODWW\x035HGDNWLRQ\x03_\x03/D\\RXW\x03*HPHLQGHYHUZDOWXQJ\x031HHUDFK\x03\x03\x03\x037LWHOELOG\x03(GLWK\x036HQQ\x03_\x031HHUDFK\x03\x03\x03$XIODJH\x03XQG\x039HUVDQG\x03\x14µ\x19\x19\x13\x03([HPSODUH\x03_\x035HF\\FOLQJSDSLHU\x03_\x03HUVFKHLQW\x03PRQDWOLFK\x03\x03\x03DQ\x03DOOH\x03+DXVKDOWXQJHQ\x03GHU\x03*HPHLQGH\x031HHUDFK\x03\x03\x03'UXFN\x03JQGUXFN\x03$*\x03_\x03%DFKHQE\x81ODFK\x03\x03\x035HGDNWLRQVVFKOXVV\x03MHZHLOV\x03GHU\x03\x14\x15\x11\x037DJ\x03GHV\x030RQDWV\x03\x03\x03\n\nPage: 2 / 32\n\n\n9HUKDQGOXQJHQ\x03GHV\x03*HPHLQGHUDWHV\x03\x16_\x15\x13\x15\x14\x03\x03\x03\x14\x03&RURQDYLUXV\x03,QIRUPDWLRQHQ\x03_\x036WDQG\x03\x14\x19\x11\x03)HEUXDU\x03\x15\x13\x15\x14\x036lPWOLFKH\x03,QIRUP]()

I guess that you translated the numbers inside (cid:) to characters using ASCII. That way you get characters, but not the correct ones. I fixed the issue by adding 29 to each number inside (cid:...).
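For anyone who wants to try the same workaround, here is a minimal sketch of the "add 29" fix. The function names and the `CID_OFFSET` constant are made up for illustration and are not part of pdfreader. The offset is the one reported in this thread and is assumed to apply only to this PDF. It may also not hold for every glyph, for example characters outside the basic Latin range.

```python
import re

# Offset reported in this thread; assumed to be specific to this PDF.
CID_OFFSET = 29

# Matches literal "(cid:NN)" tokens as emitted by some text extractors.
CID_TOKEN = re.compile(r"\(cid:(\d+)\)")


def shift_text(text: str, offset: int = CID_OFFSET) -> str:
    """Shift every character code in an already-decoded string by `offset`."""
    return "".join(chr(ord(ch) + offset) for ch in text)


def replace_cid_tokens(text: str, offset: int = CID_OFFSET) -> str:
    """Replace each "(cid:NN)" token with the character chr(NN + offset)."""
    return CID_TOKEN.sub(lambda m: chr(int(m.group(1)) + offset), text)


if __name__ == "__main__":
    # Start of the garbled output quoted above.
    garbled = "SUHVVXP\x031HHUDFKHU\x030LWWHLOXQJVEODWW"
    print(shift_text(garbled))
    print(replace_cid_tokens("(cid:44)(cid:80)"))
```

Applied to the start of the output above, the shift yields readable German words, which suggests the glyph codes in this font are offset from the usual character codes rather than mapped at random.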

maxpmaxp commented 3 years ago

There is an issue in decoding strings written with the TT2 font. Everything that comes after /TT2 1 Tf is decoded improperly.
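To check which parts of a page are affected, one option is to split the raw text content at each font-selection operator. This is a rough sketch and not how pdfreader does it internally. `split_by_font` and the sample stream are hypothetical, and the regex is naive: it would also match a `Tf`-like sequence inside a string literal.

```python
import re

# Matches a font-selection operator such as "/TT2 1 Tf" (font name, size, Tf).
TF_PATTERN = re.compile(r"/(\S+)\s+([\d.]+)\s+Tf")


def split_by_font(content: str):
    """Return (font_name, segment) pairs for the text following each Tf operator."""
    matches = list(TF_PATTERN.finditer(content))
    segments = []
    for i, m in enumerate(matches):
        # A segment runs until the next Tf operator or the end of the content.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(content)
        segments.append((m.group(1), content[m.end():end].strip()))
    return segments


if __name__ == "__main__":
    # Hypothetical content stream, not taken from MTB0321.pdf.
    sample = "BT /TT1 1 Tf (Hello) Tj /TT2 1 Tf <0033> Tj ET"
    for font, segment in split_by_font(sample):
        print(font, "->", segment)
```

The segments that follow `TT2` are then the ones to compare against the expected text.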

maxpmaxp commented 3 years ago

It looks like this Type-0 font is a mess.

maxpmaxp commented 2 years ago

The font is broken, the glyphs have incorrect codes.