jrmuizel / pdf-extract

A Rust library for extracting content from PDFs

/ToUnicode spec violation #70

Open sftse opened 1 year ago

sftse commented 1 year ago

`get_unicode_map` includes a check that, if the value of the /ToUnicode key is a name, it must be /Identity-H. I have searched the PDF spec and checked a number of my own PDFs to see whether this is justified. A name value here seems to violate the standard. Is there an example of this occurring in practice?

jrmuizel commented 1 year ago

I added https://github.com/jrmuizel/pdf-extract/commit/db3a490ad047504e824ea8106f8c801e71e7c1b1 to make the intention here clearer.

sftse commented 1 year ago

From the PDF standard, p. 292: "This information can be provided as an optional ToUnicode entry in the font dictionary (PDF 1.2; see 9.10.3, "ToUnicode CMaps"), whose value shall be a stream object containing a special kind of CMap file that maps character codes to Unicode values."

To clarify my question: the PDF standard seems to mandate that this entry be a stream object. Is there an example where the entry was a name object?
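
To make the distinction concrete, here is a minimal sketch of the cases under discussion. It is not the actual code of `get_unicode_map`: `PdfObject`, `ToUnicode`, and `classify_to_unicode` are hypothetical names invented for illustration, and the object model is simplified to just the variants that matter here. The stream case is the one the quoted spec text describes; the /Identity-H name case is the lenient branch this issue asks about.

```rust
/// Hypothetical, simplified stand-in for a parsed PDF object.
/// A real parser has many more variants.
#[allow(dead_code)]
#[derive(Debug)]
enum PdfObject {
    Name(String),
    Stream(Vec<u8>),
    Other,
}

/// Hypothetical result of inspecting a font dictionary's /ToUnicode entry.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum ToUnicode {
    /// The entry is optional, so it may be missing.
    Absent,
    /// The case the quoted spec text describes: a stream holding a CMap.
    CMap(Vec<u8>),
    /// The non-stream case questioned in this issue: the name /Identity-H.
    IdentityH,
    /// Anything else is reported rather than silently accepted.
    Unsupported(String),
}

/// Classify the value of /ToUnicode (assumed to be already looked up
/// from the font dictionary by the caller).
#[allow(dead_code)]
fn classify_to_unicode(entry: Option<&PdfObject>) -> ToUnicode {
    match entry {
        None => ToUnicode::Absent,
        Some(PdfObject::Stream(bytes)) => ToUnicode::CMap(bytes.clone()),
        Some(PdfObject::Name(name)) if name.as_str() == "Identity-H" => ToUnicode::IdentityH,
        Some(PdfObject::Name(name)) => ToUnicode::Unsupported(format!("name /{}", name)),
        Some(other) => ToUnicode::Unsupported(format!("{:?}", other)),
    }
}
```

Under a strict reading of the quoted sentence, only the `Absent` and `CMap` outcomes would be conforming. Whether the `IdentityH` branch is ever reached by a real file is exactly the open question.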