jrmuizel / pdf-extract

A rust library for extracting content from pdfs

thread 'main' panicked at 'assertion failed: name == "Identity-H" #57

Open wingjson opened 1 year ago

wingjson commented 1 year ago

Hi, I'm reporting this panic.

jrmuizel commented 1 year ago

Can you provide an example PDF where this happens?

sftse commented 9 months ago

I've seen two ways in which this can fail. First, ASCII characters in names can be non-canonically encoded, e.g. "/Identity-H" written as "/Identity#2dH". Second, "/Identity-V" may be used instead of "/Identity-H". According to the standard, both variants should be treated identically to "/Identity-H".
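To illustrate the first failure mode: PDF name objects may escape any byte as `#` followed by two hex digits (PDF 32000-1, section 7.3.5), so a name has to be decoded before it is compared. Below is a minimal sketch of such a decoder. It is not pdf-extract's or lopdf's actual code; `hex_val` and `decode_pdf_name` are hypothetical helper names, and keeping malformed escapes verbatim is an assumption of this sketch.

```rust
/// Hypothetical helper: value of a single ASCII hex digit, if it is one.
fn hex_val(b: u8) -> Option<u8> {
    match b {
        b'0'..=b'9' => Some(b - b'0'),
        b'a'..=b'f' => Some(b - b'a' + 10),
        b'A'..=b'F' => Some(b - b'A' + 10),
        _ => None,
    }
}

/// Hypothetical helper: decode `#xx` escapes in a PDF name
/// (given without the leading slash).
/// Assumption: a `#` not followed by two hex digits is kept as-is;
/// a stricter parser might reject it instead.
fn decode_pdf_name(raw: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(raw.len());
    let mut i = 0;
    while i < raw.len() {
        // An escape needs two more bytes after the `#`.
        if raw[i] == b'#' && i + 2 < raw.len() {
            if let (Some(hi), Some(lo)) = (hex_val(raw[i + 1]), hex_val(raw[i + 2])) {
                out.push(hi * 16 + lo);
                i += 3;
                continue;
            }
        }
        out.push(raw[i]);
        i += 1;
    }
    out
}
```

With this, `decode_pdf_name(b"Identity#2dH")` yields the same bytes as `b"Identity-H"`, so a comparison made after decoding would accept the non-canonical spelling.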

jrmuizel commented 9 months ago

@sftse do you have a reference in the standard for that behaviour?

sftse commented 9 months ago

The standard mentions Identity-V five times, and the only information I can gather from those mentions is on p. 275, Table 118: "Vertical version of Identity-H. The mapping is the same as for Identity-H."

sftse commented 9 months ago

I was mistaken: the standard does spell out how to treat vertical writing differently. See Table 117, where some entries, such as /DW2, may only exist for vertical writing. For the purposes of text encoding specifically, it seems /Identity-V is treated the same as /Identity-H.
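One way to reflect this in code is to accept both names instead of asserting on "Identity-H", and to keep the writing mode as separate information. The sketch below assumes the name has already been decoded to a string. `WritingMode`, `identity_writing_mode`, and `identity_codes_to_cids` are hypothetical names for illustration, not part of pdf-extract's API.

```rust
/// Hypothetical type: writing mode implied by a predefined identity CMap.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum WritingMode {
    Horizontal,
    Vertical,
}

/// Returns the writing mode for the two identity CMaps,
/// or None for any other encoding name (instead of panicking).
fn identity_writing_mode(name: &str) -> Option<WritingMode> {
    match name {
        "Identity-H" => Some(WritingMode::Horizontal),
        "Identity-V" => Some(WritingMode::Vertical),
        _ => None,
    }
}

/// The identity mapping is the same for both names: each 2-byte code,
/// high-order byte first, is the CID itself.
/// Assumption: a trailing odd byte is silently ignored.
fn identity_codes_to_cids(bytes: &[u8]) -> Vec<u16> {
    bytes
        .chunks_exact(2)
        .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
        .collect()
}
```

Returning `Option` rather than asserting lets the caller decide how to handle unsupported encodings, while the code-to-CID step stays shared between the horizontal and vertical cases.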