pmaupin / pdfrw

pdfrw is a pure Python library that reads and writes PDFs

Unable to represent hexadecimal strings #128

Open josch opened 6 years ago

josch commented 6 years ago

Hi,

I have a PDF with an image object that contains:

/ColorSpace [ /Indexed /DeviceRGB 7 < 6b35d7 d63444 ac49a5 b5704d 5f8fb7 6ead75 5bd733 bbc535 > ]

It seems that pdfrw is unable to preserve the color palette. It treats the palette as a literal string, so a round trip through pdfrw outputs:

/ColorSpace [ /Indexed /DeviceRGB 7 (6b35d7 d63444 ac49a5 b5704d 5f8fb7 6ead75 5bd733 bbc535) ]

But this cannot be parsed by either mupdf or evince anymore.

josch commented 6 years ago

I now understand that PDF supports hexadecimal "strings". They are enclosed in < and > "brackets" and contain only hexadecimal digits. It seems that pdfrw doesn't support them yet. :frowning_face:
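
To make the format concrete, here is a minimal sketch in plain Python of how such a hexadecimal string maps to raw bytes. It does not use pdfrw at all; `decode_hex_string` is a hypothetical helper name, and the handling of whitespace and of an odd number of digits follows my reading of the PDF specification rather than anything pdfrw does.

```python
def decode_hex_string(token: str) -> bytes:
    """Decode a PDF hexadecimal string token such as '<6b35d7 d63444>'.

    Hypothetical helper for illustration only; not part of pdfrw.
    """
    token = token.strip()
    if not (token.startswith("<") and token.endswith(">")):
        raise ValueError("not a hexadecimal string token")
    # Whitespace between the hex digits carries no meaning, so drop it.
    digits = "".join(token[1:-1].split())
    # An odd number of digits is read as if a trailing 0 were present.
    if len(digits) % 2:
        digits += "0"
    return bytes.fromhex(digits)


# The palette from the /Indexed color space above: 8 entries of 3 bytes (RGB).
palette = decode_hex_string(
    "< 6b35d7 d63444 ac49a5 b5704d 5f8fb7 6ead75 5bd733 bbc535 >"
)
colors = [tuple(palette[i:i + 3]) for i in range(0, len(palette), 3)]
```

Here `colors` holds eight RGB triples, which matches the highest index 7 in the /Indexed array. Writing the same characters between ( and ) instead yields a literal string whose bytes are the ASCII hex digits themselves, not the palette bytes.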

josch commented 6 years ago

Specifically I wonder: how do I make pdfrw generate output like this:

/ColorSpace [ /Indexed /DeviceRGB 7 < 6b35d7 d63444 ac49a5 b5704d 5f8fb7 6ead75 5bd733 bbc535 > ]

The functions PdfString.from_bytes and PdfString.from_unicode do not seem adequate for this task. Maybe the PdfString class needs a from_integer member function that takes a list of integer values and transforms them into a PDF string according to the bytes_encoding argument?
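
As a rough sketch of what I have in mind, the following standalone function builds the hexadecimal token from a list of integers. The name `hex_string_from_integers` and the `group` parameter are made up for illustration. This is not existing pdfrw API, and it does not show how to get the resulting token into a pdfrw object tree.

```python
def hex_string_from_integers(values, group=3):
    """Serialize byte values (0..255) as a PDF hexadecimal string token.

    Hypothetical helper for illustration only; not part of pdfrw.
    `group` is the number of bytes per space-separated chunk (3 = one RGB entry).
    """
    if group < 1:
        raise ValueError("group must be at least 1")
    data = bytes(values)  # raises ValueError if a value is outside 0..255
    hexed = data.hex()
    step = 2 * group      # two hex digits per byte
    chunks = [hexed[i:i + step] for i in range(0, len(hexed), step)]
    return "<" + " ".join(chunks) + ">"


token = hex_string_from_integers([0x6B, 0x35, 0xD7, 0xD6, 0x34, 0x44])
```

With the two palette entries above, `token` is `<6b35d7 d63444>`. The grouping is purely cosmetic, since whitespace inside the brackets is ignored by readers.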