openpreserve / jhove

File validation and characterisation.
http://jhove.openpreservation.org

Incorrect String interpretation #785

Open jackdos opened 2 years ago

jackdos commented 2 years ago

I have a PDF (unfortunately unable to share) with the line:

/Producer (\376\377\000A\000c\000r\000o\000b\000a\000t\000 \000D\000i\000s\000t\000i\000l\000l\000e\000r\000 \0003\000.\0000\0001\000 \000f\000o\000r\000 \000W\000i\000n\000d\000o\000w\000s)

which should be read as "Acrobat Distiller 3.01 for Windows", but which actually gets read as "ぁっひはぢちぴ〠いどびぴぜ൩ぬぬづひ〠〳〮〰〱〠てはひ〠ぜൗどのつはぷび".

The string seems to be written with a mix of literal ASCII characters and octal escapes. \376\377 is the byte pair 0xFE 0xFF (i.e. the big-endian UTF-16 BOM), and it is followed by a series of byte pairs, each representing a 2-byte UTF-16 character whose first byte is an octal escape (\000 in all cases here) and whose second byte is a literal ASCII character.
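To make the expected reading concrete, here is a minimal Python sketch of the decoding described above: first resolve the backslash escapes to raw bytes, then check for the 0xFE 0xFF BOM and decode the remainder as UTF-16BE. This is not JHOVE's code. The function names `unescape_pdf_literal` and `decode_text_string` are hypothetical, and the Latin-1 fallback is only a stand-in assumption for PDFDocEncoding.

```python
# Hypothetical helpers for illustration only; not taken from JHOVE.
SIMPLE_ESCAPES = {
    ord("n"): 0x0A, ord("r"): 0x0D, ord("t"): 0x09,
    ord("b"): 0x08, ord("f"): 0x0C,
    ord("("): 0x28, ord(")"): 0x29, ord("\\"): 0x5C,
}
OCTAL_DIGITS = b"01234567"


def unescape_pdf_literal(body: bytes) -> bytes:
    """Resolve backslash escapes in the bytes found between the parentheses."""
    out = bytearray()
    i, n = 0, len(body)
    while i < n:
        b = body[i]
        if b != 0x5C:  # not a backslash: copy the byte through unchanged
            out.append(b)
            i += 1
            continue
        i += 1  # skip the backslash
        if i >= n:
            break
        b = body[i]
        if b in OCTAL_DIGITS:
            # An octal escape is one to three octal digits, so in "\0003"
            # the escape is "\000" and the "3" is a literal character.
            j = i
            while j < n and j - i < 3 and body[j] in OCTAL_DIGITS:
                j += 1
            out.append(int(body[i:j], 8) & 0xFF)
            i = j
        elif b in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[b])
            i += 1
        elif b in b"\r\n":
            # Backslash followed by end-of-line is a line continuation.
            i += 1
            if b == 0x0D and i < n and body[i] == 0x0A:
                i += 1
        else:
            out.append(b)  # unknown escape: drop the backslash, keep the byte
            i += 1
    return bytes(out)


def decode_text_string(raw: bytes) -> str:
    """Decode unescaped bytes, honouring a leading UTF-16BE BOM."""
    if raw.startswith(b"\xfe\xff"):
        return raw[2:].decode("utf-16-be")
    # Assumption: Latin-1 stands in for PDFDocEncoding in this sketch.
    return raw.decode("latin-1")


producer = (
    rb"\376\377\000A\000c\000r\000o\000b\000a\000t\000 "
    rb"\000D\000i\000s\000t\000i\000l\000l\000e\000r\000 "
    rb"\0003\000.\0000\0001\000 \000f\000o\000r\000 "
    rb"\000W\000i\000n\000d\000o\000w\000s"
)
print(decode_text_string(unescape_pdf_literal(producer)))
# prints: Acrobat Distiller 3.01 for Windows
```

The key point in the sketch is the ordering: escapes are resolved to bytes before any character decoding, so the BOM check sees the real 0xFE 0xFF bytes and each \000 supplies the high byte of its pair.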

carlwilson commented 2 years ago

Hi @jackdos, this looks like the PDF module is failing to read the BOM properly. I'll take a look to see if it's feasible to fix this for 1.28 and will report back by the end of October.