mfenniak / pyPdf

Pure-Python PDF Library; this repository is no longer maintained, please see https://github.com/knowah/PyPDF2/ instead.

Fail to read a text object #37

Open tdiwasa opened 12 years ago

tdiwasa commented 12 years ago

readStringFromStream() fails to create a string object when it is given a text object like the one below.

BT 1 0 0 1 0 1.9 Tm /F3+0 8.6 Tf 10.5 TL (\376\377 ) Tj T* ET

readStringFromStream() decodes (\376\377 ) to the string '\xfe\xff\x20'.

createStringObject() checks the first 2 bytes of the string, finds what looks like a UTF-16 BOM, and attempts to decode the string as UTF-16. An exception is then raised because the single trailing byte '\x20' is not valid UTF-16.
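The failure can be reproduced outside pyPdf. The following is a minimal sketch in Python 3 (pyPdf itself targeted Python 2). `decode_like_create_string_object` is a hypothetical stand-in that mimics only the BOM check described above; it is not the library's actual code, and the Latin-1 fallback is an assumption made to keep the example short.

```python
import codecs

def decode_like_create_string_object(raw: bytes) -> str:
    # Hypothetical stand-in: mimic only the BOM check described above.
    if raw.startswith(codecs.BOM_UTF16_BE):
        return raw.decode("utf-16")
    # Assumed fallback for this sketch only.
    return raw.decode("latin-1")

raw = b"\xfe\xff\x20"  # bytes produced from the literal (\376\377 )

try:
    decode_like_create_string_object(raw)
    failed = False
except UnicodeDecodeError:
    # The single byte after the BOM cannot form a 16-bit code unit.
    failed = True

print("decode failed:", failed)
```

An even number of bytes after the BOM, such as `b"\xfe\xff\x00\x20"`, decodes without error, which is why the problem only shows up with inputs like the one in this report.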

Apparently, the text "\376\377" here should not be treated as a BOM.

The BOM check conforms to the "Text Strings" section of the PDF Reference, but it should be applied only to items of the "text string" type specified there.
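One possible shape for such a restriction is sketched below in Python 3. The function name `create_string_object` and its `is_text_string` flag are hypothetical and are not part of pyPdf's API; the sketch only illustrates limiting the BOM check to "text string" items and leaving content-stream operands as raw bytes.

```python
import codecs

def create_string_object(raw: bytes, is_text_string: bool):
    """Hypothetical sketch: apply the BOM check only to 'text string' items."""
    if is_text_string and raw.startswith(codecs.BOM_UTF16_BE):
        # Skip the 2-byte BOM and decode the rest as big-endian UTF-16.
        return raw[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    # Operands of text-showing operators such as Tj are kept as raw bytes,
    # since their meaning depends on the font's encoding.
    return raw

# Operand of Tj from the content stream in this report: left untouched.
content_operand = create_string_object(b"\xfe\xff\x20", is_text_string=False)

# A "text string" item (an assumed example value): decoded as UTF-16.
title = create_string_object(b"\xfe\xff\x00A", is_text_string=True)
```

With this split, the caller that parses content streams would pass `is_text_string=False`, so the bytes from `(\376\377 )` never reach the UTF-16 decoder.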