openpreserve / jpylyzer

JP2 (JPEG 2000 Part 1) validator and properties extractor. Jpylyzer was specifically created to check that a JP2 file really conforms to the format's specifications. Additionally, jpylyzer can extract technical characteristics.
http://jpylyzer.openpreservation.org/

End-of-line characters removed from extracted XML #47

Closed (bitsgalore closed this issue 10 years ago)

bitsgalore commented 10 years ago

If extracted XML contains end-of-line characters, they are removed in jpylyzer's output. This happens here (byteconv.py):

import unicodedata

def removeControlCharacters(string):
    # Remove control characters from string: every character whose
    # Unicode general category starts with "C" is dropped
    # Source: http://stackoverflow.com/a/19016117/1209004
    return "".join(ch for ch in string if unicodedata.category(ch)[0] != "C")

I tried to fix this by changing the Unicode category code from C to Cc. However, in that case null characters aren't filtered out, even though they are part of the Cc category. See:

http://www.fileformat.info/info/unicode/category/Cc/list.htm

Might be a bug in unicodedata. For now I'll just leave it as it is.
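
A quick check of what unicodedata reports may help here. This is only a sketch, not a diagnosis: the helper names `categories` and `is_kept` are made up for illustration, and the changed code is not shown in this thread. The first part shows that null, tab, newline and carriage return all fall into Cc, so filtering on Cc alone would still drop end-of-line characters. The second part assumes the `[0]` index was left in place when "C" was replaced by "Cc"; under that assumption the one-letter prefix can never equal "Cc", so nothing gets filtered at all.

```python
import unicodedata

def categories(chars):
    # Map each character to its Unicode general category
    return {ch: unicodedata.category(ch) for ch in chars}

def is_kept(ch, code):
    # Hypothetical reconstruction of the filter test: compares only the
    # FIRST letter of the category against the given code
    return unicodedata.category(ch)[0] != code

cats = categories("\x00\t\n\r")

# With code "C" the null character is dropped; with code "Cc" the
# one-letter prefix "C" never equals "Cc", so every character is kept
null_kept_with_c = is_kept("\x00", "C")
null_kept_with_cc = is_kept("\x00", "Cc")
```

If that assumption holds, the behaviour would be explained by the comparison rather than by a bug in unicodedata.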

bitsgalore commented 10 years ago

Also affects this issue:

https://github.com/openplanets/jpylyzer/issues/47

bitsgalore commented 10 years ago

Fixed in 1.12.1: added an exception for tab, newline and carriage return (these control characters are permitted in all XML versions) in removeControlCharacters.
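
The fix described above could look roughly like the following. This is a minimal sketch of the idea, not the code that shipped in 1.12.1; the names `ALLOWED_CONTROLS` and `remove_control_characters` are assumptions chosen for illustration.

```python
import unicodedata

# Control characters kept in the output: tab, newline, carriage return
ALLOWED_CONTROLS = "\t\n\r"

def remove_control_characters(string):
    # Drop every character in a "C" category, except the allowed ones
    return "".join(
        ch for ch in string
        if ch in ALLOWED_CONTROLS or unicodedata.category(ch)[0] != "C"
    )

cleaned = remove_control_characters("<a>\n\t<b/>\x00</a>\r\n")
```

With this version the null character is still stripped, while the end-of-line characters and the tab survive.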