End-of-line characters removed from extracted XML

bitsgalore commented 10 years ago

If extracted XML contains end-of-line characters, they are removed in jpylyzer's output. This happens here (byteconv.py):

import unicodedata

def removeControlCharacters(string):
    # Remove control characters from string
    # Source: http://stackoverflow.com/a/19016117/1209004
    # Note: category "C" includes Cc, so "\n" and "\r" are dropped as well
    return "".join(ch for ch in string if unicodedata.category(ch)[0] != "C")

I tried to fix this by changing the Unicode category code from C to Cc. However, in that case null characters aren't filtered out, even though they are part of the Cc category. See:

http://www.fileformat.info/info/unicode/category/Cc/list.htm

Might be a bug in unicodedata. For now I'll just leave it as it is.
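
The behaviour described above can be checked with a short sketch. It is not taken from jpylyzer. It assumes, without confirmation from the thread, that the `[0]` index was left in place when "C" was changed to "Cc"; in that case the comparison can never match and nothing is filtered at all. The function names `remove_cc_with_index` and `remove_cc_full` are hypothetical.

```python
import unicodedata

# The null character and the end-of-line characters all report category "Cc".
for ch in ("\x00", "\t", "\n", "\r"):
    print(repr(ch), unicodedata.category(ch))

def remove_cc_with_index(string):
    # Assumption: the [0] index is still present. A one-character string
    # never equals "Cc", so the test is always True and nothing is removed.
    return "".join(ch for ch in string if unicodedata.category(ch)[0] != "Cc")

def remove_cc_full(string):
    # Comparing the full two-letter category does remove null characters,
    # but it still removes tab, newline and carriage return too.
    return "".join(ch for ch in string if unicodedata.category(ch) != "Cc")
```

If that assumption holds, the null characters survive because of the comparison rather than because of unicodedata. Either way, switching to Cc alone would not preserve end-of-line characters, since they belong to Cc as well.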

bitsgalore commented 10 years ago

Also affects this issue:

https://github.com/openplanets/jpylyzer/issues/47

bitsgalore commented 10 years ago

Fixed in 1.12.1: added an exception for tab, newline and carriage return (these control characters are permitted in all XML versions) in removeControlCharacters.
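
A minimal sketch of what such an exception could look like is given below. This is not the actual 1.12.1 code; the names `ALLOWED_CONTROL_CHARS` and `remove_control_characters` are hypothetical and chosen only for illustration.

```python
import unicodedata

# Control characters that are kept: tab, newline, carriage return.
# (Hypothetical constant name, not taken from jpylyzer.)
ALLOWED_CONTROL_CHARS = ("\t", "\n", "\r")

def remove_control_characters(string):
    # Keep a character if it is outside the "C" categories,
    # or if it is one of the explicitly allowed control characters.
    return "".join(
        ch for ch in string
        if unicodedata.category(ch)[0] != "C" or ch in ALLOWED_CONTROL_CHARS
    )
```

With this variant, null characters are still stripped while line breaks and indentation in the extracted XML are left intact.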
