openpreserve / jhove

File validation and characterisation.
http://jhove.openpreservation.org

encodeValue and encodeContent do not escape some invalid XML characters #877

Open karenhanson opened 11 months ago

karenhanson commented 11 months ago

I came across an issue where a message contained invalid characters (specifically 0xfffe and 0xffff). This made the XML output invalid when parsed, even though the messages go through Utils.encodeContent(String content), which should clean up the string for XML. This documentation indicates that the two characters are forbidden.
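
To illustrate the kind of check involved, here is a minimal sketch that drops every code point outside the XML 1.0 `Char` production, which excludes 0xfffe, 0xffff, and unpaired surrogates. The class and method names (`XmlCharFilter`, `stripInvalidXmlChars`) are hypothetical and this is not the actual JHOVE `Utils.encodeContent` code or the proposed PR, only an assumption about what such a filter could look like.

```java
public final class XmlCharFilter {
    private XmlCharFilter() {}

    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Hypothetical helper: removes code points that XML 1.0 forbids. */
    static String stripInvalidXmlChars(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            // codePointAt returns the lone surrogate itself when it is unpaired,
            // so unpaired surrogates fail the range check and are dropped.
            int cp = s.codePointAt(i);
            if (isValidXml10(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Iterating by code point rather than by `char` keeps valid surrogate pairs (supplementary characters) intact while still rejecting lone surrogates.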

I will do a PR for a proposed fix to address this problem, but wanted to log an issue to attach the fix to.

I also noted something else while troubleshooting:

The document linked above also lists "surrogates" as forbidden and says some characters are "discouraged though allowed." Of the discouraged characters, the JHOVE utility only removes one: 0x7f. I wondered whether there was a usable standard function for cleaning XML and looked at org.apache.commons.lang3.StringEscapeUtils.escapeXml10. It handles surrogates and escapes the discouraged characters (which are XML version specific), but it also encodes the quote, greater-than, and apostrophe characters in all cases, which may not be good for the readability of messages. I mention it because the code might be useful if it seems important to explore escaping further, or if we need to handle the Unicode surrogates.
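
As a rough sketch of a middle ground, the snippet below replaces only the discouraged characters with numeric character references and leaves quotes, greater-than, and apostrophes untouched. The ranges are the ones I understand the XML 1.0 (Fifth Edition) note in section 2.2 to list and should be checked against the specification; the names `XmlDiscouragedEscaper` and `escapeDiscouraged` are hypothetical, and the method assumes forbidden characters have already been removed.

```java
public final class XmlDiscouragedEscaper {
    private XmlDiscouragedEscaper() {}

    /** True for code points XML 1.0 allows but discourages (assumed ranges). */
    static boolean isDiscouraged(int cp) {
        return (cp >= 0x7F && cp <= 0x84)
            || (cp >= 0x86 && cp <= 0x9F)
            || (cp >= 0xFDD0 && cp <= 0xFDEF)
            // Last two code points of each supplementary plane: 0x1FFFE, 0x1FFFF, ...
            || (cp >= 0x1FFFE && (cp & 0xFFFE) == 0xFFFE);
    }

    /** Hypothetical helper: writes discouraged code points as &#x...; references. */
    static String escapeDiscouraged(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (isDiscouraged(cp)) {
                out.append("&#x")
                   .append(Integer.toHexString(cp).toUpperCase())
                   .append(';');
            } else {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Unlike escapeXml10, this leaves markup characters alone, so it would have to run alongside whatever escaping of `&` and `<` the existing encoder already does.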