I tried parsing some Excel files that contain newlines, and encountered some errors while parsing the file.
Input
If a cell in an Excel sheet contains a carriage return, that character is not stored literally in the XML; Excel stores it as _x000D_.
An underscore that would otherwise be read as the start of such an escape sequence is itself encoded as _x005F_.
This means that the literal text _x000D_ is encoded as _x005F_x000D_.
A document with a newline is properly parsed by the library.
I do have an Excel file (which I cannot share) that is parsed incorrectly. This might be because an older version of Excel created the file: as soon as I open and save it with my Excel version, it parses fine.
When a cell contains the literal string _x000D_ it is parsed as _x005F_x000D_.
Guess
I have yet to find a specific reason why this happens, but I found this which states that some unicode characters are not allowed in XML 1.0 and are therefore escaped as _xHHHH_. The underscore at the start of such a sequence is in turn escaped as _x005F_, which results in the entire string being represented as _x005F_x000D_.
This link tells us that CR is escaped and that the first underscore of an escaped representation can itself be escaped, which would lead to the representation _x005F_x000D_.
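To make the escaping rule concrete, here is a minimal Python sketch of how I understand it. This is not xlsxir code: the function names `decode_ooxml_escapes` and `encode_ooxml_escapes` are hypothetical, and the set of control characters that gets escaped on the way out is my assumption rather than something I verified against Excel.

```python
import re

# Matches one escape sequence of the form _xHHHH_ (four hex digits).
_ESCAPE = re.compile(r"_x([0-9A-Fa-f]{4})_")
# Matches an underscore that would otherwise start an escape sequence.
_UNDERSCORE_BEFORE_ESCAPE = re.compile(r"_(?=x[0-9A-Fa-f]{4}_)")
# Assumption: control characters other than tab and line feed get escaped.
_CONTROL = re.compile(r"[\x00-\x08\x0B-\x1F]")


def decode_ooxml_escapes(text: str) -> str:
    """Turn every _xHHHH_ sequence into the character it stands for.

    re.sub scans left to right without overlapping, so in _x005F_x000D_
    only the leading _x005F_ is consumed; the rest is left as plain text.
    """
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


def encode_ooxml_escapes(text: str) -> str:
    """Inverse of decode_ooxml_escapes under the assumptions above."""
    # Protect literal underscores first, so that the escapes produced
    # in the second step are not escaped a second time.
    protected = _UNDERSCORE_BEFORE_ESCAPE.sub("_x005F_", text)
    return _CONTROL.sub(lambda m: "_x%04X_" % ord(m.group()), protected)


print(repr(decode_ooxml_escapes("a_x000D_b")))      # 'a\rb'
print(repr(decode_ooxml_escapes("_x005F_x000D_")))  # '_x000D_'
print(repr(encode_ooxml_escapes("_x000D_")))        # '_x005F_x000D_'
```

If this reading is right, a cell holding the literal text _x000D_ should come back as _x000D_ after parsing, and getting _x005F_x000D_ back suggests that the _x005F_ step is not being undone.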
Proof
I have a test case in this commit that shows the behavior: https://github.com/m1dnight/xlsxir/commit/c335061467569e0cbe64643314d5e428563bd1df
I'm not sure, though, whether this is a bug in SAX or not.
Any ideas on how to proceed?