jsonkenl / xlsxir

Xlsx parser for the Elixir language.
MIT License

Wrong result when parsing escaped unicode characters #120

Open m1dnight opened 1 year ago

m1dnight commented 1 year ago

I tried parsing some Excel files that contain newlines, and encountered some errors while parsing the file.

Input

If a cell in an Excel sheet contains a newline (carriage return), that Unicode character is not allowed by the standard, so Excel stores it as _x000D_. An underscore is also not allowed, and is encoded as _x005F_. This means that a carriage return is encoded as _x005F_x000D_. A document with a newline is parsed properly by the library.

I do have an Excel file (that I cannot share) that is parsed incorrectly. This might be because an older version of Excel created the file: as soon as I open and save it with my Excel version, it works fine.

When a cell contains the literal string _x000D_ it is parsed as _x005F_x000D_.

Guess

I have yet to find a specific reason why this happens, but I found this, which states that some Unicode characters are not allowed in XML 1.0 and are therefore escaped as _xHHHH_. The underscore in the prefix is also escaped, as _x005F_, which results in the entire string being represented as _x005F_x000D_.

This link tells us that CR is escaped and that the first underscore of its escaped representation is also escaped, so a CR ends up represented as _x005F_x000D_.
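To make the escaping concrete, here is a minimal sketch of how such a string could be decoded in a single left-to-right pass. It assumes every escape has the form _xHHHH_ (four hex digits) and that _x005F_ simply stands for a literal underscore. The module name `XmlEscapes` and the function `decode/1` are hypothetical and not part of xlsxir, so this may not match what the library or the SAX parser actually does.

```elixir
defmodule XmlEscapes do
  # Hypothetical helper for illustration only; not part of xlsxir.
  # Replaces every _xHHHH_ sequence with the character it encodes.
  # Note: <<cp::utf8>> raises ArgumentError for surrogate code points
  # (D800..DFFF); a real implementation would need to handle those.
  def decode(string) when is_binary(string) do
    Regex.replace(~r/_x([0-9A-Fa-f]{4})_/, string, fn _match, hex ->
      <<String.to_integer(hex, 16)::utf8>>
    end)
  end
end

# An escaped carriage return becomes the actual character.
IO.inspect(XmlEscapes.decode("a_x000D_b"))

# An escaped underscore followed by x000D_ stays as literal text,
# because the pass does not rescan what it has already produced.
IO.inspect(XmlEscapes.decode("_x005F_x000D_"))
```

Under these assumptions, the first call yields a string containing a real carriage return, while the second yields the literal text _x000D_. If that reading of the escaping is right, the parser would need to decode both forms in one pass rather than handling them separately.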

Proof

I have a test case in this commit that shows the behavior: https://github.com/m1dnight/xlsxir/commit/c335061467569e0cbe64643314d5e428563bd1df

I'm not sure, though, whether this is a bug in SAX or not.

Any ideas on how to proceed?