natergj / excel4node

Node module to allow for easy Excel file creation
MIT License
1.38k stars 215 forks source link

Incorrect Unicode Handling causes XLSX files to be unreadable. #238

Closed mschnee closed 6 years ago

mschnee commented 6 years ago

Describe the bug A couple of weeks ago I submitted a PR for #75, which updated excel4node to accept XML 1.1-valid characters as cell content. This worked, mostly, until a customer of mine attempted to download XLSX reports that had input data from certain multi-code-point emoji (the skin-tone emoji), which resulted in Microsoft Excel being unable to open the resulting file, with the message:

We found a problem with some content in 'filename.xlsx.  Do you want us to try to recover as much as we can?

Further investigation and reading into the ECMA-376 specification, and the Microsoft Excel extensions to the Office Open XML SpreadsheetML File Format, revealed a problem with my PR: XLSX files are strictly XML 1.0 compliant, not 1.1.

I have a PR to go along with this issue, based on the documentation, which addresses the following:

To Reproduce Create a worksheet from a set of data that includes various multi-point unicode characters above U+FFFF, and other characters that are valid in XML 1.1 but not XML 1.0, and then attempt to open it in Excel / LibreOffice / etc.

Create a worksheet from a set of data that includes invalid control characterts such as \u000b (vertical tab).

Expected behavior The document should be valid, and open, and the characters should be presented correctly by the system in their encoded form.

Environment (please complete the following information):

Additional context Add any other context about the problem here. Log entries related to the issue are good things.

Update post-release for clarity

The list of XML valid characters is well known:

#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

That regex would look like /[\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD\u100000-\u10FFFF]/u, but that final character range is currently an invalid regular expression in Node 8

> '☕️'.match(/\u2615/u);
[ '☕', index: 0, input: '☕️' ]
> '😂'.match(/\u1f602/u);
null

I have updated stringSetter to use a much simplified removeInvalidXml function, which numerically tests for codepoints 65536-1114111, and then regex tests the rest of the valid range. I have also updated the tests, updated the sample, and ran it against my own customers' funky data.

Under the hood, xmlbuilder handles almost all the encoding and escaping properly: several control charactes such as vertical tab are legal in XML but not in the OfficeML specification, which is why the codepoint evaluation was added.

natergj commented 6 years ago

PR 239 has been released as part of version 1.7.0. Thanks, again @mschnee