Incorrect Unicode Handling causes XLSX files to be unreadable.

Describe the bug A couple of weeks ago I submitted a PR for #75, which updated excel4node to accept XML 1.1-valid characters as cell content. This worked, mostly, until a customer of mine attempted to download XLSX reports that had input data from certain multi-code-point emoji (the skin-tone emoji), which resulted in Microsoft Excel being unable to open the resulting file, with the message:

We found a problem with some content in 'filename.xlsx.  Do you want us to try to recover as much as we can?

Further investigation and reading into the ECMA-376 specification, and the Microsoft Excel extensions to the Office Open XML SpreadsheetML File Format, revealed a problem with my PR: XLSX files are strictly XML 1.0 compliant, not 1.1.

I have a PR to go along with this issue, based on the documentation, which addresses the following:

Greatly expanded the unit tests for cells containing unicode strings.
Added some additional tests to include XML control characters that should be escaped.
Changed Cell.stringSetter to evaluate all codepoints instead of using regular expressions on entire strings (faster and more reliable to maintain)

To Reproduce Create a worksheet from a set of data that includes various multi-point unicode characters above U+FFFF, and other characters that are valid in XML 1.1 but not XML 1.0, and then attempt to open it in Excel / LibreOffice / etc.

Create a worksheet from a set of data that includes invalid control characterts such as \u000b (vertical tab).

Expected behavior The document should be valid, and open, and the characters should be presented correctly by the system in their encoded form.

Environment (please complete the following information):

Mac OSX Mojave 10. 14
Node v8.11.2
excel4node Version 1.6.0

Additional context Add any other context about the problem here. Log entries related to the issue are good things.

Update post-release for clarity

The list of XML valid characters is well known:

#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

That regex would look like /[\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD\u100000-\u10FFFF]/u, but that final character range is currently an invalid regular expression in Node 8

> '☕️'.match(/\u2615/u);
[ '☕', index: 0, input: '☕️' ]
> '😂'.match(/\u1f602/u);
null

I have updated stringSetter to use a much simplified removeInvalidXml function, which numerically tests for codepoints 65536-1114111, and then regex tests the rest of the valid range. I have also updated the tests, updated the sample, and ran it against my own customers' funky data.

Under the hood, xmlbuilder handles almost all the encoding and escaping properly: several control charactes such as vertical tab are legal in XML but not in the OfficeML specification, which is why the codepoint evaluation was added.

natergj / excel4node

Incorrect Unicode Handling causes XLSX files to be unreadable. #238

Update post-release for clarity