String arrives with different characters in Firefox

mk-pmb commented 7 years ago

I wrote a Node.js program to generate HTML code. Its core:

console.log(['\uFEFF<!DOCTYPE html><html><body><p></p><script>',
  String(charCodes),
  'console.log(charCodes(' + jsStringify(animals) + '));',
  'console.log(' + charCodes(jsStringify(animals)) + ');',
  '</script></body></html>'].join('\n'));

I gisted the full source, also the generated HTML and what Firefox prints in its console when loading the HTML:

{ "dog": "[d83d][dc15]", "cow": "[d83d][dc04]", "halfbreed": "[fffd][fffd]" }
{ "dog": "[d83d][dc15]", "cow": "[d83d][dc04]", "halfbreed": "[dc15][d83d]" }

As you can see, in the first expression both halfbreed characters became U+FFFD. Probably some Unicode interpolation is messing with JavaScript's UCS-2 characters. Is this a bug? If it's a feature instead, it should be documented more prominently.
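
For readers who don't open the gist, here is a minimal sketch of what a helper like charCodes might do. It is an assumption based only on the output shown above (non-ASCII UTF-16 code units printed as [hhhh], ASCII left alone), not the gisted implementation.

```javascript
// Hypothetical stand-in for the gisted charCodes helper (assumption:
// it replaces every non-ASCII UTF-16 code unit with "[hhhh]").
function charCodes(text) {
  // No "u" flag, so the regex matches single code units and
  // surrogate halves are reported individually.
  return String(text).replace(/[^\x00-\x7F]/g, function (unit) {
    return '[' + unit.charCodeAt(0).toString(16).padStart(4, '0') + ']';
  });
}

console.log(charCodes('dog: \uD83D\uDC15'));
console.log(charCodes('halfbreed: \uDC15\uD83D'));
```

Because it is a plain function declaration, String(charCodes) yields its source text, which is how the generator above can inline it into the script tag.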

ForbesLindesay commented 7 years ago

You may need to add <meta charset="utf8"> to the head of your HTML file. I don't see working around that as being js-stringify's responsibility.

mk-pmb commented 7 years ago

Good idea! I added

console.log(document.characterSet);

to verify whether Firefox respects the BOM at the start of my HTML file, and it seems it does:

UTF-8
{ "dog": "[d83d][dc15]", "cow": "[d83d][dc04]", "halfbreed": "[fffd][fffd]" }
{ "dog": "[d83d][dc15]", "cow": "[d83d][dc04]", "halfbreed": "[dc15][d83d]" }

Still, I added to the HTML:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

Same output in Firefox's console. So I checked the script output itself (the cat is there to mute EPIPE):

$ nodejs 0_dogcow.js | cat | sed -nre 's~^.*eed\S+ ~~p' | hd
00000000  5c 22 ef bf bd ef bf bd  5c 22 20 7d 22 29 29 3b  |\"......\" }"));|
00000010  0a 5c 22 5b 64 63 31 35  5d 5b 64 38 33 64 5d 5c  |.\"[dc15][d83d]\|
[…]
$ surrog8-js $'\xEF\xBF\xBD'
\uFFFD

As you can see at the beginning, between the two "5c 22" (\"), there's "ef bf bd ef bf bd", so even before decoding we can see that it's the same character twice (3 bytes each). Decoding confirms that these are actually the characters Firefox claims them to be.
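
The same decoding check can be done in Node.js itself, without the surrog8-js CLI. A small sketch using only Buffer from Node core; the variable names are made up for illustration.

```javascript
// The six bytes found between the escaped quotes in the hexdump.
const bytes = Buffer.from('efbfbdefbfbd', 'hex');

// Decode them as UTF-8 and list the resulting code points.
const decoded = bytes.toString('utf8');
const codePoints = Array.from(decoded, function (ch) {
  return 'U+' + ch.codePointAt(0).toString(16).toUpperCase();
});

console.log(codePoints.join(' '));
```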

PS: For completeness, I'd normally have tested your original suggestion <meta charset="utf8"> as well, but I think that's pointless after the hexdump.

mk-pmb commented 7 years ago

I boiled the hex test case down so it can be tested without Firefox. New core:

// halves = low surrogate of the dog + high surrogate of the cow,
// i.e. two unpaired surrogates ("\uDC15\uD83D").
var dog = '\uD83D\uDC15', cow = '\uD83D\uDC04', halves = dog[1] + cow[0],
  jsStringify = require('js-stringify');
console.log('orig:  <' + halves + '>');
console.log('JSON:  ' + JSON.stringify(halves));
console.log('js-str:' + jsStringify(halves));

The hex dump shows that the lines differ in only one byte, 2nd column from the right (edit: yeah, the angle bracket), and that JSON.stringify has the same problem:

$ nodejs 3_dogcow_cli.js | hd
00000000  6f 72 69 67 3a 20 20 3c  ef bf bd ef bf bd 3e 0a  |orig:  <......>.|
00000010  4a 53 4f 4e 3a 20 20 22  ef bf bd ef bf bd 22 0a  |JSON:  "......".|
00000020  6a 73 2d 73 74 72 3a 22  ef bf bd ef bf bd 22 0a  |js-str:"......".|

Edit: False positive angle bracket. Gonna investigate more.

mk-pmb commented 7 years ago

Looks like Node.js's stdout cannot (edit: with default config) write those UCS-2 characters:

$ nodejs -p '"\uDC15\uD83D"' | hd
00000000  ef bf bd ef bf bd 0a                              |.......|
00000007
$ nodejs -e 'process.stdout.write("\uDC15\uD83D")' | hd
00000000  ef bf bd ef bf bd                                 |......|
00000006

… and I assume it's not a Node bug: those UCS-2 chars (unpaired surrogates) just cannot be represented in stdout's default encoding, UTF-8. So what if I change the encoding:

$ nodejs -e 'process.stdout.write("\uDC15\uD83D", "UCS-2");' | hd
00000000  15 dc 3d d8                                       |..=.|

That works! So I change it in the HTML generator:

html = ['\uFEFF<!DOCTYPE html><html><head>',
  // […]
  '</script></body></html>'].join('\n');
process.stdout.write(html, 'UCS-2');

Thanks to the BOM at the start of the HTML, Firefox now correctly detects UTF-16LE, ignores my meta tag, and… still reports "[fffd][fffd]". So even if I managed to reconfigure my whole web stack to use UCS-2 instead of UTF-8, I'd probably have no luck. Also, I think it would be overkill to convert all my templates, database replies etc. to UCS-2 when we have a lot of good infrastructure modules for UTF-8, all just to transmit one minor script tag correctly.
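
The difference between the two encodings can also be seen without piping stdout through hd. A minimal sketch, assuming Buffer.from encodes strings the same way the stdout writes above do; the variable names are illustrative only.

```javascript
// Two unpaired surrogates: low half of the dog, high half of the cow.
const halves = '\uDC15\uD83D';

// UTF-8 has no encoding for unpaired surrogates, so each one is
// replaced with U+FFFD (bytes ef bf bd).
const asUtf8 = Buffer.from(halves, 'utf8').toString('hex');

// UTF-16LE just stores the 16-bit code units, low byte first.
const asUtf16le = Buffer.from(halves, 'utf16le').toString('hex');

console.log('utf8:    ' + asUtf8);
console.log('utf16le: ' + asUtf16le);
```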

There has to be an easier way. How about I have js-stringify escape most data and then escape the non-UTF-8 chars myself?

console.log(['\uFEFF<!DOCTYPE html><html><head>',
  '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">',
  '</head><body><p></p><script>',
  String(charCodes),
  'console.log(document.characterSet);',

  'console.log(charCodes(' +
    require('surrog8').uHHHH(jsStringify(animals)) +
    '));',

  'console.log(' + charCodes(jsStringify(animals)) + ');',
  '</script></body></html>'].join('\n'));

Yeah! That one works! Firefox detects UTF-8 and both objects arrive correctly! … because the HTML says:

console.log(charCodes("{ \"dog\": \"\uD83D\uDC15\", \"cow\": \"\uD83D\uDC04\", \"halfbreed\": \"\uDC15\uD83D\" }"));
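
The escaping step itself is small. Below is a minimal sketch of the idea, not the actual surrog8 code: escapeLoneSurrogates is a hypothetical helper that rewrites every unpaired surrogate as a \uHHHH escape and leaves valid pairs alone. It is only meant for text that ends up inside a JavaScript string literal, such as js-stringify output, and it assumes a regex engine with lookbehind support.

```javascript
// Hypothetical helper (not part of js-stringify or surrog8):
// turn unpaired surrogates into \uHHHH escapes so the text
// survives UTF-8 encoding; valid surrogate pairs stay untouched.
function escapeLoneSurrogates(jsSource) {
  // No "u" flag on purpose: the regex must see single code units.
  // 1st alternative: high surrogate not followed by a low one.
  // 2nd alternative: low surrogate not preceded by a high one.
  const lone = /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/g;
  return jsSource.replace(lone, function (unit) {
    return '\\u' + unit.charCodeAt(0).toString(16).toUpperCase().padStart(4, '0');
  });
}

const escaped = escapeLoneSurrogates('"\uDC15\uD83D"');
console.log(escaped);

// After escaping, a UTF-8 round trip no longer changes the text.
const roundTripped = Buffer.from(escaped, 'utf8').toString('utf8');
console.log(roundTripped === escaped);
```

Engines that implement ES2019's well-formed JSON.stringify already emit such escapes from JSON.stringify itself, so on those the helper would find nothing left to escape in JSON.stringify output.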

So now we know how to fix the encoding, and we just have to clarify whether data integrity is part of js-stringify's "safely" claim:

Stringify an object so it can be safely inlined in JavaScript code

… or whether that's meant more as "secure" (defending against injection and similar), in which case I shall make a new module that combines that with my need for "verbatim".

mk-pmb commented 7 years ago

For the transition period, I made utf8safe-js-stringify. I still hope this will become the default, so that soon utf8safe-js-stringify can be just an alias for js-stringify.

pugjs / js-stringify, issue #4