nodejs / node

Node.js JavaScript runtime ✨🐢🚀✨
https://nodejs.org
Other
107.83k stars 29.72k forks source link

TextDecoder does not error incorrectly for legacy byte sequences #40091

Open domenic opened 3 years ago

domenic commented 3 years ago

Version

v16.9.1

Platform

Microsoft Windows NT 10.0.19043.0 x64

Subsystem

encoding

What steps will reproduce the bug?

Enter the following in the REPL:

new TextDecoder("Big5").decode(new Uint8Array([0x83, 0x5C])).charCodeAt(0).toString(16)

as well as

new TextDecoder("Big5").decode(new Uint8Array([0x83, 0x5C])).charCodeAt(1).toString(16)

How often does it reproduce? Is there a required condition?

Every time

What is the expected behavior?

fffd for the first, and 5c for the second (as in Firefox and Chrome, and per the WHATWG Encoding Standard)

What do you see instead?

f00e and NaN

Additional information

I suspect this has to do with you using ICU as-is, instead of properly patching it to match the Encoding Standard. There are probably more bugs like this.

@inexorabletash may be able to point to where in the Chromium source tree we keep our ICU encoding patches.

inexorabletash commented 3 years ago

In ICU, it's important to use the "HTML" tag when selecting encodings, which should select the web-compatible encodngs based on Encoding. That started off as Chromium-specific but was upstreamed.

For Chromium's remaining customization, https://source.chromium.org/chromium/chromium/src/+/main:third_party/icu/README.chromium is the right place to start, and it references a patch for gb18030/windows-936

I should note that Chrome differs from the Encoding standard for a few encodings (e.g. GBK / GB18030 have differences)

domenic commented 1 year ago

This problem remains on Node.js v21.1.0.

Deno 1.38.0 does not have this problem, at least for the minimal test case in the OP. I suspect Bun does not either since it uses WebKit's TextEncoder/TextDecoder implementation.

Relevant web platform tests are https://wpt.fyi/results/encoding?label=experimental&label=master&aligned .