tc39 / ecma262

Status, process, and documents for ECMA-262
https://tc39.es/ecma262/

`JSON.stringify` UTF-8 vs. UTF-16 #2387

Open jmm opened 3 years ago

jmm commented 3 years ago

Hello,

Since ES2019 the Introduction section says:

requiring that JSON.stringify return well-formed UTF-8 regardless of input

I don't think that's what it really means to say though, is it? I think it means to say that it returns well-formed UTF-16 (and as a result the content could be encoded as UTF-8)?

mathiasbynens commented 3 years ago

It should say that it returns well-formed Unicode strings.

jmm commented 3 years ago

Thanks for the feedback. Should the JSON.stringify section not say this then?:

The stringify function returns a String in UTF-16 encoded JSON format

jmdyck commented 3 years ago

(The intro sentence was added in 362cb10, presumably to summarize the effect of merging PR #1396.)

mathiasbynens commented 3 years ago

> Thanks for the feedback. Should the JSON.stringify section not say this then?:
>
> > The stringify function returns a String in UTF-16 encoded JSON format

I think so. There's nothing special about the encoding of the returned string — it’s just a JavaScript string, like other JavaScript strings. (And yes, JavaScript does treat strings kind of like UCS-2/UTF-16, but that's not special for JSON.stringify’s return values.) What’s special is that the returned string is guaranteed to be well-formed Unicode.
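To make the "well-formed" guarantee concrete, here is a minimal sketch, assuming an engine that implements the ES2019 well-formed `JSON.stringify` behavior (for example Node.js 12 or later). The `hasLoneSurrogate` helper is hypothetical and written only for this illustration; it is not part of any specification or library.

```javascript
// Assumes an ES2019+ engine (well-formed JSON.stringify).
const loneSurrogate = '\uD83D';        // leading surrogate with no trailing partner
const pair = '\uD83D\uDE00';           // a proper surrogate pair (U+1F600)

const stringifiedLone = JSON.stringify(loneSurrogate);
const stringifiedPair = JSON.stringify(pair);

// The lone surrogate is emitted as an ASCII escape sequence...
console.log(stringifiedLone);          // "\ud83d"
console.log(stringifiedLone.length);   // 8: two quotes plus \, u, d, 8, 3, d

// ...while a valid pair passes through unescaped.
console.log(stringifiedPair === '"' + pair + '"');   // true

// Hypothetical helper: true if the string contains an unpaired surrogate code unit.
function hasLoneSurrogate(s) {
  for (let i = 0; i < s.length; i++) {
    const cu = s.charCodeAt(i);
    if (cu >= 0xD800 && cu <= 0xDBFF) {
      const next = s.charCodeAt(i + 1);   // NaN past the end, so the check below fails
      if (next >= 0xDC00 && next <= 0xDFFF) { i++; continue; }
      return true;
    }
    if (cu >= 0xDC00 && cu <= 0xDFFF) return true;
  }
  return false;
}

console.log(hasLoneSurrogate(loneSurrogate));     // true: the input is ill-formed
console.log(hasLoneSurrogate(stringifiedLone));   // false: the output is well-formed
```

The returned value is still an ordinary JavaScript string; what changes is only that unpaired surrogates in the input no longer appear as raw code units in the output.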

jmm commented 3 years ago

Ok thanks. I'll preface this by acknowledging that I'm not tremendously well versed in this topic (though I'm significantly more informed than a few days ago, thanks in no small part to your "Well-formed JSON.stringify" proposal and "JavaScript’s internal character encoding" article).

Those are good points and I'm mostly in alignment with you. I've thought about it further and I think it probably does make sense to be more explicit about what it returns than "JSON string" though -- whether by referencing "UTF-16" or "well-formed Unicode".

"6.1.4 The String Type" seems a bit vague. It says:

[...] operations that further interpret String contents as sequences of Unicode code points encoded in UTF-16 must account for ill-formed subsequences. Such operations apply special treatment [...]

[...] A code unit that is a leading surrogate or trailing surrogate, but is not part of a surrogate pair, is interpreted as a code point with the same value.

So I don't read that as saying those operations will necessarily return well-formed UTF-16 / Unicode.
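A small sketch of that reading, using only standard String methods (`codePointAt`, string iteration, `String.fromCodePoint`). It suggests that code-point-aware operations carry an unpaired surrogate through as a code point of the same value instead of repairing it; the variable names are just for illustration.

```javascript
const s = 'a\uDC00b';   // 'a', a lone trailing surrogate, 'b'

// codePointAt interprets the unpaired surrogate as a code point with the same value.
console.log(s.codePointAt(1).toString(16));   // dc00

// String iteration walks code points; the lone surrogate comes out as its own element.
const points = [...s];
console.log(points.length);                   // 3
console.log(points[1] === '\uDC00');          // true

// Rebuilding the string from those code points reproduces the ill-formed input unchanged.
const rebuilt = String.fromCodePoint(...points.map((c) => c.codePointAt(0)));
console.log(rebuilt === s);                   // true
```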

On another note, "Well-formed JSON.stringify" says:

[...] consumers may still reject input that specifies strings including Unicode code points that are not scalar values [...], but those that accept it must have mechanisms for dealing with unpaired surrogates (as mentioned in the specification of JSON).

Referencing RFC 8259, which says:

The behavior of software that receives JSON texts containing [unpaired surrogates] is unpredictable; for example, implementations might return different values for the length of a string value or even suffer fatal runtime exceptions.

(The ES spec references ECMA-404, which doesn't seem to say anything like that.)

Taking both of those things into account, I now think the most useful thing to say would be something to this effect: JSON.stringify returns a UTF-16 encoded or well-formed Unicode string regardless of the presence of unpaired surrogates in the input. The JSON text it produces nevertheless still represents ill-formed Unicode text containing unpaired surrogates, so the results of parsing it (other than via JSON.parse) may be unpredictable.
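A rough sketch of both halves of that statement, assuming an ES2019+ engine and a global `TextEncoder` (part of the WHATWG Encoding API, not ECMAScript, but available in browsers and recent Node.js versions):

```javascript
const input = '\uD800';                  // unpaired leading surrogate
const json = JSON.stringify(input);      // the 8-character ASCII text "\ud800" (quotes included)

// Encoding the JSON text as UTF-8 loses nothing, because it contains no surrogates.
const utf8 = new TextEncoder().encode(json);
console.log(utf8.length);                // 8

// Encoding the raw string is lossy: the surrogate becomes U+FFFD (bytes EF BF BD).
const rawUtf8 = new TextEncoder().encode(input);
console.log(Array.from(rawUtf8));        // [ 239, 191, 189 ]

// The JSON text still *represents* the ill-formed string:
// JSON.parse turns the escape back into the unpaired surrogate.
const roundTripped = JSON.parse(json);
console.log(roundTripped === input);     // true
console.log(roundTripped.length);        // 1
```

How a non-ECMAScript JSON parser treats the `\ud800` escape is outside what this sketch can show; RFC 8259 only notes that such behavior is unpredictable.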

mathiasbynens commented 3 years ago

cc @gibson042