Open sbstp opened 9 years ago
I discussed this a couple of weeks ago on reddit. Servo's html parser had to deal with this, and they created a superset of UTF-8
to deal with it. Someone could write partial surrogate pairs and it's perfectly valid.
One way of fixing this would be to disable D's checking of UTF-16 strings, maybe by using plain arrays of u16
s.
I also checked what V8 and SpiderMonkey do, they treat it as two distinct entities.
// Chrome:
> '🂡'.split("")
["�", "�"]
// Firefox:
> '🂡'.split("")
Array [ "�", "�" ]
Possibly we can use plain wchar arrays. That won't solve the issue of how to display these invalid characters though.
I tried to find out, alert('🂡'.split("")[0])
crashed Chromium, and Firefox displays a broken character.
I think displaying a broken character is the best we can hope to do? I'm guessing we should avoid using wstring and use wchar when referring to raw strings from JS, and filter the strings before displaying them from D code, to replace broken characters with empty squares or something?
I'm not really sure what to do. It seems to be a gray area of text management. Obviously, having lone surrogates shouldn't be allowed, but the way JavaScript strings work (no char/wchar
type), we don't have a choice.
The best solution might be to avoid using D's strings altogether, and create a type like jsstring
. At some point though, the jsstring
has to be converted from UTF-16 (maybe with lone surrogates) to some encoding. We need something that makes it easy to convert a jsstring
for IO.
jsstring
is basically the same as wstring
, but it doesn't check for lone surrogates.
Just so I'm clear (you know more about unicode than I do), a lone surrogate is basically an extension to a char code, but missing the char code it should be attached to?
I guess the next question is, is it the creation of a wstring that chokes, or is it trying to write invalid UTF-16 data for output?
Yea, a surrogate pair is two wchar
that represent one code point. It's used for all code points over 2^16 in UTF-16. A lone surrogate is one element of the pair separated from the other, creating a malformed code point.
I just tried creating a lone surrogate without printing it, and no error occurred. So it seems to be in the IO layer.
Sorry that I haven't taken care of this yet. I've been very busy with all sorts of things that are more directly related to my thesis work. Would you care to take a stab at a fix? Bounce some more specific ideas for a fix? I think the main target for the fix should be the output. We should identify specifically which D function chokes on the invalid string.
This issue is caused by UTF-16. It encodes characters that are not on the BMP (first unicode plane) using two wide chars (surrogate pair), instead of one. Issues stem when a string is created containing a lone surrogate. For instance:
This can also escalate into a segfault, for example:
It also breaks on things like:
'🂡'.split("")
new String('🂡')