String issues with lone surrogate code units

sbstp commented 9 years ago

This issue is caused by UTF-16. It encodes characters that are not on the BMP (first unicode plane) using two wide chars (surrogate pair), instead of one. Issues stem when a string is created containing a lone surrogate. For instance:

> String.fromCodePoint(0x1F0A1)[0] // [1] too
std.utf.UTFException@/usr/include/dmd/phobos/std/utf.d(1260): surrogate UTF-16 high value past end of string
----------------
higgs(pure @safe void std.array.Appender!(immutable(char)[]).Appender.put!(immutable(wchar)[]).put(immutable(wchar)[])+0x6b) [0x56ebdb]
higgs(immutable(char)[] runtime.vm.ValuePair.toString()+0x659) [0x5f5479]
higgs(void repl.repl(runtime.vm.VM)+0x15a) [0x6907aa]
higgs(_Dmain+0x835) [0x69b39d]
higgs(_D2rt6dmain211_d_run_mainUiPPaPUAAaZiZ6runAllMFZ9__lambda1MFZv+0x1f) [0x6a1ed7]
higgs(void rt.dmain2._d_run_main(int, char**, extern (C) int function(char[][])*).tryExec(scope void delegate())+0x2a) [0x6a1e2a]
higgs(void rt.dmain2._d_run_main(int, char**, extern (C) int function(char[][])*).runAll()+0x30) [0x6a1e90]
higgs(void rt.dmain2._d_run_main(int, char**, extern (C) int function(char[][])*).tryExec(scope void delegate())+0x2a) [0x6a1e2a]
higgs(_d_run_main+0x1dc) [0x6a1da4]
higgs(main+0x17) [0x69b557]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f75c741aec5]

This can also escalate into a segfault, for example:

h> print('🂡'[0].toString())

Caught segmentation fault
IP=null
exiting

It also breaks on things like:

'🂡'.split("")
new String('🂡')

sbstp commented 9 years ago

I discussed this a couple of weeks ago on reddit. Servo's html parser had to deal with this, and they created a superset of UTF-8 to deal with it. Someone could write partial surrogate pairs and it's perfectly valid.

One way of fixing this would be to disable D's checking of UTF-16 strings, maybe by using plain arrays of u16s.

sbstp commented 9 years ago

I also checked what V8 and SpiderMonkey do, they treat it as two distinct entities.

// Chrome:
> '🂡'.split("")
["�", "�"]
// Firefox:
> '🂡'.split("")
Array [ "�", "�" ]

maximecb commented 9 years ago

Possibly we can use plain wchar arrays. That won't solve the issue of how to display these invalid characters though.

sbstp commented 9 years ago

I tried to find out, alert('🂡'.split("")[0]) crashed Chromium, and Firefox displays a broken character.

maximecb commented 9 years ago

I think displaying a broken character is the best we can hope to do? I'm guessing we should avoid using wstring and use wchar when referring to raw strings from JS, and filter the strings before displaying them from D code, to replace broken characters with empty squares or something?

sbstp commented 9 years ago

I'm not really sure what to do. It seems to be a gray area of text management. Obviously, having lone surrogates shouldn't be allowed, but the way JavaScript strings work (no char/wchar type), we don't have a choice.

The best solution might be to avoid using D's strings altogether, and create a type like jsstring. At some point though, the jsstring has to be converted from UTF-16 (maybe with lone surrogates) to some encoding. We need something that makes it easy to convert a jsstring for IO.

jsstring is basically the same as wstring, but it doesn't check for lone surrogates.

maximecb commented 9 years ago

Just so I'm clear (you know more about unicode than I do), a lone surrogate is basically an extension to a char code, but missing the char code it should be attached to?

I guess the next question is, is it the creation of a wstring that chokes, or is it trying to write invalid UTF-16 data for output?

sbstp commented 9 years ago

Yea, a surrogate pair is two wchar that represent one code point. It's used for all code points over 2^16 in UTF-16. A lone surrogate is one element of the pair separated from the other, creating a malformed code point.

I just tried creating a lone surrogate without printing it, and no error occurred. So it seems to be in the IO layer.

maximecb commented 9 years ago

Sorry that I haven't taken care of this yet. I've been very busy with all sorts of things that are more directly related to my thesis work. Would you care to take a stab at a fix? Bounce some more specific ideas for a fix? I think the main target for the fix should be the output. We should identify specifically which D function chokes on the invalid string.

higgsjs / Higgs

String issues with lone surrogate code units #184