svaarala / duktape

Duktape - embeddable Javascript engine with a focus on portability and compact footprint
MIT License
5.96k stars 516 forks source link

Considering using a WTF-8 (like) representation for strings #1696

Open svaarala opened 7 years ago

svaarala commented 7 years ago

See #1695.

Experiment for string intern changes: #1698.

svaarala commented 7 years ago

The "as if" behavior for script code can be divided into two parts:

For creating new strings, one simple initial step is to take advantage of the fact that all candidate strings go through string intern processing: no string enters the string table without being interned. So it should suffice to do surrogate pair combination initially during string interning. The main downsides compared to current behavior are:

Internal call sites could use a variant where the byte array input to string interning would be known to be a scratch buffer. If so, one can combine surrogates inline because the combined codepoints encode to smaller space than the inputs. Most internal call sites should be like this.

Short strings could use a small stack buffer for transcoding to avoid memory churn; most strings are relatively short.

fatcerberus commented 7 years ago

For strings which originate internally, could you just skip the surrogate combination logic since the string is then known to be in canonical form (i.e. "trusted")?

svaarala commented 7 years ago

Sure, if one combines two 1kB strings that are in canonical form, one only needs to check for unjoined surrogates at the join point.

But like I said above, doing it in string interning would be a simple first step because it would be obviously correct and cover all cases automatically. Optimizations are another matter.

One main question is what a minimum implementation would be, so that if this change was made, how much work would be needed to have the external behavior in the new form. This is an important question because the intermediate steps are not functional (= not releaseable) and the longer that transition is, the more awkward the process. Also having an externally functional implementation would provide a good basis for figuring out if the change is worth the changes.

svaarala commented 7 years ago

So considering some optimization possibilities, it'd probably be good to integrate the normalization process into string interning (which is always a required step when creating new strings). But call sites could provide some additional information for the process to allow optimization:

String interning is also a good place to assert that all strings conform to whatever policies are in place, because all strings are ultimately interned before being referencable.

svaarala commented 7 years ago

@fatcerberus So just to clarify re: your question:

For strings which originate internally, could you just skip the surrogate combination logic since the string is then known to be in canonical form (i.e. "trusted")?

Even if the string is in canonical form, it may contain unpaired surrogates that get paired later on when the string is e.g. combined with another string in canonical form.

For example, using the experiment in #1698 (ignore the ugly formatting of non-BMP codepoints):

// String is already in canonical form: no unpaired surrogates, so interned as is.
duk> x = '\udc00foo\ud800'
= "\udc00foo\ud800"

// When combining with itself, U+D800 + U+DC00 combine to U+10000 in the middle,
// even if inputs are already normalized.
duk> x + x
= "\udc00foo\U00010000foo\ud800"

// Similarly some other internal operations like .repeat() can cause characters
// to combine.
duk> x.repeat(3)
= "\udc00foo\U00010000foo\U00010000foo\ud800"

A similar example, input string 'x' is in canonical form because a high-high pair doesn't combine. However, a string replace operation makes the pair valid and it must be combined:

duk> x = '\ud800\ud801'
= "\ud800\ud801"
duk> y = x.replace('\ud801', '\udc01')
= "\U00010001"

There are quite many optimizations one can take to minimize surrogate pair checks. Considering .repeat() for example: a surrogate pair check must be done for every join point but anything in the middle can be assumed to be already in combined form. There are actually a variety of approaches for .repeat() (similarly for every individual operation):

So, quite a lot of alternatives, especially if one wants to avoid the surrogate pair checks whenever possible.