Considering using a WTF-8 (like) representation for strings

svaarala commented 7 years ago

See #1695.

Experiment for string intern changes: #1698.

svaarala commented 7 years ago

The "as if" behavior for script code can be divided into two parts:

Behavior when creating new strings
Behavior when reading existing strings

For creating new strings, one simple initial step is to take advantage of the fact that all candidate strings go through string intern processing: no string enters the string table without being interned. So it should suffice to do surrogate pair combination initially during string interning. The main downsides compared to current behavior are:

Currently the candidate byte sequence is interned as is, or found to already exist as is in the string table. With surrogate pair combination, one would need to potentially first transcode the string into a separate buffer (or in-place if the input buffer was a scratch) before being able to do an intern check.
Currently duk_hstring charlen is lazy which saves some time for string temporaries whose .length is never read. This would still be possible, but would be less effective, because the input would need to be scanned for surrogates anyway during intern processing. Maybe that scan would be a bit faster than length computation, but not by very much.

Internal call sites could use a variant where the byte array input to string interning would be known to be a scratch buffer. If so, one can combine surrogates inline because the combined codepoints encode to smaller space than the inputs. Most internal call sites should be like this.

Short strings could use a small stack buffer for transcoding to avoid memory churn; most strings are relatively short.

fatcerberus commented 7 years ago

For strings which originate internally, could you just skip the surrogate combination logic since the string is then known to be in canonical form (i.e. "trusted")?

svaarala commented 7 years ago

Sure, if one combines two 1kB strings that are in canonical form, one only needs to check for unjoined surrogates at the join point.

But like I said above, doing it in string interning would be a simple first step because it would be obviously correct and cover all cases automatically. Optimizations are another matter.

One main question is what a minimum implementation would be, so that if this change was made, how much work would be needed to have the external behavior in the new form. This is an important question because the intermediate steps are not functional (= not releaseable) and the longer that transition is, the more awkward the process. Also having an externally functional implementation would provide a good basis for figuring out if the change is worth the changes.

svaarala commented 7 years ago

So considering some optimization possibilities, it'd probably be good to integrate the normalization process into string interning (which is always a required step when creating new strings). But call sites could provide some additional information for the process to allow optimization:

Naive call site: doesn't know if surrogate pair combination is needed, so interning can do it. This would be the initial step so that all call sites would work, but require some unnecessary processing.
Aware call site: has combined surrogate pairs in a context specific manner (e.g. for join, checks the join point), so interning can just assert for it without transcoding. Also result character length may be known by the caller based on the input string character length and the result of the local normalization process.
The call site can also indicate whether the source buffer is a temporary that can be written over. If so, the normalization (if made by string interning) could overwrite the input buffer directly. This always works because normalized surrogate pairs encode to fewer bytes than the encoded surrogate pair.

String interning is also a good place to assert that all strings conform to whatever policies are in place, because all strings are ultimately interned before being referencable.

svaarala commented 7 years ago

@fatcerberus So just to clarify re: your question:

For strings which originate internally, could you just skip the surrogate combination logic since the string is then known to be in canonical form (i.e. "trusted")?

Even if the string is in canonical form, it may contain unpaired surrogates that get paired later on when the string is e.g. combined with another string in canonical form.

For example, using the experiment in #1698 (ignore the ugly formatting of non-BMP codepoints):

// String is already in canonical form: no unpaired surrogates, so interned as is.
duk> x = '\udc00foo\ud800'
= "\udc00foo\ud800"

// When combining with itself, U+D800 + U+DC00 combine to U+10000 in the middle,
// even if inputs are already normalized.
duk> x + x
= "\udc00foo\U00010000foo\ud800"

// Similarly some other internal operations like .repeat() can cause characters
// to combine.
duk> x.repeat(3)
= "\udc00foo\U00010000foo\U00010000foo\ud800"

A similar example, input string 'x' is in canonical form because a high-high pair doesn't combine. However, a string replace operation makes the pair valid and it must be combined:

duk> x = '\ud800\ud801'
= "\ud800\ud801"
duk> y = x.replace('\ud801', '\udc01')
= "\U00010001"

There are quite many optimizations one can take to minimize surrogate pair checks. Considering .repeat() for example: a surrogate pair check must be done for every join point but anything in the middle can be assumed to be already in combined form. There are actually a variety of approaches for .repeat() (similarly for every individual operation):

Naive: just copy the strings and apply a surrogate pair check for the result. This is what #1698 does; and it does so automatically for all call sites creating strings.
Check for any surrogates getting combined for every copy of the string appended, but only at the join point. Reduces checks a lot if the repeated input is long.
Check for any surrogates in the join point, but instead of appending one copy of the input at every step, use the output produced so far and append 1, 2, 4, 8, etc copies of the input. The surrogate check is only needed for each join point, and e.g. when appending previously processed 8 copies, only one surrogate check is needed because the internal join points are already processed.
Any of the above, but before doing anything, check if the start and the end of the input have surrogate pairs that can be combined. If not, a naive approach will produce a correct output without any surrogate checks. The surrogate can also be paired before the copying loop so that the output will be: HIGH INPUT' S INPUT' S ... S INPUT' LOW where HIGH,LOW are the surrogates, S is the combined UTF-8 sequence, and INPUT' is the original input with the initial HIGH and trailing LOW surrogate codepoint removed.

So, quite a lot of alternatives, especially if one wants to avoid the surrogate pair checks whenever possible.

svaarala / duktape

Considering using a WTF-8 (like) representation for strings #1696