tc39 / proposal-regexp-unicode-sequence-properties

Proposal to add support for sequence properties in Unicode property escapes to ECMAScript regular expressions.
https://github.com/tc39/proposal-regexp-set-notation

ECMAScript proposal: support properties of strings (a.k.a. “sequence properties”) in Unicode property escapes

Status

This proposal is at stage 2 of the TC39 process.

Note that the RegExp v flag proposal subsumes this proposal – and also adds set notation & string literals to character classes.

Terminology

This proposal initially used the term “sequence properties”, but that is a misnomer. A sequence of characters is a string, and a string property is one whose values (the codomain) are strings, just like a binary property is one whose values are binary true/false (that is, whether the property applies or does not apply).

Unicode has since formalized this, using “property of code points” vs. “property of strings” for the domain of a property. See https://www.unicode.org/reports/tr18/#domain_of_properties.

Also, we mostly use “character” and “code point” interchangeably. More formally, “character” refers to assigned code points, but properties have values for all code points. (Most properties map all unassigned code points to one default value.)

Motivation

The Unicode Standard assigns various properties and property values to every character/code point. For example, the Unicode Character Database provides data for determining exactly the set of characters whose Script property value is Greek.

Unicode property escapes enable JavaScript developers to access these Unicode character properties natively in ECMAScript regular expressions.

const regexGreek = /\p{Script=Greek}/u;
regexGreek.test('π');
// → true

The Unicode properties and values that are currently supported in Unicode property escapes have something in common: they all expand to a set of code points. Such escapes can be transpiled as a character class containing the code points they match individually. For example, \p{ASCII_Hex_Digit} is equivalent to [0-9A-Fa-f]: it only ever matches a single Unicode character/code point at a time.
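The equivalence claimed above can be checked by brute force. The following is a minimal sketch that compares the property escape with the hand-written character class over every code point; the names `viaProperty`, `viaClass`, and `countMismatches` are made up for this illustration.

```javascript
// Compare \p{ASCII_Hex_Digit} with the hand-written class [0-9A-Fa-f]
// over every Unicode code point (U+0000 through U+10FFFF).
const viaProperty = /^\p{ASCII_Hex_Digit}$/u;
const viaClass = /^[0-9A-Fa-f]$/u;

function countMismatches() {
  let mismatches = 0;
  for (let cp = 0; cp <= 0x10FFFF; cp++) {
    const ch = String.fromCodePoint(cp);
    // Count every code point on which the two patterns disagree.
    if (viaProperty.test(ch) !== viaClass.test(ch)) mismatches++;
  }
  return mismatches;
}

const mismatches = countMismatches();
console.log(mismatches); // → 0
```

Both patterns only ever consume one code point per match, which is what makes the transpilation to a plain character class possible.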

However, the Unicode Standard also defines several properties of strings. In regular expressions, such properties translate to a set of alternatives. To illustrate this, imagine a Unicode property that applies to the strings 'a', 'b', 'c', 'W', 'xy', and 'xyz'. This property translates to either of the following regular expression patterns (using alternation): xyz|xy|a|b|c|W or xyz|xy|[a-cW]. (Longest strings first, so that a prefix like 'xy' does not hide a longer string like 'xyz'.) Note that, unlike existing Unicode property escapes, this pattern can match multi-character strings.
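To see why the ordering matters, here is a small sketch that builds the alternation for the imaginary property above and compares it with a shortest-first ordering. The helper names `escapeForRegExp` and `toAlternation` are hypothetical and not part of the proposal.

```javascript
// The imaginary property from the text applies to these strings.
const strings = ['a', 'b', 'c', 'W', 'xy', 'xyz'];

// Escape regex syntax characters so each string is matched literally.
const escapeForRegExp = (s) => s.replace(/[\\^$.*+?()[\]{}|]/g, '\\$&');

// Build an alternation with the longest strings first
// (length is measured in UTF-16 code units, which suffices here).
function toAlternation(list) {
  return [...list]
    .sort((x, y) => y.length - x.length)
    .map(escapeForRegExp)
    .join('|');
}

const longestFirst = new RegExp(toAlternation(strings), 'u');
const shortestFirst = new RegExp('xy|xyz', 'u');

console.log(toAlternation(strings)); // → 'xyz|xy|a|b|c|W'
console.log(longestFirst.exec('xyz')[0]); // → 'xyz'
console.log(shortestFirst.exec('xyz')[0]); // → 'xy' (the prefix hides 'xyz')
```

Alternation in ECMAScript regular expressions tries the alternatives left to right and takes the first one that succeeds, so the prefix 'xy' wins whenever it is listed before 'xyz'.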

Hand-written regular expressions for these properties suffer from the same issues that Unicode property escapes solve: they’re hard to write or maintain manually, they tend to be large, and they’re unreadable.

Proposed solution

We propose the addition of several properties of strings to the existing Unicode property escapes syntax.

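With this feature, a hand-written alternation like the one above could be replaced by a single property escape. For example, a pattern matching RGI emoji ZWJ sequences could be written as: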

const re = /\p{RGI_Emoji_ZWJ_Sequence}/u;
re.test('👨🏾‍⚕️'); // '\u{1F468}\u{1F3FE}\u200D\u2695\uFE0F'
// → true
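As the Status section notes, the RegExp v flag proposal subsumes this one, so in practice the same property is reached through the `v` flag rather than `u`. The sketch below assumes an engine that implements the `v` flag and falls back to `null` otherwise; `compileZwjSequencePattern`, `reZwj`, and `healthWorker` are illustrative names only.

```javascript
// Assumption: the engine implements the RegExp `v` flag.
// `new RegExp` is used instead of a literal so that older engines
// reach the catch block rather than failing to parse the whole file.
function compileZwjSequencePattern() {
  try {
    return new RegExp('^\\p{RGI_Emoji_ZWJ_Sequence}$', 'v');
  } catch (error) {
    return null; // the `v` flag or the property is not supported here
  }
}

const reZwj = compileZwjSequencePattern();
const healthWorker = '\u{1F468}\u{1F3FE}\u200D\u2695\uFE0F';

if (reZwj !== null) {
  console.log(reZwj.test(healthWorker)); // the whole ZWJ sequence
  console.log(reZwj.test('\u{1F468}')); // a single code point, not a ZWJ sequence
} else {
  console.log('Properties of strings are not available in this engine.');
}
```

Compiling through the constructor keeps the check usable as feature detection, at the cost of doubling the backslashes in the pattern source.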

We propose to support the following Unicode properties of strings defined in UTS #18 and UTS #51:

Each of these properties of strings expands to a finite, well-defined set of strings. (Basic_Emoji also applies to many single characters.)

Over time, we can choose to support additional properties of strings, following the upstream Unicode Standard.

High-level API

Re-using the existing Unicode property escapes syntax for this new functionality seems appropriate:

\p{PropertyName}

Where PropertyName can be one of the properties of strings listed above.

The complement of such a property is not supported: both \P{PropertyName} and [^…\p{PropertyName}…] throw an early SyntaxError exception if PropertyName is a property of strings.

We have thought of possible definitions of such a complement, but we believe that they are not generally useful.

Some of the use cases for “not a property of strings” can be supported via a negative lookahead: /(?!\p{RGI_Emoji_Flag_Sequence})\p{Symbol}/u.
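The lookahead technique can be tried today with a stand-in for the property. The sketch below replaces \p{RGI_Emoji_Flag_Sequence} with "any two regional indicator symbols", which is an assumption made for illustration and an over-approximation, since only some pairs are RGI flags. The names `FLAG_LIKE`, `reSymbolNotFlag`, and `reUnanchored` are hypothetical.

```javascript
// Stand-in for \p{RGI_Emoji_Flag_Sequence}: any two regional indicator
// symbols (U+1F1E6..U+1F1FF). This over-approximates the real property.
const FLAG_LIKE = '[\\u{1F1E6}-\\u{1F1FF}]{2}';

// "A symbol that does not start a flag-like sequence", anchored at the start.
const reSymbolNotFlag = new RegExp(`^(?!${FLAG_LIKE})\\p{Symbol}`, 'u');

// The same pattern without the anchor, for comparison.
const reUnanchored = new RegExp(`(?!${FLAG_LIKE})\\p{Symbol}`, 'u');

const flagLike = '\u{1F1E9}\u{1F1EA}'; // two regional indicators

console.log(reSymbolNotFlag.test('€')); // → true (currency symbol)
console.log(reSymbolNotFlag.test(flagLike)); // → false (starts a flag-like pair)
console.log(reSymbolNotFlag.test('\u{1F1E9}')); // → true (lone regional indicator)
console.log(reUnanchored.test(flagLike)); // → true (matches the second indicator)
```

The last line shows a caveat of the lookahead approach: without an anchor, the search simply moves on and matches the second regional indicator, which is itself a symbol that is not followed by another indicator.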

Note: Using a property of strings inside a character class is equivalent to an alternation of all of its strings and characters, constructed so that the order in which the elements are written is irrelevant (e.g., by listing the strings longest-first). (This could be optimized by retaining a character class of the single characters, as illustrated in the Motivation section above.)

FAQ

What about backwards compatibility?

Unicode property escapes for unsupported Unicode properties throw an early SyntaxError. As such, we can add support for new properties in a backwards-compatible way, as long as we re-use the existing syntax.
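This behavior is easy to observe. In the minimal sketch below, `compileOutcome` is a hypothetical helper and `Not_A_Real_Property` is a made-up property name. Because the pattern is compiled through the constructor, the SyntaxError surfaces at run time rather than as an early error.

```javascript
// Try to compile a pattern and report how it went.
function compileOutcome(source, flags) {
  try {
    new RegExp(source, flags);
    return 'ok';
  } catch (error) {
    return error.name;
  }
}

// Not_A_Real_Property is a made-up name used only for this illustration.
console.log(compileOutcome('\\p{Script=Greek}', 'u')); // → 'ok'
console.log(compileOutcome('\\p{Not_A_Real_Property}', 'u')); // → 'SyntaxError'
```

Since no existing valid pattern can contain an unsupported property name, giving such a name a meaning later cannot change the behavior of code that already runs.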

Properties of strings within character classes

Currently, each property escape and character class expands to a set of code points, equivalent to an alternation of single characters. With this proposal, a property escape and character class expands to a set of strings, equivalent to an alternation of strings. In most cases, most or all of those strings will still be single-character strings.

For example: [\p{Emoji_Keycap_Sequence}\p{Symbol}] = #⃣|*⃣|0⃣|1⃣|…|9⃣|[\$+<->\^…℻⅀-⅄⅊-⅍…]

Why re-use \p{…} and not introduce new syntax?

Introducing new syntax comes at a cost for JavaScript developers. In this case, we assert that the cost of adding new syntax for this functionality outweighs the benefits.

New syntax could be used for properties of strings. However, such new syntax should also allow for properties of code points, so that, when a Unicode property no longer applies to multi-character strings in a later Unicode version, existing regular expressions remain valid.

Developers would therefore be expected to know which properties apply, or at one point applied, to strings; in practice it would be easier for them to switch to the new syntax for all properties.

Regular expressions can be validated by a parser using information about which property applies to strings vs. only single characters, without need for a new escape.

The mental model is: \p{…} refers to a Unicode property. It matches the elements of the property’s domain for which its value is true. This proposal doesn’t change that. It’s reasonable to assume that developers opting in to the use of properties of strings know what to expect.

Illustrative examples

Matching emoji sequences

With this proposal, the set of RGI (“recommended for general interchange”) emoji (characters and sequences!) can be trivially represented as a RegExp pattern in JavaScript:

const reRgiEmoji = /\p{RGI_Emoji}/u;

An equivalent regular expression without the use of property escapes is ~7 kB in size. With property escapes, but without support for properties of strings, the size is still ~4.5 kB. The regular expression above, which uses a property of strings, takes up 16 bytes.

Matching hashtags

Many applications (such as Twitter) use extended hashtags that allow for emoji characters. Unicode® Standard Annex #31 defines Extended Hashtag Identifier Syntax (UAX31-R8) as matching:

// From UAX #31, not in JavaScript syntax.
/[#﹟＃][\p{XID_Continue}\p{Extended_Pictographic}\p{Emoji_Component}[-+_]-[#﹟＃]]+/

The above pattern matches emoji, but also syntactically invalid emoji as well as emoji that are not recommended for general interchange. With the proposed feature, however, matching hashtags with only valid and recommended emoji becomes feasible:

const reHashtag = /[#﹟＃][[\p{XID_Continue}\p{RGI_Emoji}[-+_]]--[#﹟＃]]+/u;

An equivalent regular expression without the use of property escapes is ~12 kB in size. With property escapes, but without support for properties of strings, the size is still ~3 kB. The regular expression above, which uses a property of strings, takes up 62 bytes.
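The byte counts quoted for the two short patterns can be reproduced by measuring their UTF-8 encoded source text. This sketch assumes the three number signs in the hashtag pattern are U+0023, U+FE5F (small number sign), and U+FF03 (fullwidth number sign), as in UAX #31; `utf8ByteLength` and the other names are illustrative.

```javascript
const encoder = new TextEncoder();
const utf8ByteLength = (s) => encoder.encode(s).length;

// The two non-ASCII number signs, written as escapes to keep them unambiguous.
const SMALL_NUMBER_SIGN = '\uFE5F';
const FULLWIDTH_NUMBER_SIGN = '\uFF03';
const signs = `#${SMALL_NUMBER_SIGN}${FULLWIDTH_NUMBER_SIGN}`;

// String.raw keeps the backslashes of \p{…} exactly as they appear in source.
const emojiSource = String.raw`/\p{RGI_Emoji}/u`;
const hashtagSource =
  String.raw`/[${signs}][[\p{XID_Continue}\p{RGI_Emoji}[-+_]]--[${signs}]]+/u`;

console.log(utf8ByteLength(emojiSource)); // → 16
console.log(utf8ByteLength(hashtagSource)); // → 62
```

The hashtag pattern is 54 characters long; the four non-ASCII number signs take three bytes each in UTF-8, which accounts for the remaining 8 bytes.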

Related UTC proposals

TC39 meeting notes

Specification

Implementations