srl295 / es-unicode-properties

Unicode properties in ES
https://srl295.github.io/es-unicode-properties/
MIT License

Unicode Character Properties in ECMAScript

Spec: https://srl295.github.io/es-unicode-properties/

Proposal for stage 0

This proposal adds a function that returns encoded Unicode character properties for a code point.

Applications can use it to directly answer questions such as “What kind of script is 𞤘?”, “Is ġ lowercase?”, or “What is the numeric value of ५?”.

For feature implementers, it is a required building block for a wide array of higher-level features, such as number parsing, segmentation, regular expressions, and much more.

Definitions

A property (for this purpose) is a string (including enumerated types), a number, or a boolean.

Examples

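| CP | Name | Long Name | Value | Long Value | Comments |
|----|------|-----------|-------|------------|----------|
| и | Gc | General_Category | Ll | Lowercase_Letter | Enumeration |
| 𞤘 | sc | Script | Adlm | Adlam | Enumeration |
| ġ | Lower | Lowercase | true | true | Boolean |
| ५ | nv | Numeric_Value | 5 | 5 | Number |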

API Brainstorm For Discussion

Note: see Issues for further discussion.

"и".getUnicodeProperty("Gc", {type: "short"}) // "Ll"
"и".getUnicodeProperty("Gc", {type: "long"}) // "Lowercase_Letter"
"и".getUnicodeProperty("Gc") // "Lowercase_Letter"  - type:long is default
"и".getUnicodeProperty("General_Category") // "Lowercase_Letter" - "Gc" ≈ "General_Category"
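For comparison, the General_Category lookups above can be roughly approximated today with RegExp property escapes, by testing the character against every category value in turn. The following is only a sketch: `getGeneralCategory` and the `GENERAL_CATEGORIES` table are hypothetical names introduced here for illustration and are not part of the proposal.

```javascript
// Hypothetical helper (not the proposed API): emulates
// "и".getUnicodeProperty("Gc", {type}) using existing \p{gc=…} support.
// Short General_Category values mapped to their long names.
const GENERAL_CATEGORIES = {
  Lu: "Uppercase_Letter", Ll: "Lowercase_Letter", Lt: "Titlecase_Letter",
  Lm: "Modifier_Letter", Lo: "Other_Letter",
  Mn: "Nonspacing_Mark", Mc: "Spacing_Mark", Me: "Enclosing_Mark",
  Nd: "Decimal_Number", Nl: "Letter_Number", No: "Other_Number",
  Pc: "Connector_Punctuation", Pd: "Dash_Punctuation",
  Ps: "Open_Punctuation", Pe: "Close_Punctuation",
  Pi: "Initial_Punctuation", Pf: "Final_Punctuation", Po: "Other_Punctuation",
  Sm: "Math_Symbol", Sc: "Currency_Symbol",
  Sk: "Modifier_Symbol", So: "Other_Symbol",
  Zs: "Space_Separator", Zl: "Line_Separator", Zp: "Paragraph_Separator",
  Cc: "Control", Cf: "Format", Cs: "Surrogate",
  Co: "Private_Use", Cn: "Unassigned",
};

// Compile one anchored regex per category value, once.
const GC_TESTERS = Object.entries(GENERAL_CATEGORIES).map(([short, long]) => ({
  short,
  long,
  re: new RegExp(`^\\p{gc=${short}}$`, "u"),
}));

function getGeneralCategory(str, { type = "long" } = {}) {
  // Only the first code point of the string is examined.
  const ch = String.fromCodePoint(str.codePointAt(0));
  // Every code point has exactly one General_Category, so one regex matches.
  const hit = GC_TESTERS.find(({ re }) => re.test(ch));
  return type === "short" ? hit.short : hit.long;
}

console.log(getGeneralCategory("и", { type: "short" })); // "Ll"
console.log(getGeneralCategory("и"));                    // "Lowercase_Letter"
```

This takes up to 30 regex tests per lookup, and the result depends on the Unicode version the engine ships with.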

FAQ

Why should this be in ECMAScript?

Data Size, Complexity, Performance, Updates.

As of Unicode 13, there are nearly 150_000 characters encoded across the 1_114_112 code points of the Unicode codespace (U+0000 to U+10FFFF, which takes 21 bits to address). There are over 80 character properties. Storing and accessing this data in an efficient and up-to-date way is not trivial. However, any conformant implementation, especially one which includes Unicode regular expressions, already has all of this data, available via implementations such as ICU.

Why not just use RegEx?

In a way, getting a property is the inverse of Unicode Regular Expressions.

/\p{gc=Lowercase_Letter}/u.test('и')
// implies:
"и".getUnicodeProperty("Gc") === 'Lowercase_Letter'

If all that is needed is matching, certainly a regex could be used, especially for a boolean operation.

/\p{Lower}/u.test('e') === "e".getUnicodeProperty("Lower") // both true 
/\p{Lower}/u.test('E') === "E".getUnicodeProperty("Lower") // both false

However, for classifying (as in segmentation) or analyzing (as in number parsing), this becomes unwieldy.

     if(/\p{NumericValue=0}/u.test('٢')) { digit = 0; } // false
else if(/\p{NumericValue=1}/u.test('٢')) { digit = 1; } // false
else if(/\p{NumericValue=2}/u.test('٢')) { digit = 2; } // true
else if(/\p{NumericValue=3}/u.test('٢')) { digit = 3; } // false
…

// vs:
digit = '٢'.getUnicodeProperty('nv') // 2

This could be used to convert ١٢٣٬٤٥٦ into the Number 123456 (٬ is U+066C ARABIC THOUSANDS SEPARATOR).

(this property was not supported by the JS engine I tested.)
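Without a numeric-value property, an application has to carry its own digit data. The sketch below illustrates that workaround under stated assumptions: `DIGIT_ZEROS`, `digitValue`, and `parseDigits` are hypothetical names, the table covers only four decimal-digit ranges (ASCII, Arabic-Indic, Extended Arabic-Indic, Devanagari), and treating U+066C and `,` as grouping separators is a choice made for this example.

```javascript
// Hypothetical stand-in for ch.getUnicodeProperty("nv"), limited to a few
// hard-coded decimal-digit ranges. Each entry is the code point of a zero
// digit; the digits 1..9 of that script follow it contiguously.
const DIGIT_ZEROS = [
  0x0030, // DIGIT ZERO
  0x0660, // ARABIC-INDIC DIGIT ZERO
  0x06f0, // EXTENDED ARABIC-INDIC DIGIT ZERO
  0x0966, // DEVANAGARI DIGIT ZERO
];

function digitValue(ch) {
  const cp = ch.codePointAt(0);
  for (const zero of DIGIT_ZEROS) {
    if (cp >= zero && cp <= zero + 9) return cp - zero;
  }
  return undefined; // not a digit known to this small table
}

// Parses an integer written with any of the digits above.
// Assumption: U+066C and "," are grouping separators and are skipped.
function parseDigits(text) {
  let value = 0;
  let sawDigit = false;
  for (const ch of text) {
    const d = digitValue(ch);
    if (d !== undefined) {
      value = value * 10 + d;
      sawDigit = true;
    } else if (ch !== "\u066C" && ch !== ",") {
      return NaN; // unexpected character
    }
  }
  return sawDigit ? value : NaN;
}

console.log(digitValue("٢"));         // 2
console.log(parseDigits("١٢٣٬٤٥٦")); // 123456
```

Every additional script means another hand-maintained table entry, which is the data-size and update burden a built-in property lookup would remove.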

Similarly, sentence segmentation needs to calculate the Sentence_Break property value for each character:

     if(/\p{Sentence_Break=Extend}/u .test('q')) { … } // false
else if(/\p{Sentence_Break=Lower}/u  .test('q')) { … } // true
else if(/\p{Sentence_Break=OLetter}/u.test('q')) { … } // false
else if(/\p{Sentence_Break=STerm}/u  .test('q')) { … } // false
…
// vs:
'q'.getUnicodeProperty('Sentence_Break') // 'Lower'

(this property was not actually supported by the JS engine I tested.)

(For performance reasons, an application may want to get the properties of every code point in a string in a single call, rather than making multiple calls. See the issues for discussion.)
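As a rough illustration of such a per-string lookup, the sketch below classifies each code point of a string by Script, using the `\p{Script=…}` escapes that engines already support. `SCRIPTS` and `scriptsOf` are hypothetical names, and the script list is deliberately tiny; a real implementation would need every script value.

```javascript
// Hypothetical per-string lookup: returns one Script name per code point.
// Only the scripts listed here are recognized (an assumption of this sketch).
const SCRIPTS = ["Latin", "Cyrillic", "Arabic", "Devanagari", "Adlam"];

const SCRIPT_TESTERS = SCRIPTS.map((name) => ({
  name,
  re: new RegExp(`^\\p{Script=${name}}$`, "u"),
}));

function scriptsOf(text) {
  // Array.from iterates a string by code point, so a supplementary
  // character such as 𞤘 (U+1E918) is handled as a single unit.
  return Array.from(text, (ch) => {
    const hit = SCRIPT_TESTERS.find(({ re }) => re.test(ch));
    return hit ? hit.name : undefined; // undefined: not in the short list
  });
}

console.log(scriptsOf("q𞤘и")); // [ 'Latin', 'Adlam', 'Cyrillic' ]
```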

History

https://github.com/tc39/ecma402/issues/90