Spec: https://srl295.github.io/es-unicode-properties/
Allows for a function to return Encoded Character Properties for a code point.
For applications, they can directly answer questions such as “What kind of script is 𞤘?”, “Is ġ lowercase? ”, or “What is the numeric value of ५?”.
For feature implementers, this is a required building block for implementing a wide array of higher level features, such as number parsing, segmentation, regular expressions, and much more.
A property (for this purpose) is string (including enumerated types), a number, or a boolean.
и
(U+0438) ICU UBrowse𞤘
(U+1E918) ICU UBrowseġ
(U+0121)५
(U+096B)CP | Name | Long Name | Value | Long Value | Comments |
---|---|---|---|---|---|
и |
Gc |
General_Category |
Ll |
Lowercase_Letter |
Enumeration |
𞤘 |
sc |
Script |
Adlm |
Adlam |
Enumeration |
ġ |
Lower |
Lowercase |
true | true | Boolean |
५ |
nv |
General_Category |
5 | 5 | Number |
Note: see Issues for further discussion.
"и".getUnicodeProperty("Gc", {type: "short"}) // "Ll"
"и".getUnicodeProperty("Gc", {type: "long"}) // "Lowercase_Letter"
"и".getUnicodeProperty("Gc") // "Lowercase_Letter" - type:long is default
"и".getUnicodeProperty("General_Category") // "Lowercase_Letter" - "Gc" ≈ "General_Category"
Data Size, Complexity, Performance, Updates.
As of Unicode 13, there are nearly 150_000 characters encoded across the 2_097_152 available in the 21 bit encoding space. There are over 80 character properties. Storing and accessing this data in an efficient and up to date way is not trivial. However, any conformant implementation, especially one which includes Unicode regular expressions, already has all of this data, available via implementations such as ICU.
In a way, getting a property is the inverse of Unicode Regular Expressions.
/\p{gc=Lowercase_Letter}/u.test('и')
// implies:
"и".getUnicodeProperty("Gc") === 'Lowercase_Letter'
If all that is needed is matching, certainly a regex could be used, especially for a boolean operation.
/\p{Lower}/u.test('e') === "e".getUnicodeProperty("Lower") // both true
/\p{Lower}/u.test('E') === "E".getUnicodeProperty("Lower") // both false
However, for classifying (as in segmentation) or analyzing (as in number parsing), this becomes unwieldy.
١٢٣٬٤٥٦
” into numeric form: if(/\p{NumericValue=0}/u.test('٢')) { digit = 0; } // false
else if(/\p{NumericValue=1}/u.test('٢')) { digit = 1; } // false
else if(/\p{NumericValue=2}/u.test('٢')) { digit = 2; } // true
else if(/\p{NumericValue=4}/u.test('٢')) { digit = 3; } // false
…
// vs:
digit = '٢'.getUnicodeProperty('nv') // 2
This could be used to convert ١٢٣٬٤٥٦
into Number(123.456)
(this property was not supported by the JS engine I tested.)
Need to calculate the Sentence_Break
property value for each character:
if(/\p{Sentence_Break=Extend}/u .test('q')) { … } // false
else if(/\p{Sentence_Break=Lower}/u .test('q')) { … } // true
else if(/\p{Sentence_Break=OLetter}/u.test('q')) { … } // false
else if(/\p{Sentence_Break=STerm}/u .test('q')) { … } // false
…
// vs:
'q'.getUnicodeProperty('Sentence_Break') // 'Lower'
(this property was not actually supported by the JS engine I tested.)
(For performance reasons, an application may actually want to get the properties of each codepoint in a string, and not need to make multiple calls. See the issues for discussion.)