tc39 / ecma402

Status, process, and documents for ECMA 402
https://tc39.es/ecma402/

Unicode Properties #90

Open srl295 opened 8 years ago

srl295 commented 8 years ago

https://github.com/srl295/es-unicode-properties

srl295 commented 8 years ago

cc @mathiasbynens

littledan commented 8 years ago

Something that's been discussed is exposing these to RegExps. V8 does this currently behind a special flag, thanks to @hashseed's work. I don't know if a spec is written but I heard @goyakin may work on exposing properties through RegExps.

hashseed commented 8 years ago

As @littledan mentioned, this is an experimental feature in V8. The comment in the regexp parser describes the current syntax we are using:

  // Parse the property class as follows:
  // - \pN with a single-character N is equivalent to \p{N}
  // - In \p{name}, 'name' is interpreted
  //   - either as a general category property value name.
  //   - or as a binary property name.
  // - In \p{name=value}, 'name' is interpreted as an enumerated property name,
  //   and 'value' is interpreted as one of the available property value names.
  // - Aliases in PropertyAlias.txt and PropertyValueAlias.txt can be used.
  // - Loose matching is not applied.

For example /\p{East_Asian_Width=H}/u.test("\u20a9") // true

\P is the inverse of \p, so binary properties with "False" as property value can be expressed via \P. For example:

/\p{ASCII_Hex_Digit}/u.test("A") // true
/\P{ASCII_Hex_Digit}/u.test("A") // false
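For readers trying this today, here is a minimal sketch using the property-escape syntax that later shipped in ES2018. It is not the experimental flag behavior described above. As far as I know, the standardized version accepts only General_Category, Script, Script_Extensions and binary properties, so the East_Asian_Width example is left out.

```javascript
// Binary property: \p matches, \P is the inverse.
const isHexDigit = /^\p{ASCII_Hex_Digit}$/u;
const notHexDigit = /^\P{ASCII_Hex_Digit}$/u;
console.log(isHexDigit.test("A"));   // true
console.log(notHexDigit.test("A"));  // false
console.log(notHexDigit.test("G"));  // true

// General category value: long name and short alias.
console.log(/^\p{Lowercase_Letter}$/u.test("a")); // true
console.log(/^\p{Ll}$/u.test("a"));               // true

// name=value form for an enumerated property, long and short aliases.
console.log(/^\p{Script=Cyrillic}$/u.test("и"));  // true
console.log(/^\p{sc=Cyrl}$/u.test("и"));          // true
```

Note that the u flag is required; without it, \p is treated as an escaped letter p.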

mathiasbynens commented 8 years ago

For the record, the V8 flag @littledan mentioned is --harmony_regexp_property. Tests that show how the current implementation works: https://chromium.googlesource.com/v8/v8/+/master/test/mjsunit/harmony/regexp-property-exact-match.js


Is full compatibility with existing \p implementations a hard requirement? If I were implementing \p{…} in ES I explicitly wouldn’t support Is/In prefixes, shorthands, loose matching, property aliases, property value aliases, or whitespace around = / :. E.g. throw on /\p{In_Cyrillic_Sup}/u, /\p{Block=Cyrillic_Sup}/u and /\p{Block=Cyrillic_Supplementary}/u and only accept /\p{Block=Cyrillic_Supplement}/u which is the canonical block name. We have the opportunity to be strict here and encourage readable code; let’s do it.


Some related info: https://github.com/mathiasbynens/es-unicode-regexp-proposal/issues/2

@goyakin Can we track your spec work somewhere (GitHub)?

hashseed commented 8 years ago

We considered doing loose matching and having an "In" prefix for blocks. But having thought about it, we decided against both. Looking at Perl, it seems to be a good idea to be strict rather than overly ambiguous. Your example would be /\p{Block=Cyrillic_Supplement}/u or /\p{blk=Cyrillic_Sup}/u. The reason to make the property name explicit is that there is ambiguity between Script and Block property value names. And honestly, stating it explicitly really should not hurt anyone.

mathiasbynens commented 8 years ago

@hashseed Agreed; that is what the discussion I referenced concluded as well. I’ve now updated my example (originally intended to explain how aliases should throw only) to avoid confusion.

Note that in your example you’re still doing a form of loose matching, i.e. ignoring _. (The canonical block name is Cyrillic Supplement and not Cyrillic_Supplement.)

hashseed commented 8 years ago

I thought the underscore is actually part of the name. That's what PropertyAliases.txt and PropertyValueAliases.txt as well as ICU suggest.

mathiasbynens commented 8 years ago

As far as I can see, only PropertyValueAliases.txt suggests it. Blocks.txt has the block name with spaces instead of underscores. I’ve asked for clarification here: http://www.unicode.org/mail-arch/unicode-ml/y2016-m05/thread.html#79

mathiasbynens commented 8 years ago

I hope to present this as a stage 0 strawman at a future TC39 meeting.

After implementing support for \p{…} and \P{…} in my regular expression transpiler https://github.com/mathiasbynens/regexpu-core (online demo), I’ve started to work on a concrete spec proposal. Here’s an early draft: https://github.com/mathiasbynens/ecma262/pull/1 Feedback welcome.

hashseed commented 8 years ago

Thanks for following up on this, Mathias!

Having followed the unicode mail thread, I think I can get behind the idea of considering whitespace, hyphens and underscores as equivalent, when looking up property names and property value names including their aliases.

E.g. \p{Lowercase Letter} would be allowed just as well as \p{Lowercase-Letter} and \p{Ll}, but not \p{Lower case Letter}.

This would solve the conflict between Blocks.txt and PropertyValueAliases.txt.
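A minimal sketch of that rule, assuming it means "space, hyphen and underscore are interchangeable separators, but case is kept and separators are never added or dropped". The helper names are hypothetical and not part of any proposal.

```javascript
// Hypothetical helper: map space, hyphen and underscore to one separator.
// Case is preserved and no separator is removed or inserted.
function separatorKey(name) {
  return name.replace(/[ \-_]/g, "_");
}

// Two names are equivalent when their separator-normalized forms are equal.
function sameName(a, b) {
  return separatorKey(a) === separatorKey(b);
}

console.log(sameName("Lowercase Letter", "Lowercase_Letter"));  // true
console.log(sameName("Lowercase-Letter", "Lowercase_Letter"));  // true
console.log(sameName("Lower case Letter", "Lowercase_Letter")); // false
// Case still matters under this rule:
console.log(sameName("Superscripts and Subscripts", "Superscripts_And_Subscripts")); // false
```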

mathiasbynens commented 8 years ago

@hashseed There is another issue though: e.g. Blocks.txt has Superscripts and Subscripts, whereas PropertyValueAliases.txt has Superscripts_And_Subscripts, which is the canonical property value. Note the difference in casing of the letter a. To support \p{Block=Superscripts and Subscripts} in addition to \p{Block=Superscripts_And_Subscripts} we need case-insensitivity as well.

Would you be open to that, or would you rather stick to strict matching in that case?

srl295 commented 8 years ago

@mathiasbynens — thanks for your work on this. What's puzzling to me is why Blocks.txt is even being looked at here. It's for display names, not programmatic use. PropertyValueAliases.txt is the right place to find property value aliases — just as the response on the mailing list said.

I'm a definite -1 on leniency to match Blocks.txt — that's not what it's for. We should just match PropertyValueAliases.txt

mathiasbynens commented 8 years ago

@srl295 It may not be what it’s for, but it would be a direct consequence of following http://unicode.org/reports/tr18/#RL1.2 which specifies that “matching of […] values must follow the Matching Rules from UAX44”, specifically http://unicode.org/reports/tr44/#Matching_Symbolic. (As stated, I’d be fine with not following that, and implementing strict matching instead — just explaining the reasoning here.)

What's puzzling to me is why Blocks.txt is even being looked at here. It's for display names, not programmatic use.

Yeah, that’s what I didn’t know when I started the thread. I’d be willing to bet that there are other developers wishing to use \p{…} in regexps that don’t know about this. Blocks.txt doesn’t seem like an illogical place to go looking for the proper block names, IMHO. Those devs would be surprised to find that \p{Block=Superscripts and Subscripts} doesn’t work. It doesn’t help that Blocks.txt also includes this:

# Note:   When comparing block names, casing, whitespace, hyphens,
#         and underbars are ignored.
#         For example, "Latin Extended-A" and "latin extended a" are equivalent.
#         For more information on the comparison of property values, 
#            see UAX #44: http://www.unicode.org/reports/tr44/

This is a problem that can be solved through proper developer documentation, of course. But taking all of it into consideration, I’m leaning towards supporting @hashseed’s suggestion + case-insensitivity.

srl295 commented 8 years ago

@mathiasbynens UAX44-LM2 is of course a great reason to, quote, Ignore case, whitespace, underscore ('_'), and all medial hyphens except the hyphen in U+1180 HANGUL JUNGSEONG O-E, unquote. So I'm +1 on that.
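For illustration, a simplified sketch of loose matching in the spirit of the rule quoted above. Assumptions: it drops all ASCII hyphens rather than only medial ones, and it does not handle the U+1180 exception. The function names are hypothetical.

```javascript
// Simplified loose-matching key: ignore case, whitespace, underscores and
// (as a simplification) every ASCII hyphen, not only medial ones.
function looseKey(name) {
  return name.toLowerCase().replace(/[\s_-]/g, "");
}

function looseEquals(a, b) {
  return looseKey(a) === looseKey(b);
}

console.log(looseEquals("Latin Extended-A", "latin extended a"));                       // true
console.log(looseEquals("Superscripts and Subscripts", "Superscripts_And_Subscripts")); // true
```

This is the kind of leniency that would make the Blocks.txt spellings and the PropertyValueAliases.txt spellings compare equal.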

\p{Block=Superscripts and Subscripts} doesn't work

But, it should work (and does in ICU), because of UAX44-LM2. Are there any names in Blocks.txt that wouldn't match PropertyValueAliases.txt given the leniency?

It does seem that both the Blocks.txt comment and UAX44 could be improved for some more clarity — discussing PropertyValueAliases.txt

mathiasbynens commented 8 years ago

@srl295 Have you seen https://github.com/mathiasbynens/ecma262/pull/1#discussion_r65918515? It was the context for the above discussion.

But, it should work — because of UAX44-LM2.

Sure — if we decide to follow that. My initial spec draft included a variant of loose matching per UAX44-LM2 (minus non-ASCII hyphens and the is prefix) but we later decided to use strict matching instead.

Are there any names in Blocks.txt that wouldn't match PropertyValueAliases.txt given the leniency?

No. But note that this is also true for @hashseed’s suggestion combined with case-insensitivity (which is what I was proposing here), which would be a stricter solution than UAX44-LM2. I’d strongly prefer that over UAX44-LM2, at least for the initial spec text + implementations. We can always loosen up the matching algorithm later, but if we allow loose matching right from the start, there’s no going back.

hashseed commented 8 years ago

If matching Blocks.txt is not really that important, I'm actually hesitant to follow UAX44-LM2 at all, including whitespace and underscore. If we simply follow UAX44-LM2, we end up with loose matching, which I thought we agreed on being a bad idea. The reason I mentioned for this is that we do not want to end up with regexps that read /\p{___lower C-A-S-E___}/ui. I don't see why we should carve out a subset of UAX44-LM2 instead of ignoring it altogether.

If we consider following UAX44-LM2 a bad idea, and there is no reason to care about matching Blocks.txt, then I'm in favor of being super strict and only matching what's listed in PropertyValueAliases.txt. We can explicitly state that in the spec text, and add a note about Blocks.txt. I think standardizing on underscore as separator is nicer than having this exception for Blocks.txt. Scripts.txt, for example, uses names with underscores. You could argue that either way could surprise users.

With that in place, we can still gather feedback from developers. If not having loose matching is an actual developer pain point, we can still address that in a future PR.
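For reference, strict matching is what the later standardized (ES2018) property escapes ended up with, as far as I'm aware. The sketch below probes that in whatever engine runs it; `compiles` is a hypothetical helper, not an existing API.

```javascript
// Returns true when the pattern compiles with the u flag,
// false when the engine rejects it with a SyntaxError.
function compiles(source) {
  try {
    new RegExp(source, "u");
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e; // anything else is unexpected
  }
}

console.log(compiles("\\p{Lowercase_Letter}"));    // true
console.log(compiles("\\p{Ll}"));                  // true
console.log(compiles("\\p{Lowercase Letter}"));    // false: space instead of underscore
console.log(compiles("\\p{lowercase_letter}"));    // false: wrong case
console.log(compiles("\\p{___lower C-A-S-E___}")); // false
```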

mathiasbynens commented 8 years ago

Updated https://github.com/mathiasbynens/ecma262/pull/1/files to explicitly mention PropertyAliases.txt & PropertyValueAliases.txt.

mathiasbynens commented 8 years ago

There is now a standalone repo for this proposal: https://github.com/mathiasbynens/es-regex-unicode-property-escapes Let’s move the discussion over there.

srl295 commented 8 years ago

@mathiasbynens OK. I think there's a lot more needed than just support within regex (as important as that is), especially getting the general category and other properties given a codepoint.

in ICU there's u_getIntPropertyValue so

 … =  u_getIntPropertyValue( 'A', UCHAR_GENERAL_CATEGORY); // == U_UPPERCASE_LETTER (Lu)
 … =  u_getIntPropertyValue( 'A', UCHAR_SCRIPT); // USCRIPT_LATIN (Latn)

etc. Not proposing this specific API, just trying to get the concept rolling.

littledan commented 8 years ago

@srl295 Are there use cases that you have in mind where it is important to use the property value, rather than test whether the character has a particular property value? That would help motivate adding such an API.

srl295 commented 8 years ago

@littledan sure, anything that's not just a single boolean:

sure, you could do

if ( /\p{Gc=Lo}/u.test('A') ) {
  …
} else if ( /\p{Gc=Lm}/u.test('A') ) {
  …
} else if ( /\p{Gc=Mc}/u.test('A') ) {
  …
}

… but why?

Actually I would prefer making the property available over extending regex. Because if you have the properties, you can implement regex in JS. But without the properties enumerated, it's a lot harder to do the reverse.
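To make the "but why?" concrete, here is a rough sketch of a property lookup built only on regex tests. It assumes an engine with ES2018 property escapes, checks a small hand-picked subset of General_Category values, and uses the hypothetical name `generalCategory`.

```javascript
// A hand-picked subset of General_Category short values (not the full list).
const CANDIDATES = ["Lu", "Ll", "Lo", "Lm", "Mc", "Nd", "Po"];

// One anchored pattern per candidate value, compiled once.
const PATTERNS = CANDIDATES.map(
  (value) => [value, new RegExp(`^\\p{gc=${value}}$`, "u")]
);

// Hypothetical helper: try each candidate until one matches.
// Returns undefined when the input is not a single code point
// or its category is not in the subset above.
function generalCategory(ch) {
  for (const [value, pattern] of PATTERNS) {
    if (pattern.test(ch)) return value;
  }
  return undefined;
}

console.log(generalCategory("A")); // "Lu"
console.log(generalCategory("и")); // "Ll"
console.log(generalCategory("3")); // "Nd"
```

The lookup costs one regex test per candidate value, which is the awkwardness the comment above is pointing at.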

hashseed commented 8 years ago

@srl295 I don't think exposing a way to test for property value for a particular character should affect this proposal.

jungshik commented 7 years ago

@srl295 Are there use cases that you have in mind where it is important to use the property value, rather than test whether the character has a particular property value? That would help motivate adding such an API.

Let's suppose that there's such a use case. Even then, wouldn't it be better to make that API a part of Ecma 262 instead of Ecma 402 (given what has been added to Regex)?

littledan commented 7 years ago

It's a somewhat esoteric question which place this lands in; the 262/402 split doesn't correspond to the split in some implementations. For example, V8 does not support normalization or Unicode RegExp properties when "i18n" is compiled out. I suspect it's not the only one.

A rough argument for putting it in 402 is, this is where the library functions for things that aren't methods on existing objects go. And it seems reasonable to make this a property of the Intl object.

jungshik commented 7 years ago

For example, V8 does not support normalization or Unicode RegExp properties when "i18n" is compiled out. I suspect it's not the only one.

Well, IMHO the current 'V8_INTL_SUPPORT' eventually needs to be split into two, or its 'boundary' has to be changed, once https://bugs.chromium.org/p/v8/issues/detail?id=5500#c9 (replace unibrow with ICU) is resolved. One should be about Ecma402 support (Intl. API support) and the other should be about whether ICU is used or not (ICU vs unibrow). Depending on how the above V8 bug is resolved, the latter might not be necessary at all (i.e. ICU is always used), in which case Unicode RegExp properties (a part of Ecma262) would always be supported regardless of Intl. API (Ecma 402) support.

littledan commented 6 years ago

For anyone who's looking to contribute to ECMA 402, this is a "shovel ready" project, just in need of a writeup for a concrete API, and presentation to the committee.

srl295 commented 4 years ago

OK so given the current status of regex properties probably a good way forward on this is something like:

"и".getUnicodeProperty("Gc", {type: "short"}) // "Ll"
"и".getUnicodeProperty("Gc", {type: "long"}) // "Lowercase_Letter"
"и".getUnicodeProperty("Gc") // "Lowercase_Letter"  - type:long is default
"и".getUnicodeProperty("General_Category") // "Lowercase_Letter" - "Gc" ≈ "General_Category"

and this is kind of the inverse of regex:

/\p{gc=Ll}/u.test('и') // true

sffc commented 4 years ago

I like the looks of this. The obvious question is, what happens when you pass a multi-code-point string? RangeError, or array of properties with length equal to the code point length of the string?

"иии".getUnicodeProperty("Gc", {type: "short"});  // ["Ll", "Ll", "Ll"]

This would also have the (positive or negative) side-effect of giving a method to get the code point length of the string without using a code point iterator.

String.prototype.codePointCount = function() {
  return this.getUnicodeProperty("Gc").length;
}

srl295 commented 4 years ago

perhaps replace the proposal above with "Aa3".getUnicodePropertyAt(1, "gc") // "Lowercase_Letter" and make the index optional (as with charCodeAt).

So minimally: "и".getUnicodeProperty("gc"), maximally "и".getUnicodeProperty(0, "gc", {type: "long"})

But I could see use cases (such as text classification) where returning an array of values could be useful.

srl295 commented 4 years ago

as to 262/402: if an implementation can support /\p{gc=Ll}/u.test('и') it could support my proposed API, so it seems like they can go together, feature wise.

leobalter commented 4 years ago

I like the looks of this. The obvious question is, what happens when you pass a multi-code-point string? RangeError, or array of properties with length equal to the code point length of the string?

I have some preference to always return an array containing properties objects for each entry of the string.

getUnicodePropertyAt can be added alongside it in this case, rather than as a replacement.

Also, perhaps the character index could come last, so that it can default to 0:

"и".getUnicodeProperty("Gc", {type: "long", index: 0}); // no index prop === index = 0

// or

"и".getUnicodeProperty("Gc", {type: "long"}, 0); // (no arg === undefined, casts to 0)

"0и".getUnicodeProperty("Gc", {type: "long"}, 1);

I prefer the index as an option property rather than an argument, to avoid extra magic such as allowing only 2 args when the last one is the index:

"0и".getUnicodeProperty("Gc", 1); // if not an object, use the value as index

// I prefer for consistency:

"и".getUnicodeProperty("Gc", {index: 0});

srl295 commented 4 years ago

Perhaps we just drop the index completely.

"и".getUnicodeProperty("Gc") -> // [ "Lowercase_Letter" ]
"и!".getUnicodeProperty("Gc") -> // [ "Lowercase_Letter", "Other_Punctuation" ]
"и!".getUnicodeProperty("Gc", {type: "short"}) -> // [ "Ll", "Po" ]

The caller can always subset the input string, or output array, of getUnicodeProperty().
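A rough sketch of how this array-returning shape could be emulated today, written as a standalone function rather than a String.prototype method. Everything here is an assumption for illustration: the function name, the `type` option, the restriction to General_Category, and the regex-based lookup. It is not a proposed implementation.

```javascript
// General_Category short values mapped to their long names.
const GC_LONG = {
  Lu: "Uppercase_Letter", Ll: "Lowercase_Letter", Lt: "Titlecase_Letter",
  Lm: "Modifier_Letter", Lo: "Other_Letter",
  Mn: "Nonspacing_Mark", Mc: "Spacing_Mark", Me: "Enclosing_Mark",
  Nd: "Decimal_Number", Nl: "Letter_Number", No: "Other_Number",
  Pc: "Connector_Punctuation", Pd: "Dash_Punctuation", Ps: "Open_Punctuation",
  Pe: "Close_Punctuation", Pi: "Initial_Punctuation", Pf: "Final_Punctuation",
  Po: "Other_Punctuation",
  Sm: "Math_Symbol", Sc: "Currency_Symbol", Sk: "Modifier_Symbol", So: "Other_Symbol",
  Zs: "Space_Separator", Zl: "Line_Separator", Zp: "Paragraph_Separator",
  Cc: "Control", Cf: "Format", Cs: "Surrogate", Co: "Private_Use", Cn: "Unassigned",
};

// One anchored pattern per value, compiled once.
const GC_PATTERNS = Object.keys(GC_LONG).map(
  (short) => [short, new RegExp(`^\\p{gc=${short}}$`, "u")]
);

// Hypothetical stand-in for the proposed method: one entry per code point.
function getUnicodeProperty(str, property, { type = "long" } = {}) {
  if (property !== "Gc" && property !== "General_Category") {
    throw new RangeError(`property not supported in this sketch: ${property}`);
  }
  // Array.from iterates a string by code point, not by UTF-16 code unit.
  return Array.from(str, (ch) => {
    const [short] = GC_PATTERNS.find(([, pattern]) => pattern.test(ch));
    return type === "short" ? short : GC_LONG[short];
  });
}

console.log(getUnicodeProperty("и!", "Gc"));                  // [ 'Lowercase_Letter', 'Other_Punctuation' ]
console.log(getUnicodeProperty("и!", "Gc", { type: "short" })); // [ 'Ll', 'Po' ]
```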

mathiasbynens commented 4 years ago

I like the ideas presented here so far! Why String.prototype.* and not a static Intl.* method? (Not necessarily opposed — just curious since Intl.* seemed more natural to me.)

srl295 commented 4 years ago

@mathiasbynens well re https://github.com/tc39/ecma402/issues/90#issuecomment-540710261 it's more related to regex or String.prototype.toLocaleLowerCase or String.prototype.normalize or codePointAt than anything else in Intl.

leobalter commented 4 years ago

The only possible blocker I see for a prototype method is overloading the object.

I still prefer it, as this method does not require a specific locale identifier; in fact, it is much more related to general JS functionality and could even be a candidate for ECMA-262.

I don't have any strong preference on how the method signature should be, but I'm tending to agree with the proposed solutions from @srl295 here.

srl295 commented 4 years ago

The only possible blocker I see for a prototype method is overloading the object.

what does that mean?

leobalter commented 4 years ago

Implementors might prefer not adding new methods to String.prototype or other built-ins. Even so, I believe it's a reasonable addition considering the use case.

sffc commented 4 years ago

For the various string prototype methods, I think an interesting design question is how we handle supplementary code points. Some various options:

  1. Output one property per code point (length of string != length of property array)
  2. Output the property twice for supplementary code points
  3. Output the property followed by null

A related question is, how does a user correctly get the property for a specific code point in a string? If you want to get the property for the 3rd code point, you have to figure out what index corresponds to the 3rd code point. Although, maybe in practice this is not a common case since users are mostly interested in strings with exactly one code point.
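A small sketch of the index problem described above, using U+1E92B as the supplementary code point. The helper name `codePointStringAt` is hypothetical.

```javascript
// "e", then U+1E92B (encoded as a surrogate pair), then "x".
const s = "e\u{1E92B}x";

console.log(s.length);       // 4 (UTF-16 code units)
console.log([...s].length);  // 3 (code points)

// Hypothetical helper: the n-th code point (0-based) as a string.
// It walks the string with the code point iterator, so it is O(n).
function codePointStringAt(str, n) {
  return [...str][n];
}

console.log(codePointStringAt(s, 2) === "x"); // true
console.log(s[2] === "x");                    // false: s[2] is the low surrogate
console.log(s.codePointAt(1).toString(16));   // "1e92b"
```

Code unit indices and code point indices diverge after the first supplementary character, which is why option 1 gives an array whose length differs from the string's length.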

srl295 commented 4 years ago

This should be in code point space, definitely.

  1. Output one property per code point (length of string != length of property array)

Right.

Hmm, there isn't a String.codePointLen is there? Perhaps there should be.

A related question is, how does a user correctly get the property for a specific code point in a string? If you want to get the property for the 3rd code point, you have to figure out what index corresponds to the 3rd code point. Although, maybe in practice this is not a common case since users are mostly interested in strings with exactly one code point.

Hm, because "𞤫".length = 2.

Maybe there are some other missing functions:

If you pass in a high or low surrogate, you'll get the properties on the surrogate. But that shouldn't be the normal case.

mathiasbynens commented 4 years ago

Hmm, there isn't a String.codePointLen is there? Perhaps there should be.

There's a lot of history here. See e.g. https://esdiscuss.org/topic/how-to-count-the-number-of-symbols-in-a-string and more recently https://esdiscuss.org/topic/proposal-string-prototype-codepointcount.

The current TL;DR is that [...string].length achieves this and so (IMHO) there's no need for an explicit codePointLength.


Is it time to graduate (in a good way!) this issue thread to a fully-fledged GitHub repository of its own? It feels like there could be separate issue threads for each problem that is being discussed here.

srl295 commented 4 years ago

Hmm, there isn't a String.codePointLen is there? Perhaps there should be.

There's a lot of history here.

I have some of the history in my desk drawer…

See e.g. https://esdiscuss.org/topic/how-to-count-the-number-of-symbols-in-a-string and more recently https://esdiscuss.org/topic/proposal-string-prototype-codepointcount.

https://esdiscuss.org/topic/proposal-string-prototype-codepointcount#content-7 makes the case for the character segmenter.

The current TL;DR is that [...string].length achieves this and so (IMHO) there's no need for an explicit codePointLength.

Except that it doesn't, it counts code units rather than codepoints. This remains a hole… and, without resolution, may lean this API towards operating on a single codepoint at a time. Getting the property on a surrogate is not going to be expected behavior for most people, such as "\uD83A".getUnicodeProperty("Gc").

Is it time to graduate (in a good way!) this issue thread to a fully-fledged GitHub repository of its own? It feels like there could be separate issue threads for each problem that is being discussed here.

Yeah, I think so. How do we (or I) do that?

mathiasbynens commented 4 years ago

The current TL;DR is that [...string].length achieves this and so (IMHO) there's no need for an explicit codePointLength.

Except that it doesn't, it counts code units rather than codepoints.

What makes you say that? string.length counts UTF-16 code units, but [...string].length does count code points.

Is it time to graduate (in a good way!) this issue thread to a fully-fledged GitHub repository of its own? It feels like there could be separate issue threads for each problem that is being discussed here.

Yeah, I think so. How do we (or I) do that?

For new proposals, I tend to create a new repository (could be anywhere, e.g. under your own account) and start fleshing out a README with the stage 1 sections (Status, Motivation, Proposed solution, High-level API, FAQ) + filing issues for any open questions.

ljharb commented 4 years ago

@srl295 spread uses the string iterator, which produces an array of code points, not units.

srl295 commented 4 years ago

The current TL;DR is that [...string].length achieves this and so (IMHO) there's no need for an explicit codePointLength.

Except that it doesn't, it counts code units rather than codepoints.

What makes you say that? string.length counts UTF-16 code units, but [...string].length does count code points.

OK. I have never used "...", I thought you were saying "some string".length, so I misunderstood what was being achieved. My mistake and lack of ES! Now that I realize ... is part of the syntax: yes, it does produce the length in code points. Hey, that works.

> [..."𞤫"]
[ '𞤫' ]
> [..."𞤫"].length
1
> [..."e𞤫"].length
2
> [..."e𞤫𞤫"].length
3
> [..."𞤫"]
[ '𞤫' ]
> ['𞤫'].length
1
> ['𞤫𞤫'].length
1
> typeof ( [..."e"] )
'object'

Is it time to graduate (in a good way!) this issue thread to a fully-fledged GitHub repository of its own? It feels like there could be separate issue threads for each problem that is being discussed here.

Yeah, I think so. How do we (or I) do that?

For new proposals, I tend to create a new repository (could be anywhere, e.g. under your own account) and start fleshing out a README with the stage 1 sections (Status, Motivation, Proposed solution, High-level API, FAQ) + filing issues for any open questions.

OK great. Is there a template repo?

ljharb commented 4 years ago

https://github.com/tc39/template-for-proposals (that’s for 262, not sure if 402 has something different)

srl295 commented 4 years ago

in practice, does [..."e𞤫𞤫"].length get optimized somehow, to where it doesn't need to actually iterate and such? just curious.

srl295 commented 4 years ago

https://github.com/srl295/es-unicode-properties

srl295 commented 4 years ago

https://github.com/tc39/template-for-proposals (that’s for 262, not sure if 402 has something different)

this may end up being 262…

mathiasbynens commented 4 years ago

in practice, does [..."e𞤫𞤫"].length get optimized somehow, to where it doesn't need actually iterate and such? just curious.

https://v8.dev/blog/spread-elements

leobalter commented 4 years ago

I believe this is much more convenient in 262 even w/ the fact it involves a good amount of work from the delegates working w/ ECMA-402.

As long as we have someone to champion it in a TC39 meeting, this should be well clarified.