mathiasbynens / emoji-test-regex-pattern

A regular expression pattern for Java/JavaScript to match all emoji in the emoji-test.txt file provided by UTS#51.
MIT License

UTC: propose exposing emoji-test.txt as a Unicode property of strings #7

Open mathiasbynens opened 3 years ago

mathiasbynens commented 3 years ago

It would be amazing if Unicode would expose all emoji-test.txt strings as a property of strings.

That, combined with property-of-strings support in regular expressions, would in the long term reduce the need for this repository in favor of a simple, straightforward regular expression pattern of the form /\p{EmojiTest}/v (property name TBD).
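For comparison, ECMAScript already exposes one property of strings, \p{RGI_Emoji}, under the v flag. The following is a minimal sketch of how such a pattern is used today, assuming an engine that supports the v flag; \p{EmojiTest} itself remains hypothetical, and the helper names makeRgiEmojiMatcher and isRgiEmoji are made up for illustration.

```javascript
// Build the regex from a string so that engines without `v` flag support
// throw a catchable SyntaxError here instead of failing when the file is parsed.
function makeRgiEmojiMatcher() {
  try {
    // \p{RGI_Emoji} is an existing property of strings; \p{EmojiTest} is not.
    return new RegExp('^\\p{RGI_Emoji}$', 'v');
  } catch (error) {
    return null; // the `v` flag is not available in this engine
  }
}

const rgiEmoji = makeRgiEmojiMatcher();

// Returns true/false when the `v` flag is supported, undefined otherwise.
function isRgiEmoji(string) {
  if (rgiEmoji === null) return undefined;
  return rgiEmoji.test(string);
}

console.log(isRgiEmoji('\u{1F600}')); // single code point emoji
console.log(isRgiEmoji('\u{1F468}\u200D\u{1F469}\u200D\u{1F467}')); // ZWJ sequence
console.log(isRgiEmoji('a')); // plain letter
```

Unlike a property of code points, a property of strings can match multi-code-point sequences such as the ZWJ family above with a single \p{…} escape, which is what would make a dedicated emoji-test.txt property a drop-in replacement for a generated pattern.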

It could even be an enumerated property, to provide the full info, e.g.

\p{Emoji_Qualification=full}

Values could be full, minimal, unqualified, or na. emoji-test.txt could then be generated from that property.
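Until such a property exists, the same information has to be derived from the data file. Below is a rough sketch of that lookup, assuming the usual emoji-test.txt data-line layout (code points, a semicolon, the status, then a comment). The sample lines are a tiny hand-picked excerpt, and parseEmojiTestLine, groupByStatus, and qualificationOf are illustrative names, not part of this repository or of any Unicode API.

```javascript
// A few lines in the emoji-test.txt layout:
//   <code points> ; <status> # <emoji> E<version> <name>
const sampleLines = [
  '# group: Smileys & Emotion',
  '1F600 ; fully-qualified # 😀 E1.0 grinning face',
  '263A FE0F ; fully-qualified # ☺️ E0.6 smiling face',
  '263A ; unqualified # ☺ E0.6 smiling face',
  '1F636 200D 1F32B ; minimally-qualified # 😶‍🌫 E13.1 face in clouds',
  '1F3FB ; component # 🏻 E1.0 light skin tone',
];

// Parse one line into { string, status }, or null for comments and blanks.
function parseEmojiTestLine(line) {
  const data = line.split('#')[0].trim(); // drop the trailing comment
  if (data === '') return null;
  const [codePointsField, statusField] = data.split(';').map((s) => s.trim());
  if (!codePointsField || !statusField) return null;
  const codePoints = codePointsField.split(/\s+/).map((hex) => parseInt(hex, 16));
  return { string: String.fromCodePoint(...codePoints), status: statusField };
}

// Group all strings by their status value, as an enumerated property would.
function groupByStatus(lines) {
  const groups = new Map();
  for (const line of lines) {
    const entry = parseEmojiTestLine(line);
    if (entry === null) continue;
    if (!groups.has(entry.status)) groups.set(entry.status, new Set());
    groups.get(entry.status).add(entry.string);
  }
  return groups;
}

// Look up the status of a string; 'na' stands in for "not listed".
function qualificationOf(groups, string) {
  for (const [status, strings] of groups) {
    if (strings.has(string)) return status;
  }
  return 'na';
}

const groups = groupByStatus(sampleLines);
console.log(qualificationOf(groups, '\u263A\uFE0F'));
console.log(qualificationOf(groups, '\u263A'));
console.log(qualificationOf(groups, 'a'));
```

With a built-in enumerated property, each of these lookups would collapse into a single pattern such as the \p{Emoji_Qualification=full} form suggested above.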

Ref. https://github.com/node-unicode/node-unicode-data/issues/63

macchiati commented 2 years ago

A bit of progress: I presented https://www.unicode.org/L2/L2022/22160-rgi-emoji-qual.pdf at the UTC. There is no agreement on adding it yet (too late for Unicode 15.0), but I will prepare a revised version for next time.

yisibl commented 4 days ago

Hi @macchiati

Unicode 16.0 has been released, any news on this proposal?

macchiati commented 3 days ago

Markus can give more details, but I think the biggest noticeable change (aside from additions) for implementations will be when ICU releases, with collation.

macchiati commented 3 days ago

See also https://blog.unicode.org/2024/09/unicode-cldr-46-beta-available-for.html

markusicu commented 3 days ago

To answer @yisibl’s actual question...

Unicode 16.0 has been released, any news on this proposal?

UTS 51 now defines ED-28, RGI_Emoji_Qualification: the status of emoji sequences. This is an enumerated property of strings, defined by the emoji-test.txt file ... ... The property value names and short aliases are:


I haven't thought about this for a while... This would be the first enumerated property of strings in ICU.

Looking at https://www.unicode.org/Public/emoji/16.0/emoji-test.txt, the file actually has four status values, including “component”, which is not listed in the UTS 51 definition.

@macchiati can you elaborate on why UTS 51 defines three values but the data file has four? Is “component” intentionally omitted?

I guess I need to start thinking about how to represent this property in ICU. I just created https://unicode-org.atlassian.net/browse/ICU-22931

@mathiasbynens I also guess that you would like me to implement this for one of the 2025 ICU releases...? You might help me justify this for annual planning.