golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

proposal: unicode: add emoji properties #45264

Closed gudvinr closed 3 years ago

gudvinr commented 3 years ago

What version of Go are you using (go version)?

$ go version
go version go1.16.2 linux/amd64

What do you propose?

There are a number of range tables in the unicode package of the stdlib that define some of the character properties from the Unicode Character Database.

Unicode also has additional sets of properties besides the ones defined in the core standard. These properties are described in technical reports.
Notably, UTS#51 defines sets of properties to determine which Unicode characters are emoji:

The character property Regional_Indicator is already present in the unicode package.

Data source

At the time of writing, go1.16 contains range tables from Unicode 13.0.0.
Thus, the properties for emoji data should also be taken from UCD emoji data 13.0.0.

Package changes

New RangeTable variables (order follows emoji-data.txt):

Including functions for checking individual character properties, like IsEmoji, IsEmojiModifier, IsEmojiModifierBase, etc., doesn't make a lot of sense since there's already the unicode.In function.
However, a function of the form IsEmojiData that checks the range tables for all emoji-related properties might be useful, e.g. to filter all emoji components out of text.

To make these properties usable in the regexp package, their names (or corresponding abbreviations) should be included in Categories or Scripts.

Additional notes

Although UTS#51 defines emoji sequences, this issue does not cover them, since an emoji sequence consists of multiple characters and the unicode package doesn't have a concept of a "character sequence".

Examples in other languages

seankhliao commented 3 years ago

cc @robpike

ianlancetaylor commented 3 years ago

CC @mpvl

robpike commented 3 years ago

I suspect this would be better done in a separate package, probably not in the standard library, as I believe that the data set will grow substantially over time and most programs won't need it.

Kimwing222 commented 3 years ago

40724

Duplicate of @ @

robpike commented 3 years ago

That's not the right issue number for the duplicate, or else not the right issue for that duplicate number.

smasher164 commented 3 years ago

I agree that this should be done outside the stdlib. The Unicode technical reports (UAX, UTS, UTR) make it clear that they are independent specifications, and that conformance to the Unicode standard does not imply conformance to the technical report. It's also unclear whether emoji would be all we add, or whether other reports, like identifiers, script properties, etc., would follow.

I will say however, that the current API for constructing range tables is not very ergonomic. I ended up forgoing efficiency and used functions just so I had access to set union and set difference when combining range tables: https://github.com/smasher164/xid/blob/560c18f776900eb8c8b061d155309097f2f68545/xid.go.
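A minimal sketch of the function-based approach described above, not a copy of the linked code. The names set, table, union, difference and idStart are hypothetical. The composition follows the UAX #31 derivation of ID_Start (letters, letter numbers and Other_ID_Start, minus Pattern_Syntax and Pattern_White_Space), using tables that package unicode already exports.

```go
package main

import (
	"fmt"
	"unicode"
)

// set is a rune predicate standing in for a range table.
type set func(rune) bool

// table wraps an existing range table as a set.
func table(t *unicode.RangeTable) set {
	return func(r rune) bool { return unicode.Is(t, r) }
}

// union returns the set of runes contained in at least one of sets.
func union(sets ...set) set {
	return func(r rune) bool {
		for _, s := range sets {
			if s(r) {
				return true
			}
		}
		return false
	}
}

// difference returns the set of runes that are in a but not in b.
func difference(a, b set) set {
	return func(r rune) bool { return a(r) && !b(r) }
}

// idStart approximates ID_Start by composing stdlib tables at call time.
var idStart = difference(
	union(table(unicode.L), table(unicode.Nl), table(unicode.Other_ID_Start)),
	union(table(unicode.Pattern_Syntax), table(unicode.Pattern_White_Space)),
)

func main() {
	for _, r := range "aⅧ1 " {
		fmt.Printf("%q %v\n", r, idStart(r))
	}
}
```

Each lookup walks several tables instead of one precomputed table, which is the efficiency trade-off mentioned above.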

gudvinr commented 3 years ago

smasher164: The Unicode technical reports (UAX, UTS, UTR) make it clear that they are independent specifications, and that conformance to the Unicode standard does not imply conformance to the technical report

This isn't exactly true. While UTS and UTR are indeed not required for conformance, a Unicode Standard Annex (UAX) may be required by the Standard. Chapter 3 of the Standard has a list of such annexes.

robpike: I suspect this would be better done in a separate package, probably not in the standard library
smasher164: I agree that this should be done outside the stdlib

As pointed out, UTS isn't part of the Unicode Standard, so I think it is quite reasonable not to put emoji stuff in unicode.
While natural languages do not usually grow over time, emoji data will grow, and forcing people to import data they don't want doesn't look good.

Still, I feel that at least the range tables (without emoji sequences and other trickery) should live somewhere close to unicode, to keep those tables in sync with the rest of the unicode tables. There are a couple of reasons to do so:

smasher164 commented 3 years ago

I agree, I've often wanted unicode to be versioned separately from the standard library, especially when it came to keeping these properties in sync with newer versions of unicode. I ended up using build tags to hack around that:

unicodeTestVersion_114.go:

// +build go1.14,!go1.16

package xid

const unicodeTestVersion = "12.1.0"

unicodeTestVersion_116.go:

// +build go1.16

package xid

const unicodeTestVersion = "13.0.0"

mpvl commented 3 years ago

A natural place to put these properties would be x/text/unicode/emoji.

This repo already has a runenames package, which could naturally hold emoji names as well (although I'm unsure how this fits with emoji sequences).

The argument for putting it in core so it can be used in regexp is valid. However, this would mean including more tables in core that are currently missing. A more reasonable solution to support emoji in regexp would be to allow user-defined character classes, allowing users to add classes from x/text, for instance.

It should also be mentioned what the goal is of these tables. Depending on the application, rangetables may not be the best representation. Judging from UTS #51, for instance, a UTF-8 trie, which allows associating a set of related properties with a single rune, seems more appropriate. The x/text repo has all the infrastructure in place to generate such tries conveniently.

mpvl commented 3 years ago

@smasher164: the x/text repo uses a similar trick. The generators are multi-version aware and will automatically add to/modify build tags for generated tables.

It uses it, however, to ensure that the versions align with the latest Go. Your comment, however, suggests that you would want the other way around: have Go adopt a later version. This gives rise to the idea that core could use tables from x/text directly. Core already uses x/text for various packages and x/text already generates the tables for core. So if instead core would use the tables from x/text, x/text could advance the unicode version ahead of core, while ensuring consistency between packages.

This obviously would require a separate proposal. There are some serious implications for this. Also, there are packages with hardcoded range tables. But all this could be worked around.

gudvinr commented 3 years ago

I don't want to be "that guy", but which policies govern decisions about what gets included where? Earlier it was said that "conformance to the Unicode standard does not imply conformance to the technical report".
Does that mean that something that conforms to the Unicode Standard should be included in the stdlib (or at least be considered worthy of inclusion)?

@mpvl I see that x/text/runenames already contains names for everything from the UCD. That includes emojis. The UCD essentially is UAX#41. From the document I mentioned earlier it is clear that UAX#41 "is considered part of Version 13.0 of the Unicode Standard", yet it is not included in the stdlib. The same goes for UAX#15 (normalization) and UAX#9 (bidirectional). But UAX#38 (unihan) is a part of the stdlib unicode package.

While x/text/unicode contains a number of annexes included in the Standard, it uses a different Unicode version, which makes it somewhat incompatible with the stdlib.

Despite UTS being an "independent specification" by definition, the Standard itself clearly mentions that only a few UTS are synchronized with its version: UTS#10 (collation), UTS#39 (security), UTS#46 (idna) and UTS#51 (emoji).
It may be fine for tool packages to be compatible with an earlier version, but it is very frustrating for packages that contain databases. That includes emojis and runenames too.

For example, conformance to Unicode 13 means that a package should contain "Khitan Small Script". And unicode indeed does, but runenames does not. Also, in the documentation of runenames the link to the UCD points to the "latest" version, but the tables are from 12.0.0.
I can't say that Khitan is a popular language, but that made me think that unicode-related packages are now in a slightly messy state.

However, this would mean including more tables in core that are currently missing

Since the other tables are probably covered by other Technical Reports, adding a single one doesn't imply that the others should be included too. These are still independent specs.

A more reasonable solution to support emoji in regexp would be to allow user-defined character classes, allowing users to add classes from x/text, for instance.

I agree that there should be a more flexible way to do so. Although I don't think that I would be able to write a proper proposal for that.

It should also be mentioned what the goal is of these tables.

I can't speak for others, but I ended up in a situation where I need to be aware of emojis in text. First, to correctly remove such characters from text or replace them with non-graphical representations (e.g. using names from the UCD). Second, to count the number of characters when a single emoji or an emoji sequence represents a single "character".
It seemed that range tables and regexp support would be sufficient, and they are already used by the Go library to represent language scripts.

mpvl commented 3 years ago

While x/text/unicode contains a number of annexes included in the Standard, it uses a different Unicode version, which makes it somewhat incompatible with the stdlib.

The stdlib tables are generated from x/text. Core even depends on x/text and build tags in x/text ensure that the Unicode version of x/text is matched to that of core. So tip of x/text is ahead in Unicode version compared to core.

For example, conformance to Unicode 13 means that a package should contain "Khitan Small Script". And unicode indeed does, but runenames does not. Also, in the documentation of runenames the link to the UCD points to the "latest" version, but the tables are from 12.0.0.

That seems like a bug in runenames' generate script if true. It should update automatically with a Unicode upgrade. @nigeltao.

smasher164 commented 3 years ago

@mpvl

A more reasonable solution to support emoji in regexp would be to allow user-defined character classes, allowing users to add classes from x/text, for instance.

I could imagine an API like

func RegisterClass(name string, table *unicode.RangeTable)

in either regexp or regexp/syntax. Or if it needed to be scoped per *regexp.Regexp, an alternative constructor like

func WithClass(name string, table *unicode.RangeTable, expr string) (*Regexp, error)

Either way, this would be a separate proposal.

Your comment, however, suggest that you would want the other way around: have Go adopt a later version.

I could imagine the stdlib being behind the supported version in x/text. That way, for example, someone who wanted to use unicode 13 functionality on Go 1.15 could simply import x/text/unicode.


@gudvinr

Maybe the way forward here is to either define these properties in x/text, and file a proposal for regexp?

mpvl commented 3 years ago

@gudvinr

At the same time, x/text/unicode uses different version scheme and has a separate release schedule.

Core Unicode tables are generated from x/text and core even imports x/text for various use cases, like normalization. Also, the x/text tables use build tags to keep these tables in sync. It's a bug for core Unicode packages in x/text to not be updated to the right version.

Theoretically, core could refer to x/text for all its tables, which would allow getting rid of the build tag trick and would allow using newer Unicode versions independently from the Go version. That needs some serious thought and some adjustment to existing packages like strconv IIRC.

mpvl commented 3 years ago

@smasher164

I could imagine an API like

Something like that. Passing a function with a signature func(rune) bool instead of a range table makes more sense to me, though. It doesn't always make sense to represent rune properties as a range table (for instance for bidi and, I suspect, emoji).

gudvinr commented 3 years ago

It's a bug for core Unicode packages in x/text to not be updated to the right version.

I figured out what's wrong. It is not a bug in x/text and not a package issue per se. I suppose the build environment for pkgsite uses some older Go release and picks the older table, which has // +build go1.14,!go1.16. There's no indication of that on pkgsite, and it pulls the latest stable release of Go itself.
And since browsing the local package cache isn't very convenient, I never tried to look there. But after you mentioned the build tag trick I dug through the commit history and that became clear.

Theoretically, core could refer to x/text for all its tables

I personally do not like the idea of pulling v0 packages for use in somewhat stable releases of Go.
However, is it possible to use emojis and their properties as an experimental playground first, and based on the results of this experiment make changes to the rest of the tables later?

Whether or not you plan on using range tables for these kinds of characters, I suppose it is now decided to put them in a separate package within the x/text repository. This is a good thing in the sense that it also makes it possible to add other emoji-related properties and functionality described in UTS#51 in the future.

gudvinr commented 3 years ago

Maybe the way forward here is to either define these properties in x/text, and file a proposal for regexp?

I think that makes sense, yes. An API for pluggable character classes in regexp is fine by me and probably covers other use cases too. It would be wise to file a separate proposal and discuss the details of the implementation there.

rsc commented 3 years ago

This proposal has been added to the active column of the proposals project and will now be reviewed at the weekly proposal review meetings. — rsc for the proposal review group

rsc commented 3 years ago

Even if we added these to unicode.Properties, regexp only does Categories and Scripts.
And these emoji properties are properties, not categories or scripts.

Do you need emoji things in regexp, or was that just brought up for completeness?

gudvinr commented 3 years ago

Do you need emoji things in regexp

I think I do not need regexp support. At least for me regexp isn't a top priority.
If emoji support eventually lands in either unicode or x/text and is at least as convenient to use as properties in regexp would be, then I can live with it.

But it's not a simple question, to be honest. In a short time span I had to solve multiple unrelated problems with emojis. I feel that some of them could be solved more easily using some sort of property handles in regexp.

regexp only does Categories and Scripts

Is there any reason for that? I noticed it when I looked through the regexp sources, but it's not clear why \p{Dash}, \p{Hyphen} and such are ignored. If I'm not wrong, the ICU library doesn't have such limitations, for example.
I do not mean to imply that "if some other %thing% does it then Go should too", though.

rsc commented 3 years ago

I don't remember why I left Property out. Possibly it just seemed like too much for too little benefit. Category and Script are more clearly useful.

smasher164 commented 3 years ago

As an anecdote, the python regex package I used to test my identifier validating library does support properties. I suppose if the regexp package supports user-definable properties, it wouldn't have the burden of adding them all.

rsc commented 3 years ago

As long as regexp is not a requirement, then adding these to unicode.Properties probably makes sense. The thing I don't know is what else is missing from unicode.Properties. Can someone cross-check against the full Unicode property list and see what else is missing besides these emoji properties?

gudvinr commented 3 years ago

Can someone cross-check against the full Unicode property list and see what else is missing besides these emoji properties?

I took a look at UAX#44 and marked with + what's in Properties since it's not much:

```
General
    Name
    Name_Alias
    Block
    Age
    General_Category
    Script
    Script_Extensions
   +White_Space (binary)
    Alphabetic (binary)
    Hangul_Syllable_Type
   +Noncharacter_Code_Point (binary)
    Default_Ignorable_Code_Point (binary)
   +Deprecated (binary)
   +Logical_Order_Exception (binary)
   +Variation_Selector (binary)
Case
    Uppercase (binary)
    Lowercase (binary)
    Lowercase_Mapping
    Titlecase_Mapping
    Uppercase_Mapping
    Case_Folding
    Simple_Lowercase_Mapping
    Simple_Titlecase_Mapping
    Simple_Uppercase_Mapping
    Simple_Case_Folding
   +Soft_Dotted (binary)
    Cased (binary)
    Case_Ignorable (binary)
    Changes_When_Lowercased (binary)
    Changes_When_Uppercased (binary)
    Changes_When_Titlecased (binary)
    Changes_When_Casefolded (binary)
    Changes_When_Casemapped (binary)
Emoji (all binary)
    Emoji
    Emoji_Presentation
    Emoji_Modifier
    Emoji_Modifier_Base
    Emoji_Component
    Extended_Pictographic
Numeric
    Numeric_Value
    Numeric_Type
   +Hex_Digit (binary)
   +ASCII_Hex_Digit (binary)
Normalization
    Canonical_Combining_Class
    Decomposition_Mapping (not recommended)
    Composition_Exclusion (binary) (not recommended)
    Full_Composition_Exclusion (binary) (not recommended)
    Decomposition_Type
    FC_NFKC_Closure (deprecated)
    NFC_Quick_Check
    NFKC_Quick_Check
    NFD_Quick_Check
    NFKD_Quick_Check
    Expands_On_NFC (binary) (deprecated)
    Expands_On_NFD (binary) (deprecated)
    Expands_On_NFKC (binary) (deprecated)
    Expands_On_NFKD (binary) (deprecated)
    NFKC_Casefold
    Changes_When_NFKC_Casefolded (binary)
Shaping and Rendering
   +Join_Control (binary)
    Joining_Group
    Joining_Type
    Vertical_Orientation
    East_Asian_Width
   +Prepended_Concatenation_Mark (binary)
Bidirectional
    Bidi_Class
   +Bidi_Control (binary)
    Bidi_Mirrored (binary)
    Bidi_Mirroring_Glyph
    Bidi_Paired_Bracket
    Bidi_Paired_Bracket_Type
Identifiers (all binary)
    ID_Continue
    ID_Start
    XID_Continue
    XID_Start
   +Pattern_Syntax
   +Pattern_White_Space
Segmentation
    Line_Break
    Grapheme_Cluster_Break
    Sentence_Break
    Word_Break
CJK
   +Ideographic (binary)
   +Unified_Ideograph (binary)
   +Radical (binary)
   +IDS_Binary_Operator (binary)
   +IDS_Trinary_Operator (binary)
    Unicode_Radical_Stroke
    Equivalent_Unified_Ideograph
Miscellaneous
    Math (binary)
   +Quotation_Mark (binary)
   +Dash (binary)
   +Hyphen (binary) (deprecated, stabilized)
   +Sentence_Terminal (binary)
   +Terminal_Punctuation (binary)
   +Diacritic (binary)
   +Extender (binary)
    Grapheme_Base (binary)
    Grapheme_Extend (binary)
    Grapheme_Link (binary) (deprecated)
    Unicode_1_Name
    ISO_Comment (deprecated, stabilized)
   +Regional_Indicator (binary)
    Indic_Positional_Category
    Indic_Syllabic_Category
Contributory Properties (not recommended)
   +Other_Alphabetic (binary)
   +Other_Default_Ignorable_Code_Point (binary)
   +Other_Grapheme_Extend (binary)
   +Other_ID_Start (binary)
   +Other_ID_Continue (binary)
   +Other_Lowercase (binary)
   +Other_Math (binary)
   +Other_Uppercase (binary)
    Jamo_Short_Name
```

ZekeLu commented 3 years ago

This is a copy of @gudvinr 's answer above, with missing properties highlighted.

```diff
 General
     Name
     Name_Alias
     Block
     Age
     General_Category
     Script
     Script_Extensions
+    White_Space
     Alphabetic
     Hangul_Syllable_Type
+    Noncharacter_Code_Point
     Default_Ignorable_Code_Point
+    Deprecated
+    Logical_Order_Exception
+    Variation_Selector
 Case
     Uppercase
     Lowercase
     Lowercase_Mapping
     Titlecase_Mapping
     Uppercase_Mapping
     Case_Folding
     Simple_Lowercase_Mapping
     Simple_Titlecase_Mapping
     Simple_Uppercase_Mapping
     Simple_Case_Folding
+    Soft_Dotted
     Cased
     Case_Ignorable
     Changes_When_Lowercased
     Changes_When_Uppercased
     Changes_When_Titlecased
     Changes_When_Casefolded
     Changes_When_Casemapped
 Emoji
     Emoji
     Emoji_Presentation
     Emoji_Modifier
     Emoji_Modifier_Base
     Emoji_Component
     Extended_Pictographic
 Numeric
     Numeric_Value
     Numeric_Type
+    Hex_Digit
+    ASCII_Hex_Digit
 Normalization
     Canonical_Combining_Class
     Decomposition_Mapping (not recommended)
     Composition_Exclusion (not recommended)
     Full_Composition_Exclusion (not recommended)
     Decomposition_Type
     FC_NFKC_Closure (deprecated)
     NFC_Quick_Check
     NFKC_Quick_Check
     NFD_Quick_Check
     NFKD_Quick_Check
     Expands_On_NFC (deprecated)
     Expands_On_NFD (deprecated)
     Expands_On_NFKC (deprecated)
     Expands_On_NFKD (deprecated)
     NFKC_Casefold
     Changes_When_NFKC_Casefolded
 Shaping and Rendering
+    Join_Control
     Joining_Group
     Joining_Type
     Vertical_Orientation
     East_Asian_Width
+    Prepended_Concatenation_Mark
 Bidirectional
     Bidi_Class
+    Bidi_Control
     Bidi_Mirrored
     Bidi_Mirroring_Glyph
     Bidi_Paired_Bracket
     Bidi_Paired_Bracket_Type
 Identifiers
     ID_Continue
     ID_Start
     XID_Continue
     XID_Start
+    Pattern_Syntax
+    Pattern_White_Space
 Segmentation
     Line_Break
     Grapheme_Cluster_Break
     Sentence_Break
     Word_Break
 CJK
+    Ideographic
+    Unified_Ideograph
+    Radical
+    IDS_Binary_Operator
+    IDS_Trinary_Operator
     Unicode_Radical_Stroke
     Equivalent_Unified_Ideograph
 Miscellaneous
     Math
+    Quotation_Mark
+    Dash
+    Hyphen (deprecated, stabilized)
+    Sentence_Terminal
+    Terminal_Punctuation
+    Diacritic
+    Extender
     Grapheme_Base
     Grapheme_Extend
     Grapheme_Link (deprecated)
     Unicode_1_Name
     ISO_Comment (deprecated, stabilized)
+    Regional_Indicator
     Indic_Positional_Category
     Indic_Syllabic_Category
 Contributory Properties (not recommended)
+    Other_Alphabetic
+    Other_Default_Ignorable_Code_Point
+    Other_Grapheme_Extend
+    Other_ID_Start
+    Other_ID_Continue
+    Other_Lowercase
+    Other_Math
+    Other_Uppercase
     Jamo_Short_Name
```

beoran commented 3 years ago

Just to chip in: the missing properties would be useful for Go GUI libraries, in particular for implementing bi-directional and complex script rendering. But x/text might be just as good a place to keep them as unicode.

mpvl commented 3 years ago

To add my view: especially if there is not going to be regexp support for properties, it doesn't make sense to add these properties to the set of properties for package unicode.

Many of the "unsupported" properties are already supported in x/text, just not as RangeTables. Some of these tables, such as the Normalization and Bidi related tables, are even included in core. Adding these to Properties would just bloat the unicode package.

The reason why x/text didn't use RangeTables for many of these properties is because such properties are often not useful in isolation. This holds true for Case-, Normalization-, Bidi-, Grapheme-, Identifier-, and I suspect also Emoji-related properties. Folding these properties in a single per-rune/per-topic trie data structure, has proven to give significant performance benefits. The packages cases, norm, bidi, precis, and idna, for instance, all follow this pattern.

I could imagine that a selection of these properties would be useful for regexp, though.

mpvl commented 3 years ago

Note, btw, that the list of unsupported properties includes non-boolean properties (such as EastAsianWidth, included in x/text/unicode/width). These are not conveniently represented as range tables.

gudvinr commented 3 years ago

Note, btw, that the list of unsupported properties includes non-boolean properties

Good point. Here's the list of only unsupported boolean properties:

```
Alphabetic
Default_Ignorable_Code_Point
Uppercase
Lowercase
Cased
Case_Ignorable
Changes_When_Lowercased
Changes_When_Uppercased
Changes_When_Titlecased
Changes_When_Casefolded
Changes_When_Casemapped
Emoji
Emoji_Presentation
Emoji_Modifier
Emoji_Modifier_Base
Emoji_Component
Extended_Pictographic
Changes_When_NFKC_Casefolded
Bidi_Mirrored
ID_Continue
ID_Start
XID_Continue
XID_Start
Math
Grapheme_Base
Grapheme_Extend
```

rsc commented 3 years ago

Based on the discussion above, this proposal seems like a likely decline. — rsc for the proposal review group

gudvinr commented 3 years ago

So, properties can't be added to properties list.
You can't match properties using character classes in regexp and can't add custom character classes either.

What is the recommended way to go then?

beoran commented 3 years ago

Perhaps we could still include these properties in x/text? But I suppose that should be a new issue?

rsc commented 3 years ago

@mpvl has some ideas about how to provide some info in x/text, but that would be a separate package. It would probably still not hook up to regexp.

rsc commented 3 years ago

No change in consensus, so declined. — rsc for the proposal review group