Closed gudvinr closed 3 years ago
cc @robpike
CC @mpvl
I suspect this would be better done in a separate package, probably not in the standard library, as I believe that the data set will grow substantial over time and most programs won't need it.
Duplicate of @ @
That's not the right issue number for the duplicate, or else not the right issue for that duplicate number.
I agree that this should be done outside the stdlib. The Unicode technical reports (UAX, UAS, UTR) make it clear that they are independent specifications, and that conformance to the Unicode standard does not imply conformance to the technical report. It's also unclear if emoji is all we add, and not one of the other reports, like identifiers, script properties, etc.
I will say, however, that the current API for constructing range tables is not very ergonomic. I ended up forgoing efficiency and used functions just so I had access to set union and set difference when combining range tables: https://github.com/smasher164/xid/blob/560c18f776900eb8c8b061d155309097f2f68545/xid.go.
smasher164: The Unicode technical reports (UAX, UAS, UTR) make it clear that they are independent specifications, and that conformance to the Unicode standard does not imply conformance to the technical report
This isn't exactly true. While UTS and UTR indeed aren't required for an implementation, a Unicode Standard Annex (UAX) might be required by the Standard. There's a list of such annexes in Chapter 3 of the Standard.
robpike: I suspect this would be better done in a separate package, probably not in the standard library
smasher164: I agree that this should be done outside the stdlib
As pointed out, UTS isn't part of the Unicode Standard, so I think it is quite reasonable not to put emoji stuff in unicode.
While natural languages do not usually grow over time, emoji data will grow, and forcing people to import data they don't want doesn't look good.
Although I feel like at least range tables (without emoji sequences and other trickery) should be somewhere close to unicode to keep those tables in sync with the rest of the unicode tables. There are a couple of reasons to do so:
x/text/unicode uses a different version scheme and has a separate release schedule. Not to mention it isn't v1 yet.
UnicodeVersion = "12.0.0" while unicode has Version = "13.0.0". I think at some point an external package might just have a different Unicode version, and this is kind of confusing.
golang.org/x/text/unicode mostly contains tools to work with Unicode (except maybe /runenames, which contains, well, rune names), and I only mentioned data for range tables.
I agree, I've often wanted unicode to be versioned separately from the standard library, especially when it came to keeping these properties in sync with newer versions of Unicode. I ended up using build tags to hack around that:
unicodeTestVersion_114.go:
// +build go1.14,!go1.16

package xid

const unicodeTestVersion = "12.1.0"

unicodeTestVersion_116.go:
// +build go1.16

package xid

const unicodeTestVersion = "13.0.0"
A natural place to put these properties would be x/text/unicode/emoji.
This repo already has a runenames package, which could naturally hold emoji names as well (although I'm unsure how this fits with emoji sequences).
The argument for putting it in core so it can be used in regexp is valid. However, this would mean including more tables in core that are currently missing. A more reasonable solution to support emoji in regexp would be to allow user-defined character classes, allowing users to add classes from x/text, for instance.
It should also be mentioned what the goal is of these tables. Depending on the application, rangetables may not be the best representation. Judging from UTS #51, for instance, a UTF-8 trie, which allows associating a set of related properties with a single rune, seems more appropriate. The x/text repo has all the infrastructure in place to generate such tries conveniently.
@smasher164: the x/text repo uses a similar trick. The generators are multi-version aware and will automatically add to/modify build tags for generated tables.
It uses it, however, to ensure that the versions align with the latest Go. Your comment, however, suggests that you would want the other way around: have Go adopt a later version. This gives rise to the idea that core could use tables from x/text directly. Core already uses x/text for various packages and x/text already generates the tables for core. So if core instead used the tables from x/text, x/text could advance the Unicode version ahead of core, while ensuring consistency between packages.
This obviously would require a separate proposal. There are some serious implications for this. Also, there are packages with hardcoded range tables. But all this could be worked around.
I don't want to be "that guy", but which policies decide what gets included where? Earlier it was said that "conformance to the Unicode standard does not imply conformance to the technical report".
Does it mean that something that conforms to the Unicode Standard should be included in the stdlib (or at least be considered worthy of inclusion)?
@mpvl I see that x/text/runenames already contains names for everything from UCD. That includes emojis. UCD essentially is UAX#41. From the document I mentioned earlier it is clear that UAX#41 "is considered part of Version 13.0 of the Unicode Standard", yet it is not included in the stdlib. Same for UAX#15 (normalization) and UAX#9 (bidirectional). But UAX#38 (unihan) is a part of the stdlib unicode package.
While x/text/unicode contains a number of annexes included in the Standard, it uses a different Unicode version, which makes it somewhat incompatible with the stdlib.
Despite UTS being an "independent specification" by definition, the Standard itself clearly mentions that only a few UTS are synchronized with its version: UTS#10 (collation), UTS#39 (security), UTS#46 (idna) and UTS#51 (emoji).
It may be fine for tool packages to be compatible with an earlier version, but it is very frustrating for packages that contain databases. That includes emojis and runenames too.
For example, conformance to Unicode 13 means that it should contain "Khitan Small Script". And unicode indeed does. But not runenames. Also, in the documentation of runenames the link to UCD points to the "latest" version, but the tables are from 12.0.0.
I can't say that Khitan is a popular language, but that made me think that Unicode-related packages are now in a somewhat messy state.
However, this would mean including more tables in core that are currently missing
Since the other tables are probably covered by other Technical Reports, adding a single one doesn't imply that the others should be included too. These are still independent specs.
A more reasonable solution to support emoji in regexp would be to allow user-defined character classes, allowing users to add classes from x/text, for instance.
I agree that there should be a more flexible way to do so. Although I don't think that I would be able to write a proper proposal for that.
It should also be mentioned what the goal is of these tables.
I can't speak for others, but I ended up in a situation where I need to be aware of emojis in text. First, to correctly remove such characters from text or replace them with non-graphical representations (e.g. using names from UCD). Second, to count the number of characters when a single emoji or emoji sequence represents a single "character".
It seemed that range tables and regexp support should be sufficient, and they are already used by the Go library to represent language scripts.
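The first use case above can be sketched with what the standard library already offers, assuming a suitable range table exists. The emoticons table below is a hypothetical, deliberately incomplete stand-in that covers only the Emoticons block; a real Emoji table would be generated from emoji-data.txt. stripRunes is likewise an illustrative name.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// emoticons is a hypothetical stand-in table covering only the Emoticons
// block (U+1F600..U+1F64F). It is NOT the full Emoji property.
var emoticons = &unicode.RangeTable{
	R32: []unicode.Range32{{Lo: 0x1F600, Hi: 0x1F64F, Stride: 1}},
}

// stripRunes drops every rune of s that is a member of table.
func stripRunes(s string, table *unicode.RangeTable) string {
	return strings.Map(func(r rune) rune {
		if unicode.Is(table, r) {
			return -1 // a negative value tells strings.Map to drop the rune
		}
		return r
	}, s)
}

func main() {
	fmt.Printf("%q\n", stripRunes("ok 😀 done", emoticons))
}
```

This handles single code points only. Counting an emoji sequence as one "character" needs sequence-aware segmentation, which a per-rune table cannot express.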
While x/text/unicode contains a number of annexes included in the Standard, it uses a different Unicode version, which makes it somewhat incompatible with the stdlib.
The stdlib tables are generated from x/text. Core even depends on x/text and build tags in x/text ensure that the Unicode version of x/text is matched to that of core. So tip of x/text is ahead in Unicode version compared to core.
For example, conformance to Unicode 13 means that it should contain "Khitan Small Script". And unicode indeed does. But not runenames. Also, in the documentation of runenames the link to UCD points to the "latest" version, but the tables are from 12.0.0.
That seems like a bug in runenames' generate script if true. It should update automatically with a Unicode upgrade. @nigeltao.
@mpvl
A more reasonable solution to support emoji in regexp would be to allow user-defined character classes, allowing users to add classes from x/text, for instance.
I could imagine an API like
func RegisterClass(name string, table *unicode.RangeTable)
in either regexp or regexp/syntax. Or if it needed to be scoped per *regexp.Regexp, an alternative constructor like
func WithClass(name string, table *unicode.RangeTable, expr string) (*Regexp, error)
Either way, this would be a separate proposal.
Your comment, however, suggests that you would want the other way around: have Go adopt a later version.
I could imagine the stdlib being behind the supported version in x/text. That way, for example, someone who wanted to use Unicode 13 functionality on Go 1.15 could simply import x/text/unicode.
@gudvinr
Maybe the way forward here is to define these properties in x/text, and file a proposal for regexp?
@gudvinr
At the same time, x/text/unicode uses different version scheme and has a separate release schedule.
Core Unicode tables are generated from x/text and core even imports x/text for various use cases, like normalization. Also, the x/text tables use build tags to keep these tables in sync. It's a bug for core Unicode packages in x/text to not be updated to the right version.
Theoretically, core could refer to x/text for all its tables, which would allow getting rid of the build tag trick and would allow using newer Unicode versions independently from the Go version. That needs some serious thought and some adjustment to existing packages like strconv IIRC.
@smasher164
I could imagine an API like
Something like that. Passing a function with a signature func(rune) bool instead of a range table makes more sense to me, though. It doesn't always make sense to represent rune properties as a range table (for instance for bidi and, I suspect, emoji).
It's a bug for core Unicode packages in x/text to not be updated to the right version.
I figured out what's wrong. It is not a bug in x/text and not a package issue per se. I suppose the build environment for pkgsite uses some older Go release and picks the older table, which has // +build go1.14,!go1.16. There's no indication of that on pkgsite, and for Go itself it pulls the latest stable release.
And since browsing the local package cache isn't very convenient, I never tried to look there. But after you mentioned the build tag trick I dug through the commit history and that became clear.
Theoretically, core could refer to x/text for all its tables
I personally do not like the idea of pulling v0 packages for use in somewhat stable releases of Go.
However, is it possible to use emojis and their properties as experimental playground first, and based on the results of this experiment make changes to rest of the tables later?
Whether or not you plan on using range tables for these kinds of characters, I suppose it is now decided to put them in a separate package within the x/text repository. This is a good thing in the sense that it makes it possible to also add other emoji-related properties and functionality described in UTS#51 in the future.
Maybe the way forward here is to either define these properties in x/text, and file a proposal for regexp?
I think that makes sense, yes. An API for pluggable character classes for regexp is fine for me and probably covers other use cases too. It would be wise to file a separate proposal and discuss the details of the implementation there.
This proposal has been added to the active column of the proposals project and will now be reviewed at the weekly proposal review meetings. — rsc for the proposal review group
Even if we added these to unicode.Properties, regexp only does Categories and Scripts.
And these emoji properties are properties, not categories or scripts.
Do you need emoji things in regexp, or was that just brought up for completeness?
Do you need emoji things in regexp
I think I do not need regexp support. At least for me regexp isn't a top priority.
If emoji support eventually lands in either unicode or x/text and is at least as convenient to use as properties for regexp, then I can live with it.
But it's not a simple question, to be honest. In a short time span I had to solve multiple unrelated problems with emojis. I feel that some of them can be solved easier using some sort of property handles in regexp.
regexp only does Categories and Scripts
Is there any reason for that? I found that when I looked through the regexp sources, but it's not clear why \p{Dash}, \p{Hyphen} and such are ignored. If I'm not wrong, the ICU library doesn't have such limitations, for example.
I do not imply that "if some other %thing% does then Go should too", though.
I don't remember why I left Property out. Possibly it just seemed like too much for too little benefit. Category and Script are more clearly useful.
As an anecdote, the python regex package I used to test my identifier validating library does support properties. I suppose if the regexp package supports user-definable properties, it wouldn't have the burden of adding them all.
As long as regexp is not a requirement, then adding these to unicode.Properties probably makes sense. The thing I don't know is what else is missing from unicode.Properties. Can someone cross-check against the full Unicode property list and see what else is missing besides these emoji properties?
Can someone cross-check against the full Unicode property list and see what else is missing besides these emoji properties?
I took a look at UAX#44 and marked with + what's in Properties since it's not much:
This is a copy of @gudvinr's answer above, with missing properties highlighted.
Just to chip in: the missing properties would be useful for Go GUI libraries, in particular for implementing bi-directional and complex script rendering. But x/text might be just as well a place to keep them as unicode/ for them.
To add my view: especially if there is not going to be regexp support for properties, it doesn't make sense to add these properties to the set of properties for package unicode.
Many of the "unsupported" properties are already supported in x/text, just not as RangeTables. Some of these tables, such as Normalization and Bidi related tables, are even included in core. Adding these to Properties would just bloat the unicode package.
The reason why x/text didn't use RangeTables for many of these properties is because such properties are often not useful in isolation. This holds true for Case-, Normalization-, Bidi-, Grapheme-, Identifier-, and I suspect also Emoji-related properties. Folding these properties into a single per-rune/per-topic trie data structure has proven to give significant performance benefits. The packages cases, norm, bidi, precis, and idna, for instance, all follow this pattern.
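As a rough illustration of folding related properties into one per-rune value, the sketch below packs several boolean emoji properties into a bit set, so a single lookup answers all of them. This is an assumption-laden toy: the map stands in for the generated trie that the x/text packages actually use, the type and constant names are hypothetical, and the three entries are a hand-picked excerpt rather than a full data set.

```go
package main

import "fmt"

// props packs several boolean emoji properties into one value per rune.
type props uint8

const (
	emoji props = 1 << iota
	emojiPresentation
	emojiModifier
	emojiModifierBase
	emojiComponent
)

// table is a tiny hand-written excerpt. A plain map stands in for the
// generated trie; only the "one lookup, many properties" idea carries over.
var table = map[rune]props{
	'#':     emoji | emojiComponent,                                     // NUMBER SIGN
	0x1F3FB: emoji | emojiPresentation | emojiModifier | emojiComponent, // skin tone modifier
	0x1F44D: emoji | emojiPresentation | emojiModifierBase,              // THUMBS UP SIGN
}

// lookup returns the packed properties of r, or 0 if r has none.
func lookup(r rune) props { return table[r] }

// has reports whether all bits of q are set in p.
func (p props) has(q props) bool { return p&q == q }

func main() {
	p := lookup(0x1F44D)
	fmt.Println(p.has(emoji), p.has(emojiModifierBase), p.has(emojiModifier))
}
```

With separate range tables the same three questions would cost three independent searches per rune.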
I could imagine that a selection of these properties would be useful for regexp, though.
Note, btw, that the list of unsupported properties includes non-boolean properties (such as EastAsianWidth, included in x/text/unicode/width). These are not conveniently represented as range tables.
Note, btw, that the list of unsupported properties includes non-boolean properties
Good point. Here's the list of only unsupported boolean properties:
Based on the discussion above, this proposal seems like a likely decline. — rsc for the proposal review group
So, properties can't be added to the properties list.
You can't match properties using character classes in regexp and can't add custom character classes either.
What is the recommended way to go then?
Perhaps we could still include these properties in x/text? But I suppose that should be a new issue?
@mpvl has some ideas about how to provide some info in x/text, but that would be a separate package. It would probably still not hook up to regexp.
What version of Go are you using (go version)?
What do you propose?
There are a number of range tables in the unicode package of the stdlib which define some of the character properties from the Unicode Character Database.
Unicode also has additional sets of properties besides the ones defined in the core standard. These properties are described in technical reports.
Notably, UTS#51 defines sets of properties to determine which Unicode characters are emojis:
Emoji property - These characters are recommended for use as emoji
Extended_Pictographic property - These characters are pictographic
Emoji_Component property - These characters are used in emoji sequences
Emoji_Presentation property - A character that, by default, should appear with an emoji presentation
Emoji_Modifier - A character that can be used to modify the appearance of a preceding emoji
Emoji_Modifier_Base - A character whose appearance can be modified by a subsequent emoji modifier
Character property Regional_Indicator is already present in the unicode package.
Data source
At the time of writing, go1.16 contains range tables from Unicode 13.0.0.
Thus, properties for emoji data also should be taken from UCD emoji data 13.0.0.
Package changes
New RangeTable variables (order follows emoji-data.txt):
Emoji = _Emoji
Emoji_Presentation = _Emoji_Presentation
Emoji_Modifier = _Emoji_Modifier
Emoji_Modifier_Base = _Emoji_Modifier_Base
Emoji_Component = _Emoji_Component
Extended_Pictographic = _Extended_Pictographic
Inclusion of functions for checking character properties like IsEmoji, IsEmojiModifier, IsEmojiModifierBase, etc. doesn't make a lot of sense since there's already the unicode.In function.
However, some kind of function in the form of IsEmojiData that checks range tables for all emoji-related properties might be useful to e.g. filter out all emoji components from text.
To make these properties usable in the regexp package, their names (or corresponding abbreviations) should be included in Categories or Scripts.
Additional notes
Although UTS#51 defines emoji sequences, this issue does not cover this topic since an emoji sequence consists of multiple characters and the unicode package doesn't have a concept of "character sequence".
Examples in other languages