emoji-regex/text thinks "1" is an emoji

astoilkov commented 6 years ago

As suggested here, I started using emoji-regex/text to detect all emojis. However, with emoji-regex/text the regular expression starts treating numbers as emojis as well.

astoilkov commented 6 years ago

It also fails with special characters like #

astoilkov commented 6 years ago

@gilmoreorless Do you have any suggestions or ideas?

mathiasbynens commented 6 years ago

This is not a bug. # and 0-9 are Emoji characters with a text representation by default, per the Unicode Standard.
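The classification described above can be checked with Unicode property escapes in any engine that supports them (ES2018 and later). This is only an illustrative sketch and not part of emoji-regex; the helper names `isEmojiChar` and `hasEmojiPresentation` are made up for this example.

```javascript
// \p{Emoji} follows the Unicode Emoji property, which includes the keycap bases.
const isEmojiChar = (ch) => /^\p{Emoji}$/u.test(ch);
// \p{Emoji_Presentation} covers only characters that default to emoji-style rendering.
const hasEmojiPresentation = (ch) => /^\p{Emoji_Presentation}$/u.test(ch);

// Print both properties for a few sample characters (U+1F600 is a grinning face).
for (const ch of ['1', '#', '*', 'a', '\u{1F600}']) {
  console.log(JSON.stringify(ch), isEmojiChar(ch), hasEmojiPresentation(ch));
}
```

The digits, # and * have Emoji=Yes but not Emoji_Presentation=Yes, which is the "text representation by default" part.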

astoilkov commented 6 years ago

First, I want to thank you for the amazing library. We have been using it for some time and it helps us a lot. Keep up the good work.

I ask this without any negative feelings: don't you think the library should support the characters that people, rather than a specification, consider emojis? For example, a person would say that 🗡 is an emoji, but not # or a number. In our case we would have to hardcode some additional rules to fix this behavior, as we can't tell our users that # and 0-9 are emojis.

We will probably find a way to work around such issues by adding an extra layer on top of emoji-regex. However, I wanted to share my thinking in order to help the library become better and as well known as it deserves.

mathiasbynens commented 6 years ago

@astoilkov What people consider to be an emoji depends on their operating system and the fonts they have installed. It’s impossible to create a static regular expression that takes the user’s environment into consideration.

So this project does the next best thing: it uses the Unicode Standard as the single source of truth. Whenever implementations (e.g. emoji on macOS) deviate from the standard (e.g. #28), there will always be a mismatch between what is matched and what you’d expect based on the OS behavior. There is no way around this.

We could apply the hacky workaround from https://github.com/mathiasbynens/emoji-regex/issues/28#issuecomment-323044429 to emoji-regex, and it would make such mismatches less common at the cost of being less technically correct — but it would still not fully solve the problem.

astoilkov commented 6 years ago

@mathiasbynens Thanks for the lengthy explanation. I now understand the problem in more detail. I think we can close this issue.

If I were in your position, I would probably create another file (like text.js) that captures such scenarios without following the specification, and then describe that in the README. This way you could fine-tune it little by little.

gilmoreorless commented 6 years ago

Given the generally-fluid answer to "what exactly is an emoji?" (official answer: it depends), I think following the spec was the only sensible course of action for this project.

My frustration has been with tr51 defining the keycap base characters (0-9, * and #) as having the Emoji=Yes property. I understand why they did it, not least because it makes defining the formal grammar much easier and more consistent. That doesn't stop me being frustrated about it though, since even with the U+FE0F presentation selector, no system displays those characters as "colorful and perhaps whimsical shapes".

@mathiasbynens I wonder if it would be worth creating a separate "loose" regex for this sort of use case. I'm thinking of a version of the text regex which excludes any standalone characters with the property Emoji_Component=Yes. Specifically that would mean these characters (from the 11.0 emoji-data.txt):

0023          ; Emoji_Component      #  1.1  [1] (#️)       number sign
002A          ; Emoji_Component      #  1.1  [1] (*️)       asterisk
0030..0039    ; Emoji_Component      #  1.1 [10] (0️..9️)    digit zero..digit nine
200D          ; Emoji_Component      #  1.1  [1] (‍)        zero width joiner
20E3          ; Emoji_Component      #  3.0  [1] (⃣)       combining enclosing keycap
FE0F          ; Emoji_Component      #  3.2  [1] ()        VARIATION SELECTOR-16
1F1E6..1F1FF  ; Emoji_Component      #  6.0 [26] (🇦..🇿)    regional indicator symbol letter a..regional indicator symbol letter z
1F3FB..1F3FF  ; Emoji_Component      #  8.0  [5] (🏻..🏿)    light skin tone..dark skin tone
1F9B0..1F9B3  ; Emoji_Component      # 11.0  [4] (🦰..🦳)    red-haired..white-haired
E0020..E007F  ; Emoji_Component      #  3.1 [96] (󠀠..󠁿)      tag space..cancel tag

Those characters would still be correctly matched in their respective sequences.
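A rough sketch of that filtering idea, assuming an engine with Unicode property escapes. The `strictish` pattern is a deliberately tiny stand-in and NOT the real emoji-regex/text pattern, and `findLooseEmoji` is a hypothetical helper name.

```javascript
// Stand-in for a spec-following matcher (not the real emoji-regex/text pattern):
// keycap sequences first, then any single Emoji character with an optional U+FE0F.
const strictish = /[#*0-9]\uFE0F?\u20E3|\p{Emoji}\uFE0F?/gu;

// A match is a "standalone component" if it is exactly one Emoji_Component code point.
const isStandaloneComponent = (match) => /^\p{Emoji_Component}$/u.test(match);

// Keep every match except the standalone components.
const findLooseEmoji = (text) =>
  (text.match(strictish) || []).filter((m) => !isStandaloneComponent(m));

// The bare "1" is dropped; the keycap sequence (2 + U+FE0F + U+20E3)
// and the dagger (U+1F5E1) are kept.
const found = findLooseEmoji('Call 1 or 2\uFE0F\u20E3 \u{1F5E1}');
console.log(found.length);
```

Filtering after matching keeps the original pattern untouched, at the cost of a second pass over the matches.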

Additionally, a "loose" regex could define the flag sequences as just [\u{1F1E6}-\u{1F1FF}]{2} (or even \p{Regional_Indicator}{2}), which would cut down the regex size at the cost of potentially matching invalid sequences.

I haven't actually tested this idea, mainly just thinking out loud.

(Edit: After looking at the proposed changes for the 11.0 spec, it seems that the new Extended_Pictographic=Yes property covers my use case rather neatly. "The Extended_Pictographic characters contain all the Emoji characters except for some Emoji_Components.")
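To make the "loose" idea more concrete, here is an untested-in-production sketch built on Extended_Pictographic and Regional_Indicator. The name `looseEmoji` and the exact shape of the pattern are assumptions for illustration; it is not taken from emoji-regex and will accept some sequences that are not valid emoji.

```javascript
// Hypothetical "loose" matcher, assembled from three alternatives.
const looseEmoji = new RegExp(
  [
    // Any two regional indicators, whether or not they form a valid flag.
    '\\p{Regional_Indicator}{2}',
    // Keycap sequences such as 1 + U+FE0F + U+20E3.
    '[#*0-9]\\uFE0F?\\u20E3',
    // Pictographs, optionally followed by U+FE0F or a skin tone,
    // and optionally chained with zero width joiners (U+200D).
    '\\p{Extended_Pictographic}(?:\\uFE0F|\\p{Emoji_Modifier})?' +
      '(?:\\u200D\\p{Extended_Pictographic}(?:\\uFE0F|\\p{Emoji_Modifier})?)*',
  ].join('|'),
  'gu'
);

// Sample text: bare "1" and "#", a dagger, a two-letter flag,
// keycap 1, and a thumbs-up with a skin tone modifier.
const sampleText =
  'Room 1 # \u{1F5E1} \u{1F1E7}\u{1F1EC} 1\uFE0F\u20E3 \u{1F44D}\u{1F3FD}';
const looseMatches = sampleText.match(looseEmoji) || [];
console.log(looseMatches.length);
```

The bare "1" and "#" are skipped because they are not Extended_Pictographic, while the keycap sequence still matches through its own alternative.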

josephrocca commented 4 years ago

Perhaps the readme could be updated to warn people about the counter-intuitive parts of this module? Just something like "watch out for these weird things about the unicode spec: ..."

In any case, here are all the symbols that emoji-regex/text misses (whether on purpose or not), in case there are any here which it is supposed to match:

⚲⚨⚮⚭⚥⚬⚢⚤⚯⚘⚦⚚⚩⚣⚐⚍⚎⚊⚌⚏⚋⚑⚇⚄♶♽♸☖♼⚉⚃⚆⚂♷♳♺⚈⚁♴⚀♹☗⚅♲♵☙♱♰☟☬♖✐♩☜♆☱☞♘☴♬☾☤♃☇☏☥♪♇☛☌☧★♚♞☒♯♜☚☋♄☶♧❦☼♗☽☍♁☡☷☰♫☲☭♙♭♕♔☐☓♛☨☳☻♅♤☵☩☊♡☈☫❧♮✎❥☉♢♝ ‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍󠀯󠁃󠁀󠀠󠁈󠁂󠁌󠁕󠀮󠁄󠁾󠀧󠀡󠀾󠁋󠁗󠁍󠁚󠁎️󠁉󠁖󠀥󠁽󠀿󠁓󠁁󠀻󠁊󠀭󠁏󠁠󠁟⃣󠁇󠀽󠀦󠁅󠀼‍󠁝󠀪󠀨󠁻󠁒󠁜󠁞󠁑󠀩󠁙󠁆󠀤󠁼󠀺󠁘󠀫󠀢󠀣󠁐󠁔󠀬󠁛‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍󠁧󠁢󠁥󠁮󠁧󠁿󠁧󠁢󠁳󠁣󠁴󠁿󠁧󠁢󠁷󠁬󠁳󠁿󠀷󠁳󠁢󠁵󠁷󠁬󠁥󠀲󠁹󠁤󠀶󠁴󠀵󠁨󠁲󠁮󠁰󠁱󠀰󠁡󠀳󠁩󠁯󠁭󠀹󠁸󠁧󠁣󠁺󠀴󠁶󠁫󠁿󠀸󠁪󠁦󠀱‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍⃣⃣️⃣‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍₿🕬🗔🗫🗮🗉🗠🖢🗀🗪🖈🕈🗬🖀🖗🛆🖟🗲🕫🖯🕇🕪🗅🖰🌢🗩🗴🔿🗰🕱🎘🗶🖽🗤🖿🖻🗕🕼🛨🎝🔾🖘🖠🖎🕩🖫🖬🗘🖸🛦🖡🖜🖷🛉🏲🛱🕨🗁🗈🗌🗢🖳🎕🕅🗗🗚🗱🗋🖞🕭🗙🗹🖵🗐🛪🖏🖙🗧🎔🌣🖉🖹🗦🖧🖛🖪🛧🖚🖮🖆🗸🖦🎜🗇🛈🗵🖃🖾🛇🖺🖓🛊🕻🏱🕄🕾🖄🖝🖒🕲🗆🏶🖅🗍🗟🗖🗛🖩🕽🖴🕿🖂🗥🖑📾🕮🖣🛲🖶🗎🖔🗊🕆🗷🗭🖭🗏🖁🕀🕂🕁🕃⛥⛢⛤⛦⛧⛻⛾⛚⛆⛙⛕⚿⛒⛉⛊⛫⛘⛛⛖⛮⛬⛨⚞⛿⛜⛗⛣⛋⛝⛟⛐⛯⛼⛌⛶⛍⛡⛠⛞⛇⛭⚟🀦🀜🀓⚴🀚⚶🀩🀝🀆🀐🀋🀨🀉🀀🀂🀖🀅🀗🀢⛂⛃🀊🀠🀤⚼🀛🀑🀈⚝🀔🀎⚻🀡⛁🀫⚹🀕🀘🀙⚸🀏🀣⛀⚷🀪🀍⚵🀒⚺🀞߷🀃⚳🀇🀥🀧🀌🀁🀟

I made a module that matches these and also incorporates @gilmoreorless's variation selector fix: https://github.com/josephrocca/emoji-and-symbol-regex

ChurchTao commented 4 years ago

@gilmoreorless I had the same idea as in your comment above about a separate "loose" matcher that excludes standalone Emoji_Component characters, so I made a non-regex module based on https://www.unicode.org/Public/emoji/13.0/emoji-test.txt.

https://github.com/ChurchTao/emoji-js

dezren39 commented 3 years ago

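import _emojiRegex from 'emoji-regex/es2015/text.js';
// Remove "#", "*" and 0-9 from the generated pattern, then also match the leftover keycap marks.
// .source is used instead of toString() so the slashes and flags are not embedded in the new pattern.
const emojiRegex = () => new RegExp('(' + _emojiRegex().source.replace(/#\\\*0-9/gu, '') + '|\uFE0F\u20E3|\uFE0F|\u20E3)', 'gu');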

I did this. It doesn't count 0-9, #, or *. The part at the end removes the enclosing boxes from actual number emoji but keeps the numbers, which is what I wanted in my case. I'm fairly sure the '|\uFE0F\u20E3|\uFE0F' part is unneeded and just '|\u20E3' would be sufficient. There are better ways to solve this, and I thought of more complex ones, but this is one extra line without making a whole new package.
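One detail worth watching when patching a pattern as text: `RegExp.prototype.toString()` returns the pattern wrapped in slashes with the flags appended, while `.source` returns only the pattern. The sketch below uses a tiny hypothetical stand-in for the generated pattern, since the real one is far larger and its exact text may differ between versions.

```javascript
// Tiny stand-in for the generated pattern (hypothetical; the real one is far larger).
const original = /[#\*0-9\xA9\xAE]/gu;

// toString() wraps the pattern in slashes and appends the flags ...
const asString = original.toString();
// ... while .source is just the pattern text, which is what new RegExp() expects.
const asSource = original.source;
console.log(asString, asSource);

// Remove the "#\*0-9" run from the character class, then rebuild the regex.
const patched = new RegExp(asSource.replace(/#\\\*0-9/u, ''), 'u');
console.log(patched.test('1'), patched.test('\xA9'));
```

Because the replacement depends on the literal text `#\*0-9` appearing in the pattern, it can silently stop working if the generated pattern changes.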

Open to suggestions for a better way to handle this. :+1: In my case I also added the 'non-emoji' symbols that are basically emoji, but that is secondary to this number issue.

While researching, I also found: https://github.com/tonton-pixel/emoji-patterns This package splits each category into its own pattern and provides two larger patterns that join the categories together. Anyone who needs a more nuanced take could try something like this, which may be useful in some cases.

Though I believe the real solution is for TC39 to accept something like this (currently at the proposal stage): https://mths.be/emoji
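For reference, engine-level support along these lines now exists in newer JavaScript engines as the `\p{RGI_Emoji}` property of strings under the RegExp `v` flag. Whether that is exactly what the linked proposal turned into is my assumption. The sketch below feature-detects it, and `makeRgiEmojiMatcher` is a hypothetical helper name.

```javascript
// Returns a matcher for a single RGI emoji, or null when the engine lacks the `v` flag.
function makeRgiEmojiMatcher() {
  try {
    return new RegExp('^\\p{RGI_Emoji}$', 'v');
  } catch {
    return null;
  }
}

const rgiEmoji = makeRgiEmojiMatcher();
if (rgiEmoji) {
  // A bare digit is not an RGI emoji, but the keycap sequence built on it is.
  console.log(rgiEmoji.test('1'), rgiEmoji.test('1\uFE0F\u20E3'));
} else {
  console.log('RegExp `v` flag is not supported in this engine');
}
```

With this property the engine tracks the Unicode data itself, so no generated pattern needs to be shipped or patched.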

mathiasbynens commented 3 years ago

Is there anything left to do to resolve this issue? I'm closing it for now. If anyone wants to suggest a README improvement that calls out some of the Unicode weirdness we've discussed, please send a PR!

say8425 commented 3 years ago

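import * as emojiPatterns from 'emoji-patterns';

// Drop the keycap bases (#, *, 0-9) and the regional indicators from the Emoji_All pattern.
const emojiRegex = new RegExp(emojiPatterns['Emoji_All'].replace(/\\u0023\\u002A\\u0030-\\u0039|\\u{1F1E6}-\\u{1F1FF}/gi, ''), 'gu');
emojiRegex.test(value); // `value` is the string to check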

In the end, I went with the emoji-patterns package.
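A general caveat when calling `.test()` on a regex created with the `g` flag: the regex remembers `lastIndex` between calls, so reusing one instance across strings can give surprising results. The sketch below uses `\p{Extended_Pictographic}` as a stand-in pattern, and `containsEmoji` is a hypothetical helper name.

```javascript
// Stand-in pattern; any regex created with the `g` flag behaves the same way.
const globalRegex = /\p{Extended_Pictographic}/gu;
const sample = '\u{1F600}';

const firstCall = globalRegex.test(sample);  // matches at index 0, lastIndex moves past it
const secondCall = globalRegex.test(sample); // resumes from lastIndex, finds nothing
console.log(firstCall, secondCall);

// Without `g`, every call starts from the beginning of the string.
const containsEmoji = (value) => /\p{Extended_Pictographic}/u.test(value);
console.log(containsEmoji(sample), containsEmoji(sample), containsEmoji('1'));
```

For a plain yes/no check, dropping the `g` flag (or resetting `lastIndex` to 0 before each call) avoids the problem.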

mathiasbynens / emoji-regex

emoji-regex/text thinks "1" is an emoji #33