Closed astoilkov closed 3 years ago
It also fails with special characters like #
@gilmoreorless Do you have any suggestions or ideas?
This is not a bug. # and 0-9 are Emoji characters with a text representation by default, per the Unicode Standard.
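To illustrate, the Unicode property escapes built into JavaScript regular expressions draw on the same Unicode data. This is only a minimal sketch, assuming a runtime with ES2018 property escapes (a recent Node.js, for example). It does not use emoji-regex itself, and the helper names are made up for the example:

```javascript
// Emoji=Yes includes the keycap base characters, so they match \p{Emoji}.
const isEmoji = (ch) => /^\p{Emoji}$/u.test(ch);
// Emoji_Presentation=Yes marks characters that are displayed as emoji by default.
const isEmojiByDefault = (ch) => /^\p{Emoji_Presentation}$/u.test(ch);

const samples = ['#', '7', '\u{1F600}']; // number sign, digit seven, grinning face
const report = samples.map((ch) => ({
  ch,
  emoji: isEmoji(ch),
  emojiByDefault: isEmojiByDefault(ch),
}));
console.log(report);
```

The number sign and the digit count as Emoji but not as emoji-by-default, which is the distinction the comment above is pointing at.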
First, I want to thank you for the amazing library. We have been using it for some time and it helps us a lot. Keep up the good work.
I'm asking this without any negative feelings: don't you think the library should support the characters that humans call emojis, rather than the ones a specification says are emojis? For example, a person would say this is an emoji ๐ก but not this # or numbers. In our case we have to hardcode some additional rules to fix this behavior, as we can't tell our users that # and 0-9 are emojis.
We will probably find a way to work around such issues by adding an extra layer on top of emoji-regex. However, I just wanted to share what I'm thinking, to help the library become better and as well known as it deserves to be.
@astoilkov What people consider to be an emoji depends on their operating system and the fonts they have installed. It's impossible to create a static regular expression that takes the user's environment into consideration.
So this project does the next best thing: it uses the Unicode Standard as the single source of truth. Whenever implementations (e.g. emoji on macOS) deviate from the standard (e.g. #28), there will always be a mismatch between what is matched and what you'd expect based on the OS behavior. There is no way around this.
We could apply the hacky workaround from https://github.com/mathiasbynens/emoji-regex/issues/28#issuecomment-323044429 to emoji-regex, and it would make such mismatches less common at the cost of being less technically correct, but it would still not fully solve the problem.
@mathiasbynens Thanks for the lengthy explanation. I now understand the problem in more detail. I think we can close this issue.
If I were in your position, I would probably create another file (like text.js) that captures such scenarios without following the specification, and then describe that in the readme. This way you could fine-tune it little by little.
Given the generally-fluid answer to "what exactly is an emoji?" (official answer: it depends), I think following the spec was the only sensible course of action for this project.
My frustration has been with tr51 defining the keycap base characters (0-9, * and #) as having the Emoji=Yes property. I understand why they did it, not least because it makes defining the formal grammar much easier and more consistent. That doesn't stop me being frustrated about it though, since even with the U+FE0F presentation selector, no system displays those characters as "colorful and perhaps whimsical shapes".
@mathiasbynens I wonder if it would be worth creating a separate "loose" regex for this sort of use case. I'm thinking of a version of the text regex which excludes any standalone characters with the property Emoji_Component=Yes. Specifically that would mean these characters (from the 11.0 emoji-data.txt):
0023 ; Emoji_Component # 1.1 [1] (#️) number sign
002A ; Emoji_Component # 1.1 [1] (*️) asterisk
0030..0039 ; Emoji_Component # 1.1 [10] (0️..9️) digit zero..digit nine
200D ; Emoji_Component # 1.1 [1] (‍) zero width joiner
20E3 ; Emoji_Component # 3.0 [1] (⃣) combining enclosing keycap
FE0F ; Emoji_Component # 3.2 [1] () VARIATION SELECTOR-16
1F1E6..1F1FF ; Emoji_Component # 6.0 [26] (🇦..🇿) regional indicator symbol letter a..regional indicator symbol letter z
1F3FB..1F3FF ; Emoji_Component # 8.0 [5] (🏻..🏿) light skin tone..dark skin tone
1F9B0..1F9B3 ; Emoji_Component # 11.0 [4] (🦰..🦳) red-haired..white-haired
E0020..E007F ; Emoji_Component # 3.1 [96] (󠀠..󠁿) tag space..cancel tag
Those characters would still be correctly matched in their respective sequences.
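A rough sketch of that idea, using built-in property escapes rather than the generated emoji-regex pattern. The name looseSketch is hypothetical, and the pattern is deliberately incomplete: it handles keycap sequences only and ignores flags, skin tone modifiers, and ZWJ sequences.

```javascript
// Keycap sequences (base + optional U+FE0F + U+20E3) are matched as a whole.
// Any other Emoji character is matched only if it is not an Emoji_Component.
const looseSketch = /[#*0-9]\uFE0F?\u20E3|(?!\p{Emoji_Component})\p{Emoji}/gu;

// Plain "5" and "#", then keycap five, then grinning face.
const input = 'Room 5 # 5\uFE0F\u20E3 \u{1F600}';
const found = input.match(looseSketch);

// Print each match as a list of hex code points.
console.log(found.map((m) => [...m].map((c) => c.codePointAt(0).toString(16))));
```

The standalone digit and number sign are skipped, while the keycap sequence and the face are still matched.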
Additionally, a "loose" regex could define the flag sequences as just [\u{1F1E6}-\u{1F1FF}]{2} (or even \p{Regional_Indicator}{2}), which would cut down the regex size at the cost of potentially matching invalid sequences.
I haven't actually tested this idea, mainly just thinking out loud.
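For what it's worth, the trade-off with the loose flag pattern can be seen in a few lines. This is a sketch assuming a runtime with ES2018 property escapes. The variable names are illustrative only.

```javascript
const looseFlag = /\p{Regional_Indicator}{2}/gu;
const looseFlagRange = /[\u{1F1E6}-\u{1F1FF}]{2}/gu;

const us = '\u{1F1FA}\u{1F1F8}'; // regional indicators U + S (a real flag)
const xx = '\u{1F1FD}\u{1F1FD}'; // regional indicators X + X (not an RGI flag)

for (const pair of [us, xx]) {
  // Both patterns find exactly one "flag" in each string, valid or not.
  console.log(pair.match(looseFlag).length, pair.match(looseFlagRange).length);
}
```

Both spellings accept the X + X pair, which is the "potentially matching invalid sequences" cost mentioned above.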
(Edit: After looking at the proposed changes for the 11.0 spec, it seems that the new Extended_Pictographic=Yes property covers my use case rather neatly. "The Extended_Pictographic characters contain all the Emoji characters except for some Emoji_Components.")
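A quick way to check that claim is the matching property escape in JavaScript. This is a sketch that assumes an engine new enough to support \p{Extended_Pictographic}, which arrived later than the other emoji properties.

```javascript
// Non-global regex, so repeated .test() calls do not share state.
const pictographic = /\p{Extended_Pictographic}/u;

const checks = {
  digit: pictographic.test('5'),               // keycap base, an Emoji_Component
  numberSign: pictographic.test('#'),          // keycap base, an Emoji_Component
  grinningFace: pictographic.test('\u{1F600}'),
  copyright: pictographic.test('\u00A9'),      // text-default symbol, still included
};
console.log(checks);
```

Digits and the number sign drop out, but text-default symbols such as the copyright sign are still matched, so this property alone does not settle what users would call an emoji.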
Perhaps the readme could be updated to warn people about the counter-intuitive parts of this module? Just something like "watch out for these weird things about the unicode spec: ..."
In any case, here are all the symbols that emoji-regex/text misses (whether on purpose or not), in case there are any here which it is supposed to match:
[symbol list not recoverable: the characters were corrupted by an encoding error]
I made a module that matches these and also incorporates @gilmoreorless's variation selector fix: https://github.com/josephrocca/emoji-and-symbol-regex
I had the same idea as the "loose" regex suggestion above, so I made a module that is not regex-based, built from https://www.unicode.org/Public/emoji/13.0/emoji-test.txt.
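import _emojiRegex from 'emoji-regex/es2015/text.js';
// Use .source rather than .toString(): toString() would embed the "/" delimiters and the flags in the new pattern.
const emojiRegex = () => new RegExp('(' + _emojiRegex().source.replace(/#\\\*0-9/gu, '') + '|\uFE0F\u20E3|\uFE0F|\u20E3)', 'gu');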
This is what I did. It no longer counts 0-9, #, or *. The part at the end strips the enclosing boxes from actual number emoji but keeps the numbers, which is what I wanted for my circumstance. I'm pretty sure '|\uFE0F\u20E3|\uFE0F' may be unneeded and just '|\u20E3' would be sufficient. There are better ways to solve this, and I thought of more complex ones, but this is one extra line without making a whole new package.
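One detail worth keeping in mind when rebuilding a pattern this way: RegExp.prototype.toString() returns the delimiters and flags along with the pattern, while .source returns only the pattern body. The sketch below uses a tiny hypothetical stand-in for the real (much larger) emoji-regex/text pattern, so it only shows the mechanics:

```javascript
// Hypothetical stand-in for the pattern returned by emoji-regex/text.
const base = /[#\*0-9\xA9\xAE]|\u{1F600}/gu;

const viaToString = base.toString(); // pattern wrapped in "/" ... "/gu"
const viaSource = base.source;       // pattern body only

// Drop the keycap base characters from the character class, then rebuild.
const stripped = new RegExp(viaSource.replace('#\\*0-9', ''), 'gu');

console.log(viaToString);
console.log(stripped.source);
console.log('5 # \u00A9 \u{1F600}'.match(stripped)); // digit and "#" are no longer matched
```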
Open to suggestions for a better way to handle this. :+1: In my case I also added the 'non-emoji' symbols that are basically emoji, but that is secondary to this number issue.
While researching, I also found https://github.com/tonton-pixel/emoji-patterns. This package splits each category into its own pattern and provides two larger patterns that join the categories together. If you need a more nuanced take, something like this may be useful.
Though, I believe the real solution is for TC39 to accept something like this (currently at proposal): https://mths.be/emoji
Is there anything left to do to resolve this issue? I'm closing it for now. If anyone wants to suggest a README improvement that calls out some of the Unicode weirdness we've discussed, please send a PR!
In the end, I went with the emoji-patterns package:
import * as emojiPatterns from 'emoji-patterns';
// Strip the keycap base characters (#, *, 0-9) and the regional indicators from the pattern.
const emojiRegex = new RegExp(emojiPatterns['Emoji_All'].replace(/\\u0023\\u002A\\u0030-\\u0039|\\u{1F1E6}-\\u{1F1FF}/gi, ''), 'gu');
emojiRegex.test(value);
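A caveat for anyone copying the snippet above: a regex created with the g flag keeps state in lastIndex, so calling .test() repeatedly on the same object can give alternating results. The sketch below shows this with a built-in property escape instead of the emoji-patterns pattern. The variable names are illustrative.

```javascript
const globalRe = /\p{Emoji_Presentation}/gu;
const value = '\u{1F600}'; // grinning face

const first = globalRe.test(value);  // finds the match, lastIndex moves past it
const second = globalRe.test(value); // resumes from lastIndex, finds nothing, resets

// Without the g flag, .test() always starts from the beginning.
const stateless = /\p{Emoji_Presentation}/u;
const repeated = [stateless.test(value), stateless.test(value)];

console.log(first, second, repeated);
```

For a plain yes/no check, dropping the g flag (or creating a fresh regex per call) avoids the surprise.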
As suggested here, I started using emoji-regex/text to detect all emojis. However, when using emoji-regex/text, the regular expression starts failing by treating numbers as emojis as well.