mathiasbynens / emoji-regex

A regular expression to match all Emoji-only symbols as per the Unicode Standard.
https://mths.be/emoji-regex
MIT License

emoji-regex/text thinks "1" is an emoji #33

Closed astoilkov closed 3 years ago

astoilkov commented 6 years ago

As suggested here, I started using emoji-regex/text to detect all emojis. However, with emoji-regex/text the regular expression starts failing: it treats numbers as emojis as well.

astoilkov commented 6 years ago

It also fails with special characters like #

astoilkov commented 6 years ago

@gilmoreorless Do you have any suggestions or ideas?

mathiasbynens commented 6 years ago

This is not a bug. # and 0-9 are Emoji characters with a text representation by default, per the Unicode Standard.
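
A quick way to check this claim is to query the Unicode properties directly. The sketch below assumes a JavaScript engine with Unicode property escapes (ES2018 or later) and uses the engine's built-in Unicode data rather than emoji-regex itself; the helper names `isEmoji` and `hasEmojiPresentation` are made up for illustration.

```javascript
// Sketch: ask the engine whether a single character has the Emoji property
// and whether it defaults to emoji (rather than text) presentation.
const isEmoji = (ch) => /^\p{Emoji}$/u.test(ch);
const hasEmojiPresentation = (ch) => /^\p{Emoji_Presentation}$/u.test(ch);

// '1', '#' and '*' are keycap bases; U+1F600 is the grinning face.
for (const ch of ['1', '#', '*', '\u{1F600}']) {
  console.log(JSON.stringify(ch), isEmoji(ch), hasEmojiPresentation(ch));
}
```

The keycap bases report Emoji=Yes but Emoji_Presentation=No, which is the "text representation by default" behavior described above.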

astoilkov commented 6 years ago

First, I want to thank you for the amazing library. We have been using it for some time and it helps us a lot. Keep up the good work.

I ask this without any negative feelings: don't you think the library should support the characters that humans consider emojis, rather than those a specification defines? For example, a person would say that 🗡 is an emoji, but not # or the digits. In our case we have to hardcode some additional rules to fix this behavior, as we can't tell our users that # and 0-9 are emojis.

We will probably find a way to work around such issues by adding an extra layer on top of emoji-regex. However, I wanted to share my thinking in the hope of helping the library become better and as well known as it deserves to be.

mathiasbynens commented 6 years ago

@astoilkov What people consider to be an emoji depends on their operating system and the fonts they have installed. It’s impossible to create a static regular expression that takes the user’s environment into consideration.

So this project does the next best thing: it uses the Unicode Standard as the single source of truth. Whenever implementations (e.g. emoji on macOS) deviate from the standard (e.g. #28), there will always be a mismatch between what is matched and what you’d expect based on the OS behavior. There is no way around this.

We could apply the hacky workaround from https://github.com/mathiasbynens/emoji-regex/issues/28#issuecomment-323044429 to emoji-regex, and it would make such mismatches less common at the cost of being less technically correct, but it would still not fully solve the problem.

astoilkov commented 6 years ago

@mathiasbynens Thanks for the lengthy explanation. I now understand the problem in more detail. I think we can close this issue.

If I were in your position, I would probably create another file (like text.js) that captures such scenarios without following the specification, and describe that in the readme. That way you could fine-tune it little by little.

gilmoreorless commented 6 years ago

Given the generally-fluid answer to "what exactly is an emoji?" (official answer: it depends), I think following the spec was the only sensible course of action for this project.

My frustration has been with tr51 defining the keycap base characters (0-9, * and #) as having the Emoji=Yes property. I understand why they did it, not least because it makes defining the formal grammar much easier and more consistent. That doesn't stop me being frustrated about it though, since even with the U+FE0F presentation selector, no system displays those characters as "colorful and perhaps whimsical shapes".

@mathiasbynens I wonder if it would be worth creating a separate "loose" regex for this sort of use case. I'm thinking of a version of the text regex which excludes any standalone characters with the property Emoji_Component=Yes. Specifically that would mean these characters (from the 11.0 emoji-data.txt):

0023          ; Emoji_Component      #  1.1  [1] (#️)       number sign
002A          ; Emoji_Component      #  1.1  [1] (*️)       asterisk
0030..0039    ; Emoji_Component      #  1.1 [10] (0️..9️)    digit zero..digit nine
200D          ; Emoji_Component      #  1.1  [1] (‍)        zero width joiner
20E3          ; Emoji_Component      #  3.0  [1] (⃣)       combining enclosing keycap
FE0F          ; Emoji_Component      #  3.2  [1] (️)        VARIATION SELECTOR-16
1F1E6..1F1FF  ; Emoji_Component      #  6.0 [26] (🇦..🇿)    regional indicator symbol letter a..regional indicator symbol letter z
1F3FB..1F3FF  ; Emoji_Component      #  8.0  [5] (🏻..🏿)    light skin tone..dark skin tone
1F9B0..1F9B3  ; Emoji_Component      # 11.0  [4] (🦰..🦳)    red-haired..white-haired
E0020..E007F  ; Emoji_Component      #  3.1 [96] (󠀠..󠁿)      tag space..cancel tag

Those characters would still be correctly matched in their respective sequences.
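
As a rough illustration of that idea, here is a minimal sketch, assuming an engine with Unicode property escapes. The pattern `looseEmoji` and the helper `findLooseEmoji` are hypothetical and much simpler than the real emoji-regex/text source: they only deal with the keycap bases (#, * and 0-9), and they do not group ZWJ sequences, flags or skin-tone modifiers into single matches.

```javascript
// Hypothetical "loose" pattern, not the emoji-regex source.
// 1st alternative: keycap sequences, e.g. "1" + U+FE0F (optional) + U+20E3.
// 2nd alternative: any Emoji=Yes code point that is not a bare keycap base,
//                  optionally followed by the U+FE0F presentation selector.
const looseEmoji = /[#*0-9]\uFE0F?\u20E3|(?![#*0-9])\p{Emoji}\uFE0F?/gu;

function findLooseEmoji(text) {
  // String.prototype.match with a global regex returns all matches or null.
  return text.match(looseEmoji) || [];
}

// The bare "1" and "#" are skipped; the keycap sequence and the face match.
console.log(findLooseEmoji('Room 1, press 1\uFE0F\u20E3 or # \u{1F600}'));
```

The negative lookahead is what keeps standalone digits out while the first alternative still accepts them inside a keycap sequence.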

Additionally, a "loose" regex could define the flag sequences as just [\u{1F1E6}-\u{1F1FF}]{2} (or even \p{Regional_Indicator}{2}), which would cut down the regex size at the cost of potentially matching invalid sequences.
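
To make the trade-off concrete, here is a small sketch using the code point range quoted above. The variable names are illustrative only; the point is that any pair of regional indicators matches, whether or not it corresponds to an assigned region.

```javascript
// Loose flag matcher: any two regional indicator symbols in a row.
const looseFlag = /[\u{1F1E6}-\u{1F1FF}]{2}/u;

const validPair = '\u{1F1FA}\u{1F1F8}';   // regional indicators U + S
const invalidPair = '\u{1F1FD}\u{1F1FD}'; // regional indicators X + X, not a flag sequence

// Both pairs match, which is the "potentially matching invalid sequences" cost.
console.log(looseFlag.test(validPair), looseFlag.test(invalidPair));
```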

I haven't actually tested this idea, mainly just thinking out loud.

(Edit: After looking at the proposed changes for the 11.0 spec, it seems that the new Extended_Pictographic=Yes property covers my use case rather neatly. "The Extended_Pictographic characters contain all the Emoji characters except for some Emoji_Components.")
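
A minimal sketch of what that could look like in JavaScript, assuming the engine supports the `\p{Extended_Pictographic}` property escape. The helper name `findPictographs` is made up for illustration, and the pattern matches single code points only, so keycap, flag and ZWJ sequences are not grouped.

```javascript
// Sketch: match individual pictographic code points.
// Keycap bases (#, *, 0-9) are not Extended_Pictographic, so they are skipped.
const pictographic = /\p{Extended_Pictographic}/gu;

function findPictographs(text) {
  // Returns every matching code point, or an empty array when there is none.
  return text.match(pictographic) || [];
}

// U+1F5E1 is the dagger, U+1F600 the grinning face.
console.log(findPictographs('Order #1 * \u{1F5E1} and \u{1F600}'));
```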

josephrocca commented 4 years ago

Perhaps the readme could be updated to warn people about the counter-intuitive parts of this module? Just something like "watch out for these weird things about the Unicode spec: ..."

In any case, here are all the symbols that emoji-regex/text misses (whether on purpose or not), in case there are any here which it is supposed to match:

[Symbol list garbled beyond recovery in this copy. The legible fragments are mostly text-style symbols from Miscellaneous Symbols (U+2600 to U+26FF), Dingbats, Mahjong Tiles and the U+1F300 to U+1F6FF pictograph range, plus standalone components such as U+200D, U+FE0F, U+20E3 and the tag characters.]

I made a module that matches these and also incorporates @gilmoreorless's variation selector fix: https://github.com/josephrocca/emoji-and-symbol-regex

ChurchTao commented 4 years ago

@gilmoreorless I had the same idea as your "loose" regex proposal above (excluding standalone characters with the property Emoji_Component=Yes), so I made a non-regex module based on https://www.unicode.org/Public/emoji/13.0/emoji-test.txt.

https://github.com/ChurchTao/emoji-js

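dezren39 commented 3 years ago

import _emojiRegex from 'emoji-regex/es2015/text.js';
// Use .source rather than .toString(), which would also embed the / delimiters and flags.
// Drop "#\*0-9" from the pattern, then also match leftover U+FE0F / U+20E3 on their own.
const emojiRegex = () => new RegExp('(' + _emojiRegex().source.replace(/#\\\*0-9/gu, '') + '|\uFE0F\u20E3|\uFE0F|\u20E3)', 'gu');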

I did this. It doesn't count 0-9, # or *. The part at the end strips the enclosing boxes from actual number emoji but keeps the numbers, which is what I wanted in my case. I'm pretty sure |\uFE0F\u20E3|\uFE0F may be unneeded and just |\u20E3 would be sufficient. There are better ways to solve this, and I thought of more complex ones, but this is one extra line without making a whole new package.

Open to suggestions for a better way to handle this. :+1: In my case I also added the 'non-emoji' symbols that are basically emoji, but that is secondary to this number issue.


While researching, I also found the emoji-patterns package (https://github.com/tonton-pixel/emoji-patterns), which splits each category into its own pattern and provides two larger patterns that join the categories together. If you need a more nuanced take, something like this may be useful.

Though, I believe the real solution is for TC39 to accept something like this (currently at the proposal stage): https://mths.be/emoji

mathiasbynens commented 3 years ago

Is there anything left to do to resolve this issue? I'm closing it for now. If anyone wants to suggest a README improvement that calls out some of the Unicode weirdness we've discussed, please send a PR!

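say8425 commented 3 years ago

import * as emojiPatterns from 'emoji-patterns';

// Strip the keycap bases (#, *, 0-9) and the regional indicators from the Emoji_All pattern.
const emojiRegex = new RegExp(emojiPatterns['Emoji_All'].replace(/\\u0023\\u002A\\u0030-\\u0039|\\u{1F1E6}-\\u{1F1FF}/gi, ''), 'gu');
// Note: with the 'g' flag, test() is stateful (it advances lastIndex between calls).
emojiRegex.test(value);

In the end, I used the emoji-patterns package.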