mathiasbynens / emoji-regex

A regular expression to match all Emoji-only symbols as per the Unicode Standard.
https://mths.be/emoji-regex
MIT License
1.73k stars, 174 forks

Some emojis ending with `\ufe0f` are not completely matched #28

Closed merih closed 5 years ago

merih commented 7 years ago

The male detective emoji, 🕵️ ("\u{1f575}\ufe0f"), is not fully consumed when matched with the emoji regex: \ufe0f is left behind. The emoji was typed with the Control+Cmd+Space shortcut on a Mac.

"\u{1f575}\ufe0f".replace(emojiRegex(), "").length
//> 1

mathiasbynens commented 7 years ago

That’s not a standard emoji sequence AFAICT: U+1F575 U+FE0F is not listed in emoji-zwj-sequences.txt. The U+FE0F is not necessary.

gilmoreorless commented 7 years ago

Apple appears to have a very loose idea of conformance to the standard set of codepoints. While working on other fixes for emoji-regex I created a list of all the emoji available on my Mac (macOS 10.12.6) via the emoji picker. (Handy hint: Don't do that if you value your time and patience.)

There were 49 Emoji_Presentation or Emoji_Modifier_Base characters that have U+FE0F appended to them by the macOS picker, with no real consistency about which ones do or don't get the variation selector added (e.g. 🤞 doesn't but ✌️ does). Plus there were another 100 or so textual representation characters that are displayed by macOS in presentation mode without appending U+FE0F.

anoff commented 7 years ago

Are there any real downsides to adding this control character to the regex, besides bloating the regex just to work around a possible macOS bug?

artyom commented 7 years ago

Excerpt from http://unicode.org/Public/emoji/5.0/emoji-test.txt:

1F575 FE0F                                 ; fully-qualified     # 🕵️ detective
1F575                                      ; non-fully-qualified # 🕵 detective

So the sequence in question is rather conformant.
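
To make the two statuses concrete, here is a minimal sketch that parses the two lines of the excerpt above. `parseTestLine` is a hypothetical helper written for this illustration (it is not part of emoji-regex or any Unicode tooling), and the trailing comments of the data lines are shortened to plain text.

```javascript
// Two lines from the emoji-test.txt excerpt, with the comment text shortened.
const excerpt = [
  '1F575 FE0F                                 ; fully-qualified     # detective',
  '1F575                                      ; non-fully-qualified # detective',
];

// Hypothetical helper: split a data line into its sequence and its status.
function parseTestLine(line) {
  const [data] = line.split('#'); // drop the trailing comment
  const [codePoints, status] = data.split(';').map((part) => part.trim());
  const sequence = String.fromCodePoint(
    ...codePoints.split(/\s+/).map((hex) => parseInt(hex, 16))
  );
  return { sequence, status };
}

const parsed = excerpt.map(parseTestLine);
for (const { sequence, status } of parsed) {
  // Print the status and the number of UTF-16 code units in the sequence.
  console.log(status, sequence.length);
}
// fully-qualified 3
// non-fully-qualified 2
```

The fully-qualified form is one code unit longer because it carries the trailing U+FE0F, which is exactly the code unit left behind in the original report.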

mathiasbynens commented 7 years ago

Thanks for the pointer, @artyom!

Per http://unicode.org/reports/tr51/#Emoji_Implementation_Notes, emoji ZWJ sequences “may have an emoji presentation selector”.

mathiasbynens commented 7 years ago

Hacky solution: just add \uFE0F? to the regex (or [\uFE0E\uFE0F]? for the text regex). However, some of the sequences already end with presentation or variation selectors and are therefore already qualified; those shouldn’t be matched along with the U+FE0F. A proper fix will take some more time.
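
A minimal sketch of that stopgap, assuming a stand-in pattern that matches only U+1F575 in place of the real generated regex. `basePattern` and `withOptionalSelector` are hypothetical names used for illustration; they are not part of emoji-regex.

```javascript
// Hypothetical stand-in for the real emoji pattern: it matches only U+1F575,
// which keeps the example self-contained.
const basePattern = /\u{1F575}/gu;

// Wrap an existing pattern so that one trailing U+FE0F is consumed as well.
function withOptionalSelector(regex) {
  return new RegExp(`(?:${regex.source})\\uFE0F?`, regex.flags);
}

const input = '\u{1F575}\uFE0F';
const leftoverBefore = input.replace(basePattern, '');
const leftoverAfter = input.replace(withOptionalSelector(basePattern), '');

console.log(leftoverBefore.length); // 1: the selector is left behind
console.log(leftoverAfter.length);  // 0: the whole sequence is consumed
```

The wrapper appends the optional selector to every alternative, including any that already end in a selector, so it shares the limitation described above.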

gilmoreorless commented 7 years ago

For my own project's use I ended up going with that same hacky solution. I figured it wasn't right to submit a PR back to this project for it, so I just left it on a custom branch of my fork.

fredvollmer commented 6 years ago

Are there any plans to integrate this into the project? It seems that the consensus is that this is a legitimate use case... Sorry if I'm off base here.

mathiasbynens commented 6 years ago

@fredvollmer https://github.com/mathiasbynens/emoji-regex/issues/28#issuecomment-323044429 answers your question. I’d welcome a patch :)

gdutwyg commented 6 years ago

@mathiasbynens How can this problem be solved? I ran into it, too.

jerry153fish commented 6 years ago

Hi @mathiasbynens, is it possible to add rules for the emoji that do not fall within those sequences?

Examples:

🐿 🕊 👁 🕷 🕸 👓 ⛑ 🗣 🕶 ✌️ ☝️ ✍️ ✌🏼 ⚡️ ⭐️ 🌪 🌤 🌥 🌦 🌧 ⛈ 🌩 🌨 🌬 💨 🌶 🍽 ⛸ ⛷ 🎖 🏵 🎗 🎟 🏎 🏍 🛩 🛰 🛥 🛳 🗺 🏟 ⛱ 🏖 🏝 🏜 ⛰ 🏔 🏕 🏚 🏘 🏗 🏛 ⛩ 🛤 🛣 🏞 🏙 🖥 🖨 🖱 🖲 🕹 🗜 📽 🎞 🎙 🎚 🎛 ⏱ ⏲ 🕯 🗑 🛢 // ⚒ 🛠 ⛏ ⛓ 🗡 🛡 🕳 🌡 🛎 🗝 🛋 🛏 🖼 🛍 🏷 🗒 🗓 🗃 🗳 🗄 🗂 🗞 🖇 🖊 🖋 🖌 🖍 🕉 ⏸ ⏯ ⏹ ⏺ ⏭ ⏮ 👁‍🗨 🗯 🕰 ⛴ 🌫 🀄 ⛄️ ⛅️ ☔️ ☕️ ⚽️ ⚾ ⛳️ ⛵️ ⛽️ ⚓️ ⛲️ ⛺️ ⛪️ ⌚️ ⌛️ ♈️ ♉️ ♊️ ♋️ ♌️ ♍️ ♎️ ♏️ ♐️ ♑️ ♒️ ♓️ 🈚️ ⭕️ ⛔️ ❗️ 🈯️ ♿️ ⚪️ ⚫️ ⬛️ ⬜️ ◾️ ◽️

mathiasbynens commented 5 years ago

Try again using the latest release!

const emojiRegex = require('emoji-regex');

const string = '\u{1F575}\uFE0F'; // '🕵️'
console.log(
    string.match(emojiRegex())
);
// → [ '🕵️' ]

Closing as fixed. Feel free to reopen or file a new bug in case I missed anything.