mathiasbynens / emoji-regex

A regular expression to match all Emoji-only symbols as per the Unicode Standard.
https://mths.be/emoji-regex
MIT License
1.73k stars 174 forks

Regexp does not match all emoji but matches digits #48

Closed cuper6 closed 3 years ago

cuper6 commented 5 years ago

Hello! I found two issues with the v7.0.1 regular expression.

1) The regexp matches digits like 0, 1 ... 9 but does not match some emoji codes like \u271d (Latin Cross emoji). It seems the problem is here: (?:[#*0-9\xA9\xAE\u203C\
2) The regexp does not match some long emoji constructions like \uD83D\uDC73\u200D\u2640\uFE0F (👳‍♀️)

nolanlawson commented 5 years ago

For the second issue, it seems you can match that string using the text.js version of this library.

For the first issue, it seems like a true bug. Unless there is some technical reason why 0-9 should be included...

mathiasbynens commented 5 years ago

Re: matching digits 0-9: https://github.com/mathiasbynens/emoji-regex/issues/33#issuecomment-373674579
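For context on why digits can count as emoji at all: in Unicode's emoji data the characters #, * and 0-9 carry the Emoji property, since they are the bases of keycap sequences such as 1️⃣. The sketch below checks this with JavaScript's built-in Unicode property escapes rather than with emoji-regex itself. It assumes a runtime that supports \p{…} with the u flag (for example a recent Node.js), and the names hasEmojiProperty and keycap are hypothetical, chosen only for illustration.

```javascript
// Illustrative only: relies on built-in Unicode property escapes, not on emoji-regex.
// hasEmojiProperty is a hypothetical helper name.
const hasEmojiProperty = (ch) => /^\p{Emoji}$/u.test(ch);

// Digits, '#', and the Latin Cross from the report, plus a plain letter for contrast.
const samples = ['1', '#', '\u271D', 'a'];
for (const ch of samples) {
  console.log(JSON.stringify(ch), hasEmojiProperty(ch));
}

// A keycap sequence is a base character, an optional U+FE0F, then U+20E3.
const keycap = /^[#*0-9]\uFE0F?\u20E3$/;
console.log(keycap.test('1\uFE0F\u20E3')); // true
console.log(keycap.test('1'));             // false: a bare digit is not a keycap
```

Whether a given emoji-regex release should match a bare digit is the design question discussed in the linked comment; the snippet only shows where the digits come from in the Unicode data.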

mathiasbynens commented 5 years ago

Re: the second issue you mention, can you share reproduction steps? This seems to work correctly:

const emojiRegex = require('emoji-regex');

const emoji = '\u{1F473}\u200D\u2640\uFE0F';
emoji.match(emojiRegex())[0] === emoji;

henrikra commented 5 years ago

@nolanlawson What is the difference with the text version? When should each version be used? 🤔 It is not documented.

mathiasbynens commented 5 years ago

@henrikra From the README:

To match emoji in their textual representation as well (i.e. emoji that are not Emoji_Presentation symbols and that aren’t forced to render as emoji by a variation selector), require the other regex:

const emojiRegex = require('emoji-regex/text.js');

henrikra commented 5 years ago

So can you give me an example?

mathiasbynens commented 5 years ago

@henrikra Digits like 0-9 as @cuper6 mentioned

henrikra commented 5 years ago

Hmm, I really don't get it. Can you give an actual example: a case A where you should use the index version and a case B where you should use the text version?

mathiasbynens commented 5 years ago

@henrikra The text flavor was added after people asked for it in https://github.com/mathiasbynens/emoji-regex/issues/13.
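To make the README's distinction concrete, here is a rough sketch that sorts a string into the categories the README names: symbols with Emoji_Presentation, symbols forced to emoji by the variation selector U+FE0F, and symbols left in their textual representation. It uses built-in Unicode property escapes and a hypothetical helper, presentationOf; it approximates the README's wording and is not the regex either flavor of the library actually uses.

```javascript
// Hypothetical helper approximating the README's wording; not emoji-regex's own logic.
function presentationOf(str) {
  const [first, second] = Array.from(str); // Array.from iterates by code point
  if (!/^\p{Emoji}$/u.test(first)) return 'not-emoji';
  if (second === '\uFE0F') return 'emoji (forced by U+FE0F)';
  if (second === '\uFE0E') return 'text (forced by U+FE0E)';
  if (/^\p{Emoji_Presentation}$/u.test(first)) return 'emoji (default)';
  return 'text (default)';
}

console.log(presentationOf('\u{1F389}'));    // emoji (default)
console.log(presentationOf('\u2640\uFE0F')); // emoji (forced by U+FE0F)
console.log(presentationOf('\u2640'));       // text (default)
console.log(presentationOf('7'));            // text (default)
```

Read against the README quote, case A (the default require) would be input where only the first two categories should count, and case B (text.js) would be input where the "text (default)" category should count as well. That mapping is an interpretation of the README, not something verified against the library here.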

alies-dev commented 5 years ago

Just tested: this lib does not match the 🛠 emoji.

blixt commented 5 years ago

I'm also seeing issues with emoji like 🎛 and 🕸

mislav commented 4 years ago

To find characters that emoji-regex (from current master 6727974) doesn't match, I've downloaded http://unicode.org/Public/emoji/12.1/emoji-test.txt, then filtered it down to exclude "unqualified" and "component" entries:

grep -E '^[^#]' emoji-test.txt | grep -Ev '; (unqualified|component)'

I piped that into this script:

const emojiRegex = require('emoji-regex')()
let total = 0
let unmatched = 0

require('readline').createInterface({
    input: process.stdin
}).on('line', line => {
    const [_, description] = line.split(/#[^E]*/, 2)
    const [sequence] = line.split(/\s*;/, 2)
    const emoji = sequence.split(' ').map(c => String.fromCodePoint(parseInt(c, 16))).join('')
    total++
    if (!emojiRegex.exec(emoji)) {
        console.warn('unmatched: %s (%s)', description, sequence)
        unmatched++
    }
}).on('close', function() {
    console.log('%d/%d did not match', unmatched, total)
    if (unmatched > 0) process.exit(1)
})

The result is here. Its summary is:

1789/3767 did not match

Since every input was a fully qualified or partially qualified emoji, I had expected all of them to match. That 1789 failed to match is either a bit worrisome or an indicator that my assumptions are incorrect.

An example of a fully qualified emoji that didn't match: 🧐 “face with monocle” (1F9D0). Am I using emoji-regex wrong?

mislav commented 4 years ago

Sorry, disregard my above comment. I now see that I've been using exec() wrong; it should have been in a while loop like in the README:

-if (!emojiRegex.exec(emoji)) {
+let matched = false
+while (match = emojiRegex.exec(emoji)) matched = true
+if (!matched) {

They all match now! 🎉
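The reason the loop helps is that a regex with the g flag keeps state in lastIndex between exec() calls, so reusing one instance across unrelated strings makes roughly every other call start past the end of the input. A minimal sketch of that behavior, using a plain /a/g as a stand-in for the emoji regex (an assumption for illustration; any global regex behaves the same way):

```javascript
// /a/g stands in for the emoji regex; the statefulness comes from the g flag.
const re = /a/g;
const results = [];
for (let i = 0; i < 4; i++) {
  // A successful exec() leaves lastIndex at 1, so the next call starts past the
  // end of 'a', returns null, and resets lastIndex to 0.
  results.push(re.exec('a') !== null);
}
console.log(results); // [ true, false, true, false ]

// Two ways to test each input independently of the previous one:
const inputs = ['a', 'a', 'a'];
const freshEachTime = inputs.map((s) => /a/g.exec(s) !== null);
const resetEachTime = inputs.map((s) => {
  re.lastIndex = 0; // discard state left over from the previous input
  return re.test(s);
});
console.log(freshEachTime, resetEachTime);
```

Running exec() until it returns null, as in the diff above, has the same effect because the final failed call resets lastIndex to 0 before the next input is checked.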

mathiasbynens commented 3 years ago

Closing this issue since the /text.js question has been answered, and there's nothing actionable left. Feel free to re-open if I missed anything.