milesj / emojibase

🎮 A collection of lightweight, up-to-date, pre-generated, specification compliant, localized emoji JSON datasets, regex patterns, and more.
https://emojibase.dev
MIT License

None of the regexes match emoji, and only emoji #174

Open robintown opened 3 months ago

robintown commented 3 months ago

A regex that matches emoji would be a really useful thing to have in the JS ecosystem! Unfortunately, between Emojibase and emoji-regex, I still haven't seen a package that actually does this. In the case of Emojibase:

What's missing is a regex that matches exactly those character sequences that are presented to users as emoji. Some characters are defined in Unicode to default to emoji presentation (see the Emoji_Presentation section), while others require U+FE0F to change their presentation mode. A correct implementation would account for both of these facts, and use a negative lookahead to avoid matching characters with U+FE0E.
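To make the two rules above concrete, here is a minimal sketch in u mode. It is an illustration only, not anything shipped by Emojibase or emoji-regex: the names `emojiPresented` and `findPresentedEmoji` are made up for this example, and it handles single code points only (no flags, keycaps, skin tone modifiers, or ZWJ sequences).

```javascript
// Sketch only: single code points, no flags, keycaps, modifiers, or ZWJ sequences.
// Alternative 1: any Emoji character explicitly followed by U+FE0F (emoji presentation selector).
// Alternative 2: a character that defaults to emoji presentation, unless U+FE0E
//                (text presentation selector) follows it; that is the negative lookahead.
const emojiPresented = /\p{Emoji}\uFE0F|\p{Emoji_Presentation}(?!\uFE0E)/gu;

// Hypothetical helper: returns every match in `text`, or an empty array.
function findPresentedEmoji(text) {
  return text.match(emojiPresented) || [];
}

console.log(findPresentedEmoji('a \u2194 b'));       // '↔' alone defaults to text presentation
console.log(findPresentedEmoji('a \u2194\uFE0F b')); // '↔' with U+FE0F
console.log(findPresentedEmoji('\u2728'));           // '✨' defaults to emoji presentation
console.log(findPresentedEmoji('\u2728\uFE0E'));     // '✨' forced to text presentation
```

Note that `\p{Emoji}` also covers ASCII digits, `#`, and `*`, so a digit followed by U+FE0F would match here too; a production regex would need to treat keycap sequences separately.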

milesj commented 3 months ago

I'll be honest, it's been so long since I've worked on this emoji stuff that I've forgotten a lot of how it works. I always have to re-learn the codebase each time I update it. So I'm sure there are bugs everywhere.

With that said, I am tinkering with the regexes here: https://github.com/milesj/emojibase/pull/175

milesj commented 3 months ago

So after looking at this post and the code again, the behavior described in the issue is accurate. It's by design.

  • emojibase-regex matches some textual characters such as '↔'.
  • emojibase-regex/emoji doesn't match emoji without U+FE0F, such as '✨'.
  • emojibase-regex/emoji-loose matches some textual characters without U+FE0E, such as '↔'.
  • And the rest of the provided regexes are obviously not intended to be used for matching emoji.

I also use regexgen (https://github.com/devongovett/regexgen) to generate the regex pattern, and it does not support negative lookaheads. I'm not aware of another library to handle this and I'm definitely not going to write it from scratch.

There is a regex using unicode properties, but I haven't tested it in years: https://emojibase.dev/docs/regex#unicode-property-support

milesj commented 2 months ago

Been thinking about this more, and I think we could solve this by using functions, like isEmojiPresentation and isTextPresentation, instead of relying purely on RegExp instances. With functions we could run the necessary checks to ensure it's exactly what you want.
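As a rough idea of what such functions might look like: only the names isEmojiPresentation and isTextPresentation come from the comment above. The signatures (a single character, optionally followed by one variation selector) and the logic are assumptions for illustration, not an Emojibase API.

```javascript
// Hypothetical sketch. Input is assumed to be one code point, optionally followed
// by a single variation selector (U+FE0E for text, U+FE0F for emoji).
const VS_TEXT = '\uFE0E';
const VS_EMOJI = '\uFE0F';

function isEmojiPresentation(chars) {
  if (chars.endsWith(VS_TEXT)) return false;
  // An explicit U+FE0F turns any Emoji character into emoji presentation.
  if (chars.endsWith(VS_EMOJI)) return /^\p{Emoji}\uFE0F$/u.test(chars);
  // Without a selector, only characters that default to emoji presentation qualify.
  return /^\p{Emoji_Presentation}$/u.test(chars);
}

function isTextPresentation(chars) {
  if (chars.endsWith(VS_EMOJI)) return false;
  // An explicit U+FE0E forces text presentation.
  if (chars.endsWith(VS_TEXT)) return /^\p{Emoji}\uFE0E$/u.test(chars);
  // Without a selector, Emoji characters that do not default to emoji presentation are text.
  return /^\p{Emoji}$/u.test(chars) && !/^\p{Emoji_Presentation}$/u.test(chars);
}

console.log(isEmojiPresentation('\u2728'), isTextPresentation('\u2728')); // '✨'
console.log(isEmojiPresentation('\u2194'), isTextPresentation('\u2194')); // '↔'
```

Because these are plain functions, they can branch on the variation selector directly, which sidesteps the need for lookaheads in a generated pattern.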

robintown commented 2 months ago

Re: the Unicode properties approach, I was happy to discover that the new RegExp v mode makes writing an emoji regex by hand pretty easy, and this is what I've ended up going for.

/\p{RGI_Emoji}(?!\uFE0E)(?:(?<!\uFE0F)\uFE0F)?/v

All major browsers support it, though only as of late 2023. You can get a version that kinda sorta works while only using u mode if you replace \p{RGI_Emoji} with this regex, but it's not going to do well with flags and ZWJ sequences unless you teach it exactly what the valid sequences are.
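For anyone wanting to try the v-mode pattern above while still supporting older engines, one possible approach is to build it through the RegExp constructor so that an unsupported flag throws at runtime rather than at parse time. The helper names `createEmojiRegex` and `findEmoji` are made up for this sketch.

```javascript
// The pattern from the comment above, kept as a string so that engines without
// v-mode support fail inside the try block instead of at parse time.
const EMOJI_SOURCE = String.raw`\p{RGI_Emoji}(?!\uFE0E)(?:(?<!\uFE0F)\uFE0F)?`;

// Hypothetical helper: returns a global v-mode regex, or null if the engine lacks the v flag.
function createEmojiRegex() {
  try {
    return new RegExp(EMOJI_SOURCE, 'gv');
  } catch {
    return null;
  }
}

// Hypothetical helper: returns all matches, or null when v mode is unavailable.
function findEmoji(text) {
  const regex = createEmojiRegex();
  if (regex === null) return null;
  return text.match(regex) || [];
}

// '↔' alone is text, '✨' is emoji by default, and the flag is a two-code-point sequence.
console.log(findEmoji('a \u2194 b \u2728 c \u{1F1EF}\u{1F1F5}'));
```

Returning null rather than silently falling back keeps the caller aware that matching could not be done to the same standard; a u-mode fallback would need its own list of valid sequences, as noted above.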

milesj commented 2 months ago

Nice, good to know! Been waiting years for all those to become available.