janlelis / unicode-emoji

Up-to-date Emoji Regex in Ruby 💄
https://character.construction/emoji
MIT License

When removing emojis with .gsub, I'm getting error on compare with empty string. #4

Closed andreleoni closed 5 years ago

andreleoni commented 5 years ago

Hello. I'm trying to use the gem to remove emojis from strings, but I'm getting an error when comparing the result with the expected string.

[33] pry(#<RSpec::Matchers::DSL::Matcher>)> regex_result = '🛤🎯📮📘↕⭕🇬🇶🇼🇸📪🛎👨‍🌾šŸŗšŸš🤯'.gsub(Unicode::Emoji::REGEX_ANY, '')
=> "‍"
[34] pry(#<RSpec::Matchers::DSL::Matcher>)> regex_result == ''
=> false

[36] pry(#<RSpec::Matchers::DSL::Matcher>)> Marshal.dump(regex_result)
=> "\x04\bI\"\b\xE2\x80\x8D\x06:\x06ET"
[37] pry(#<RSpec::Matchers::DSL::Matcher>)> Marshal.dump('')
=> "\x04\bI\"\x00\x06:\x06ET"

What am I doing wrong here? :sweat_smile:

janlelis commented 5 years ago

Hey Andre,

Although REGEX_ANY matches a lot of emoji-related codepoints, it does not match some Unicode codepoints that are used by emoji but are also used outside of the emoji world, like U+200D ZERO WIDTH JOINER (ZWJ). That's exactly what is happening here: there is still a ZWJ in the data:

uniscribe '🛤🎯📮📘↕⭕🇬🇶🇼🇸📪🛎👨‍🌾šŸŗšŸš🤯'.gsub(Unicode::Emoji::REGEX_ANY, '')

200D ├─ ]‍[     ├─ ZERO WIDTH JOINER
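The same leftover can be inspected without extra tooling. The following is a minimal sketch that does not use the gem: `PER_CODEPOINT` is a hypothetical stand-in pattern (not the definition of REGEX_ANY) that matches pictographs one codepoint at a time, which is enough to reproduce the stray joiner and to relate it to the `\xE2\x80\x8D` bytes in the Marshal dump above.

```ruby
# Hypothetical stand-in for a per-codepoint emoji pattern.
# This is NOT the gem's REGEX_ANY, only an illustration of the effect.
PER_CODEPOINT = /[\u{1F300}-\u{1FAFF}]/

# MAN + ZERO WIDTH JOINER + EAR OF RICE (the "farmer" ZWJ sequence)
farmer = "\u{1F468}\u{200D}\u{1F33E}"

# Both pictographs are removed, but the joiner between them is not.
leftover = farmer.gsub(PER_CODEPOINT, "")

# List what is left, first as codepoints, then as UTF-8 bytes.
leftover_codepoints = leftover.codepoints.map { |cp| format("U+%04X", cp) }
leftover_bytes      = leftover.bytes.map { |b| format("%02X", b) }

puts leftover.empty?              # false: the string still holds one character
puts leftover_codepoints.inspect  # ["U+200D"]
puts leftover_bytes.inspect       # ["E2", "80", "8D"]
```

Printing the codepoints this way makes invisible characters such as the ZWJ show up in a failing spec, where the string itself looks empty.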

I've clarified this behavior in the README table.

What you want to do is to use REGEX which gives you better (and more robust) results. For example:

uniscribe '🛤🎯📮📘↕⭕🇬🇶🇼🇸📪🛎👨‍🌾šŸŗšŸš🤯'.gsub(Unicode::Emoji::REGEX, '')

Unfortunately, this will let through textual emoji like

2195 ├─ ↕       ├─ UP DOWN ARROW

To work around this issue, you can also remove emoji that match REGEX_TEXT, for example like this:

'🛤🎯📮📘↕⭕🇬🇶🇼🇸📪🛎👨‍🌾šŸŗšŸš🤯'.gsub(Regexp.union(Unicode::Emoji::REGEX, Unicode::Emoji::REGEX_TEXT), '') == "" # => true
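The idea behind the union can be sketched without the gem. `SEQUENCE_AWARE` and `TEXTUAL` below are hypothetical, much simplified stand-ins for REGEX and REGEX_TEXT (the real patterns cover far more cases): the first consumes a ZWJ sequence as a whole, the second catches a text-style symbol that the first lets through.

```ruby
# Hypothetical, simplified stand-ins. These are NOT the gem's patterns.
# Matches the farmer ZWJ sequence as one unit, or any single pictograph.
SEQUENCE_AWARE = /\u{1F468}\u{200D}\u{1F33E}|[\u{1F300}-\u{1FAFF}]/
# Matches characters from the Arrows block, such as U+2195 UP DOWN ARROW.
TEXTUAL = /[\u{2190}-\u{21FF}]/

# farmer sequence + UP DOWN ARROW + DIRECT HIT
input = "\u{1F468}\u{200D}\u{1F33E}\u{2195}\u{1F3AF}"

# The sequence-aware pattern alone removes the joiner with its sequence,
# but leaves the arrow behind.
only_sequences = input.gsub(SEQUENCE_AWARE, "")

# Regexp.union builds an alternation of both patterns, so one gsub pass
# removes everything that either pattern matches.
combined = Regexp.union(SEQUENCE_AWARE, TEXTUAL)
stripped = input.gsub(combined, "")

puts only_sequences == "\u{2195}"  # true
puts stripped.empty?               # true
```

Because the ZWJ is only matched as part of a full sequence, a joiner that appears in ordinary, non-emoji text is left in place.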

Please leave some feedback, if this fixes your issue.

Actually, your feedback inspired me to have a REGEX_ALL regex in a future version of this gem, which will include textual emoji in its regex, see #5

janlelis commented 5 years ago

Closing, please re-open if problem persists