When removing emojis with .gsub, I'm getting error on compare with empty string.

andreleoni commented 5 years ago

Hello. I’m trying to use the gem to remove emojis from strings, but I’m getting an error when comparing the result with the expected string.

[33] pry(#<RSpec::Matchers::DSL::Matcher>)> regex_result = '🛤🎯📮📘↕⭕🇬🇶🇼🇸📪🛎👨‍🌾🏺🐚🤯'.gsub(Unicode::Emoji::REGEX_ANY, '')
=> "‍"
[34] pry(#<RSpec::Matchers::DSL::Matcher>)> regex_result == ''
=> false

[36] pry(#<RSpec::Matchers::DSL::Matcher>)> Marshal.dump(regex_result)
=> "\x04\bI\"\b\xE2\x80\x8D\x06:\x06ET"
[37] pry(#<RSpec::Matchers::DSL::Matcher>)> Marshal.dump('')
=> "\x04\bI\"\x00\x06:\x06ET"

What am I doing wrong here? :sweat_smile:

janlelis commented 5 years ago

Hey Andre,

Although REGEX_ANY matches a lot of emoji-related codepoints, it does not match some Unicode codepoints that are used by emoji but are also used outside of the emoji world, like U+200D ZERO WIDTH JOINER (ZWJ). That's exactly what is happening here: there is still a ZWJ in the data.

uniscribe '🛤🎯📮📘↕⭕🇬🇶🇼🇸📪🛎👨‍🌾🏺🐚🤯'.gsub(Unicode::Emoji::REGEX_ANY, '')

200D ├─ ]‍[     ├─ ZERO WIDTH JOINER
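If uniscribe is not at hand, the same check can be sketched in plain Ruby. The two helper names below are hypothetical and are not part of unicode-emoji; the snippet only assumes that the leftover string is the single U+200D shown above, which also matches the `\xE2\x80\x8D` bytes in the Marshal dump:

```ruby
# Hypothetical helper: list each codepoint of a string in U+XXXX notation.
def codepoint_labels(str)
  str.codepoints.map { |cp| format("U+%04X", cp) }
end

# Hypothetical helper: list the UTF-8 bytes of a string as hex pairs.
def utf8_bytes_hex(str)
  str.encode("UTF-8").bytes.map { |b| format("%02X", b) }
end

# Assumption: this is the invisible string left over after the gsub in the report.
leftover = "\u200D"

puts leftover.empty?                     # false
puts codepoint_labels(leftover).inspect  # ["U+200D"]
puts utf8_bytes_hex(leftover).inspect    # ["E2", "80", "8D"]
```

The string prints as if it were empty, but it still holds one invisible character, which is why the comparison with `''` fails.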

I've clarified this behavior in the README table.

What you want to do is use REGEX, which gives you better (and more robust) results. For example:

uniscribe '🛤🎯📮📘↕⭕🇬🇶🇼🇸📪🛎👨‍🌾🏺🐚🤯'.gsub(Unicode::Emoji::REGEX, '')

Unfortunately, this will let through textual emoji like

2195 ├─ ↕       ├─ UP DOWN ARROW

To work around this issue, you can also remove emoji that match REGEX_TEXT, for example like this:

'🛤🎯📮📘↕⭕🇬🇶🇼🇸📪🛎👨‍🌾🏺🐚🤯'.gsub(Regexp.union(Unicode::Emoji::REGEX, Unicode::Emoji::REGEX_TEXT), '') == "" # => true
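The `Regexp.union` approach can be wrapped in a small helper. This is only a sketch: `strip_matches` is a hypothetical name, and the two patterns below are hand-written stand-ins so that the snippet runs without the gem. With the gem loaded you would pass `Unicode::Emoji::REGEX` and `Unicode::Emoji::REGEX_TEXT` instead.

```ruby
# Hypothetical helper: remove everything matched by any of the given patterns.
def strip_matches(text, *patterns)
  text.gsub(Regexp.union(*patterns), "")
end

# Stand-in patterns (assumptions for illustration, NOT the gem's regexes):
# one matches a whole ZWJ sequence (man + ZWJ + sheaf of rice),
# the other matches a textual emoji (up down arrow).
sequence_like = /\u{1F468}\u200D\u{1F33E}/
textual_like  = /\u2195/

sample = "a\u{1F468}\u200D\u{1F33E}\u2195b"

puts strip_matches(sample, sequence_like, textual_like)  # ab
```

Because the sequence-level pattern consumes the joiner as part of its match, no stray U+200D is left behind.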

Please let me know whether this fixes your issue.

Actually, your feedback inspired me to add a REGEX_ALL regex in a future version of this gem, which will include textual emoji; see #5.

janlelis commented 5 years ago

Closing; please re-open if the problem persists.

janlelis / unicode-emoji
