janlelis / unicode-emoji

Up-to-date Emoji Regex in Ruby 💥
https://character.construction/emoji
MIT License
147 stars 14 forks

Incorrect matching of several family emojis #12

Closed matt17r closed 2 years ago

matt17r commented 2 years ago

Scanning family emojis using the "recommended" REGEX yields a smaller family plus a separate kid, rather than a match on the whole emoji, in 15 cases:

require "unicode/emoji"

# Print every family emoji whose first REGEX match is shorter than the emoji itself
Unicode::Emoji.list("People & Body", "family").each do |emoji|
  if emoji.length > emoji.scan(Unicode::Emoji::REGEX)[0].length
    puts "\"#{emoji}\".scan(Unicode::Emoji::REGEX) = #{emoji.scan(Unicode::Emoji::REGEX)}"
  end
end

# "👨‍👩‍👧‍👦".scan(Unicode::Emoji::REGEX) = ["👨‍👩‍👧", "👦"]
# "👨‍👩‍👦‍👦".scan(Unicode::Emoji::REGEX) = ["👨‍👩‍👦", "👦"]
# "👨‍👩‍👧‍👧".scan(Unicode::Emoji::REGEX) = ["👨‍👩‍👧", "👧"]
# "👨‍👨‍👧‍👦".scan(Unicode::Emoji::REGEX) = ["👨‍👨‍👧", "👦"]
# "👨‍👨‍👦‍👦".scan(Unicode::Emoji::REGEX) = ["👨‍👨‍👦", "👦"]
# "👨‍👨‍👧‍👧".scan(Unicode::Emoji::REGEX) = ["👨‍👨‍👧", "👧"]
# "👩‍👩‍👧‍👦".scan(Unicode::Emoji::REGEX) = ["👩‍👩‍👧", "👦"]
# "👩‍👩‍👦‍👦".scan(Unicode::Emoji::REGEX) = ["👩‍👩‍👦", "👦"]
# "👩‍👩‍👧‍👧".scan(Unicode::Emoji::REGEX) = ["👩‍👩‍👧", "👧"]
# "👨‍👦‍👦".scan(Unicode::Emoji::REGEX) = ["👨‍👦", "👦"]
# "👨‍👧‍👦".scan(Unicode::Emoji::REGEX) = ["👨‍👧", "👦"]
# "👨‍👧‍👧".scan(Unicode::Emoji::REGEX) = ["👨‍👧", "👧"]
# "👩‍👦‍👦".scan(Unicode::Emoji::REGEX) = ["👩‍👦", "👦"]
# "👩‍👧‍👦".scan(Unicode::Emoji::REGEX) = ["👩‍👧", "👦"]
# "👩‍👧‍👧".scan(Unicode::Emoji::REGEX) = ["👩‍👧", "👧"]

I think this is because of the order of the alternatives in the generated REGEX. I suspect it finds the smaller match first and stops looking. After modifying the generated REGEX in lib/unicode/emoji/generated/regex.rb to REGEX = /(?:(?:👨‍❤️‍👨|👨‍❤️‍💋‍👨|👨‍👩‍👧‍👧|👨‍👦|👨‍👦‍👦|... (moving the 👨‍👩‍👧‍👧 earlier), that particular emoji started matching correctly.
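For anyone who wants to see the suspected cause in isolation, here is a minimal sketch that does not use the gem at all. The constants and the two tiny regexes are made up for illustration. It relies only on the fact that Ruby's regex engine tries alternatives left to right and keeps the first one that succeeds, even when a later alternative would match more text.

```ruby
# Illustration only: these names are hypothetical and not part of unicode-emoji.
MAN  = "\u{1F468}"
GIRL = "\u{1F467}"
BOY  = "\u{1F466}"
ZWJ  = "\u{200D}" # zero width joiner that glues the family members together

short_family = [MAN, GIRL].join(ZWJ)      # man, girl
long_family  = [MAN, GIRL, BOY].join(ZWJ) # man, girl, boy

# Same alternatives, different order.
short_first = /#{short_family}|#{long_family}|#{BOY}/
long_first  = /#{long_family}|#{short_family}|#{BOY}/

# With the shorter sequence listed first, the long family is split in two:
# the shorter family matches, the joiner is skipped, then the boy matches.
p long_family.scan(short_first).length # → 2

# With the longer sequence listed first, the whole emoji is one match.
p long_family.scan(long_first).length  # → 1
```

The split result has the same shape as the output above: a smaller family followed by a lone child.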

I have no idea how to update the generation logic to move those emojis to the front of the queue, though.
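One possible direction, sketched under the assumption that the generator has a plain list of emoji sequence strings at some point: sort that list longest first before joining it into an alternation. The helper name `longest_first_regex` is hypothetical, and the gem's actual generation code is not shown in this thread, so treat this as an idea rather than a patch.

```ruby
# Hypothetical helper, not part of unicode-emoji: builds an alternation in
# which longer sequences come before any shorter sequence they start with.
def longest_first_regex(sequences)
  ordered = sequences.sort_by { |sequence| -sequence.length }
  Regexp.union(ordered) # escapes each string and joins them with "|" in order
end

zwj = "\u{200D}"
man_girl     = ["\u{1F468}", "\u{1F467}"].join(zwj)
man_girl_boy = ["\u{1F468}", "\u{1F467}", "\u{1F466}"].join(zwj)
boy          = "\u{1F466}"

# Input order deliberately puts the shorter family first.
regex = longest_first_regex([man_girl, boy, man_girl_boy])

p man_girl_boy.scan(regex).length # → 1
```

Sorting by length is a blunt but sufficient rule here: a sequence can only be shadowed by a strictly shorter prefix of itself, so placing every longer string first removes the problem regardless of how equal-length entries are ordered.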

janlelis commented 2 years ago

Hey @matt17r, thank you for your comments and bug requests.

Changing that order seems like the right thing to do. I'll put it on my TODO list for later this month (or maybe @radarek is faster!).

janlelis commented 2 years ago

Hi @matt17r, this should be fixed in v3.1.1. Thanks again for the bug report.