Incorrect matching of several family emojis

matt17r commented 2 years ago

Scanning family emojis using the "recommended" REGEX yields a smaller family plus a separate kid, rather than a single match for the whole emoji, in 15 cases:

require "unicode/emoji"

# Print every family emoji whose first REGEX match is shorter than the emoji itself
Unicode::Emoji.list("People & Body", "family").each do |emoji|
  if emoji.length > emoji.scan(Unicode::Emoji::REGEX)[0].length
    puts "\"#{emoji}\".scan(Unicode::Emoji::REGEX) = #{emoji.scan(Unicode::Emoji::REGEX)}"
  end
end

# "👨‍👩‍👧‍👦".scan(Unicode::Emoji::REGEX) = ["👨‍👩‍👧", "👦"]
# "👨‍👩‍👦‍👦".scan(Unicode::Emoji::REGEX) = ["👨‍👩‍👦", "👦"]                                                          
# "👨‍👩‍👧‍👧".scan(Unicode::Emoji::REGEX) = ["👨‍👩‍👧", "👧"]                                                          
# "👨‍👨‍👧‍👦".scan(Unicode::Emoji::REGEX) = ["👨‍👨‍👧", "👦"]                                                          
# "👨‍👨‍👦‍👦".scan(Unicode::Emoji::REGEX) = ["👨‍👨‍👦", "👦"]                                                          
# "👨‍👨‍👧‍👧".scan(Unicode::Emoji::REGEX) = ["👨‍👨‍👧", "👧"]                                                          
# "👩‍👩‍👧‍👦".scan(Unicode::Emoji::REGEX) = ["👩‍👩‍👧", "👦"]                                                          
# "👩‍👩‍👦‍👦".scan(Unicode::Emoji::REGEX) = ["👩‍👩‍👦", "👦"]                                                          
# "👩‍👩‍👧‍👧".scan(Unicode::Emoji::REGEX) = ["👩‍👩‍👧", "👧"]                                                          
# "👨‍👦‍👦".scan(Unicode::Emoji::REGEX) = ["👨‍👦", "👦"]                                                                
# "👨‍👧‍👦".scan(Unicode::Emoji::REGEX) = ["👨‍👧", "👦"]                                                                
# "👨‍👧‍👧".scan(Unicode::Emoji::REGEX) = ["👨‍👧", "👧"]                                                                
# "👩‍👦‍👦".scan(Unicode::Emoji::REGEX) = ["👩‍👦", "👦"]                                                                
# "👩‍👧‍👦".scan(Unicode::Emoji::REGEX) = ["👩‍👧", "👦"]
# "👩‍👧‍👧".scan(Unicode::Emoji::REGEX) = ["👩‍👧", "👧"]

I think this is because of the order of the generated REGEX. I suspect it finds the smaller match first and stops looking... After modifying the generated REGEX in lib/unicode/emoji/generated/regex.rb to REGEX = /(?:(?:👨‍❤️‍👨|👨‍❤️‍💋‍👨|👨‍👩‍👧‍👧|👨‍👦|👨‍👦‍👦|... (moving the 👨‍👩‍👧‍👧 earlier) that particular emoji started matching correctly.
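A minimal sketch of the suspected cause, using only plain Ruby and no gem code: with regex alternation, the leftmost alternative that matches at a position wins, even if a later alternative would match more text. The constants and the three-alternative patterns below are made up for illustration and are far smaller than the real generated REGEX.

```ruby
# Build ZWJ sequences from escapes so the code points are explicit.
ZWJ  = "\u200D"
MAN  = "\u{1F468}"
GIRL = "\u{1F467}"
BOY  = "\u{1F466}"

short_family = [MAN, GIRL].join(ZWJ)       # man + girl
long_family  = [MAN, GIRL, BOY].join(ZWJ)  # man + girl + boy

# Same alternatives, different order (illustrative patterns, not the gem's REGEX).
short_first = Regexp.union(short_family, long_family, BOY)
long_first  = Regexp.union(long_family, short_family, BOY)

# The shorter alternative is tried first and wins, leaving the boy as a separate match.
short_first_result = long_family.scan(short_first)

# With the longer alternative first, the whole sequence matches once.
long_first_result = long_family.scan(long_first)

p short_first_result.length  # 2
p long_first_result.length   # 1
```

This mirrors the split seen in the output above: a smaller family followed by a lone child.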

I have no idea how to update the generation logic to move those emojis to the front of the queue, though.
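One possible direction, sketched under assumptions: sort the sequences longest-first before joining them into an alternation, so a longer sequence is always tried before any of its prefixes. `build_emoji_regex` is a hypothetical helper written for this comment; it is not the gem's actual generator, and the real fix may look different.

```ruby
ZWJ  = "\u200D"
MAN  = "\u{1F468}"
GIRL = "\u{1F467}"
BOY  = "\u{1F466}"

# Hypothetical helper: order alternatives by descending code point count,
# then alphabetically for a deterministic result, and union them.
def build_emoji_regex(sequences)
  longest_first = sequences.sort_by { |seq| [-seq.length, seq] }
  Regexp.union(longest_first)
end

short_family = [MAN, GIRL].join(ZWJ)
long_family  = [MAN, GIRL, BOY].join(ZWJ)

# Input order deliberately puts the shorter sequences first.
regex = build_emoji_regex([BOY, MAN, short_family, long_family])

long_matches  = long_family.scan(regex)
short_matches = short_family.scan(regex)
mixed_matches = (long_family + BOY).scan(regex)

p long_matches.length   # 1
p short_matches.length  # 1
p mixed_matches.length  # 2
```

Sorting by length is enough here because a prefix is always shorter than the sequence it is a prefix of; it makes no claim about how the alternatives are grouped in the generated file.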

janlelis commented 2 years ago

Hey @matt17r, thank you for your comments and bug reports.

Changing that order seems like the right thing to do. I'll put it on my TODO list for later this month (or maybe @radarek is faster!).

janlelis commented 2 years ago

Hi @matt17r, this should be fixed in v3.1.1. Thanks again for the bug report.

janlelis / unicode-emoji

Incorrect matching of several family emojis #12