neosmart / unicode.net

A Unicode library for .NET, supporting UTF8, UTF16, and UTF32. With an extra helping of emoji for good measure 🔥🌶️😁
MIT License

Distinguish emoji: house and houses #17

Closed (ya-erm closed this issue 2 years ago)

ya-erm commented 3 years ago

These two emoji should be different:

🏠 - House (https://emojipedia.org/house/)
🏘 - Houses / House Buildings (https://emojipedia.org/houses/)

But currently Emoji.House returns the Houses emoji: https://github.com/neosmart/unicode.net/blob/982672213a4c291d33cf61afbcc25eb466e1dc51/unicode/Emoji-All.cs#L1756
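For reference, the two characters are separate code points with separate Unicode character names. The snippet below is a small illustration in Python (not the library's own C#), using only the standard `unicodedata` module; the constant names `HOUSE` and `HOUSES` are just labels chosen for this example.

```python
import unicodedata

# The two emoji from the report, written as escapes so no variation
# selector can sneak in from copy/paste.
HOUSE = "\U0001F3E0"   # 🏠
HOUSES = "\U0001F3D8"  # 🏘

for ch in (HOUSE, HOUSES):
    # Print the code point and the formal Unicode character name.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Any lookup that maps both names to the same value therefore loses one of the two characters.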

mqudsi commented 3 years ago

Hello @ya-erm and thanks for the bug report!

My first thought was that the missing emoji came from a newer Unicode TR. House is indeed from Unicode 6.0 and Houses is from Unicode 7.0 (four years later), but both of those versions are supported.

We use an algorithm to heuristically clean up the names in the Unicode TR data and normalize the input; it's possible that it's collapsing the two values into one.
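To make that failure mode concrete, here is a rough sketch of how a name cleanup step could merge two entries. It is written in Python for brevity and is not the importer this library actually uses: `to_identifier`, `build_table`, the plural-stripping rule, and the raw names are all assumptions made up for illustration.

```python
import re

def to_identifier(raw_name: str) -> str:
    """Hypothetical cleanup: PascalCase the words, then drop a trailing 's'."""
    words = re.findall(r"[A-Za-z0-9]+", raw_name)
    ident = "".join(w.capitalize() for w in words)
    return ident[:-1] if ident.endswith("s") else ident

def build_table(entries):
    """Map identifier -> code point and record any collisions along the way."""
    table, collisions = {}, []
    for code_point, raw_name in entries:
        ident = to_identifier(raw_name)
        if ident in table and table[ident] != code_point:
            collisions.append((ident, table[ident], code_point))
        table[ident] = code_point  # last write wins, as in a naive generator
    return table, collisions

# Illustrative input only; not copied from the real data file.
entries = [(0x1F3E0, "house"), (0x1F3D8, "houses")]
table, collisions = build_table(entries)
print(hex(table["House"]), collisions)
```

In this sketch the later entry overwrites the earlier one, so `House` ends up pointing at the Houses code point. Recording collisions during generation, rather than overwriting silently, is one way to surface this kind of problem at build time.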

mqudsi commented 3 years ago

Actually, it's more complicated than that. The raw data that the current release was generated against is importers/emoji-test.txt, and you can see both definitions in that file:

Houses:

https://github.com/neosmart/unicode.net/blob/982672213a4c291d33cf61afbcc25eb466e1dc51/importers/emoji-test.txt#L2432

House:

https://github.com/neosmart/unicode.net/blob/982672213a4c291d33cf61afbcc25eb466e1dc51/importers/emoji-test.txt#L2438

The UTR lists both using the same name, and we're not manually distinguishing between them.

mqudsi commented 3 years ago

The file emoji-variation-sequence.txt contains data that can be used to distinguish between them further.
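As a sketch of what that could look like, the snippet below (Python, for illustration only) reads lines laid out like Unicode's variation-sequence data and keys each base code point by the character name in the trailing comment. The `SAMPLE` lines are written by hand in that general layout; whether both code points appear in the repository's copy of the file is an assumption, and `parse` is a hypothetical helper, not part of the library.

```python
import re

# Hand-written sample in the general layout of the variation-sequence data.
SAMPLE = """\
1F3D8 FE0E  ; text style;  # (7.0) HOUSE BUILDINGS
1F3D8 FE0F  ; emoji style; # (7.0) HOUSE BUILDINGS
1F3E0 FE0E  ; text style;  # (6.0) HOUSE BUILDING
1F3E0 FE0F  ; emoji style; # (6.0) HOUSE BUILDING
"""

LINE = re.compile(
    r"^([0-9A-F]+) (FE0E|FE0F)\s*;\s*(text|emoji) style;\s*#\s*\(([\d.]+)\)\s*(.+)$"
)

def parse(text):
    """Return {base code point: character name} from variation-sequence lines."""
    names = {}
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if not m:
            continue  # skip blank lines and pure comments
        names[int(m.group(1), 16)] = m.group(5).strip()
    return names

names = parse(SAMPLE)
for cp, name in sorted(names.items()):
    print(f"U+{cp:04X} {name}")
```

Because the character names differ (singular versus plural), they give the generator a second key to fall back on when the cleaned-up names collide.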

mqudsi commented 2 years ago

A new version of the package has been uploaded to nuget.org with the fix for this issue.