Variant selector inconsistencies

milesj / emojibase

🎮 A collection of lightweight, up-to-date, pre-generated, specification compliant, localized emoji JSON datasets, regex patterns, and more.

https://emojibase.dev

MIT License


Variant selector inconsistencies #177

Open chop-suey opened 1 week ago

chop-suey commented 1 week ago

There seem to be some inconsistencies in the generated metadata (e.g. packages/data/en/data.raw.json).

In some cases the hexcode is missing the variation selector-16 (FE0F), even though the Unicode data includes it.

Examples:

Entry for "person in suit levitating"
- Should be 1F574-FE0F according to unicode
- hexcode is 1F574
- emoji contains the sequence 1F574-FE0F
Entry for "umbrella with rain drops"
- Should be 2614 according to unicode
- hexcode is 2614
- emoji contains the sequence 2614-FE0F
There are many more examples:

As it stands, I never know which property is the source of truth. Am I missing something, or is this an error in the data?
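
For reference, a minimal sketch of how such mismatches could be listed. It assumes each entry exposes label, hexcode and emoji fields as in the examples above; the RawEmoji interface, toHexcode and findMismatches are hypothetical names used for illustration and are not part of emojibase.

```typescript
// Hypothetical, trimmed-down shape of an entry in data.raw.json (assumption).
interface RawEmoji {
  label: string;
  hexcode: string;
  emoji: string;
}

// Convert a string into the hyphen-joined, upper-case hex code points it contains,
// zero-padded to at least four digits (e.g. "2614-FE0F").
function toHexcode(text: string): string {
  return Array.from(text, (ch) => {
    const hex = ch.codePointAt(0)!.toString(16).toUpperCase();
    return hex.length < 4 ? "0000".slice(hex.length) + hex : hex;
  }).join("-");
}

// Return the entries whose hexcode differs from the sequence stored in emoji.
function findMismatches(entries: RawEmoji[]): RawEmoji[] {
  return entries.filter((entry) => toHexcode(entry.emoji) !== entry.hexcode);
}

// The two entries described above, written with explicit escapes.
const sample: RawEmoji[] = [
  { label: "person in suit levitating", hexcode: "1F574", emoji: "\u{1F574}\uFE0F" },
  { label: "umbrella with rain drops", hexcode: "2614", emoji: "\u2614\uFE0F" },
];

for (const entry of findMismatches(sample)) {
  console.log(`${entry.label}: hexcode=${entry.hexcode}, emoji=${toHexcode(entry.emoji)}`);
}
```

Both sample entries are reported, because the emoji field carries FE0F in each case while the hexcode does not.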

milesj commented 1 week ago

It's been a while since I've worked on this, but the emoji and text fields are the source of truth, while hexcode is either unqualified, qualified, or the default variant, I think. It's the value parsed from the left column of these data files: https://github.com/milesj/emojibase/blob/master/packages/generator/src/parsers/parseData.ts#L38

chop-suey commented 1 week ago

But the emoji and text fields still do not always contain the correct sequence; see my example for "umbrella with rain drops". I just realized there are also other representations of emoji in https://github.com/milesj/emojibase/blob/master/packages/data/meta/hexcodes.json. Is the hexcode in data.raw.json supposed to be used as the key to get the matching mapping in hexcodes.json?

The entry for "umbrella with rain drops" in hexcodes.json looks like this:

"2614": {
  "2614": 0,
  "2614-FE0F": 0,
  "2614-FE0E": 0
}

According to this, all the entries are fully qualified, but in https://www.unicode.org/Public/emoji/15.1/emoji-test.txt it looks like only 2614 should be treated as fully qualified.
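
The comparison can be sketched roughly as follows. This assumes the emoji-test.txt line format "code points ; status # comment"; the sample lines are abbreviated stand-ins and should be checked against the real file. parseEmojiTest and describeVariants are hypothetical helpers, not emojibase functions, and the meaning of the 0 values in hexcodes.json is not interpreted here.

```typescript
// Abbreviated sample lines in the emoji-test.txt format (assumption: shortened
// by hand; the real file has padded columns and the emoji in the comment).
const emojiTestSample = [
  "# group: example",
  "1F574 FE0F ; fully-qualified # E0.7 person in suit levitating",
  "1F574 ; unqualified # E0.7 person in suit levitating",
  "2614 ; fully-qualified # E0.6 umbrella with rain drops",
].join("\n");

// Map each hyphen-joined hexcode to the qualification status on its line.
function parseEmojiTest(content: string): Map<string, string> {
  const statuses = new Map<string, string>();
  for (const line of content.split("\n")) {
    // Everything after "#" is a comment; code points never contain "#".
    const hash = line.indexOf("#");
    const data = (hash === -1 ? line : line.slice(0, hash)).trim();
    const sep = data.indexOf(";");
    if (sep === -1) continue; // blank line or comment-only line
    const hexcode = data.slice(0, sep).trim().split(/\s+/).join("-");
    statuses.set(hexcode, data.slice(sep + 1).trim());
  }
  return statuses;
}

// Describe each variant key by its status in emoji-test.txt, if it is listed.
function describeVariants(variants: string[], statuses: Map<string, string>): string[] {
  return variants.map((v) => `${v}: ${statuses.get(v) ?? "not listed"}`);
}

// The hexcodes.json entry quoted above.
const hexcodesEntry: Record<string, number> = {
  "2614": 0,
  "2614-FE0F": 0,
  "2614-FE0E": 0,
};

const statuses = parseEmojiTest(emojiTestSample);
const report = describeVariants(Object.keys(hexcodesEntry), statuses);
console.log(report.join("\n"));
```

With the sample lines above, only 2614 resolves to a status; the FE0F and FE0E variants come back as "not listed".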