neosmart / unicode.net

A Unicode library for .NET, supporting UTF8, UTF16, and UTF32. With an extra helping of emoji for good measure 🔥🌶️😁
MIT License

More emoji groups and subgroups #7

Open COM8 opened 5 years ago

COM8 commented 5 years ago

I would suggest splitting all emoji into their groups and subgroups, like it's done here. At the moment we only have All and Basic.

I think I could extend the emoji-importer.html for this.

COM8 commented 5 years ago

I've written a simple parser for the emoji-test.txt files: Emoji-List-Parser.

Once #6 has been merged I will create a PR that updates everything to Unicode 12.0 and adds more (sub)groups.
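
For readers unfamiliar with the file, here is a minimal sketch of how group and subgroup information can be read out of emoji-test.txt. It is not the Emoji-List-Parser code or the importer's logic. It assumes the usual line shape of `<code points> ; <status> # <emoji> <name>`, with an optional `E<version>` token before the name, and `# group:` / `# subgroup:` comment lines. The names `EmojiEntry` and `parse_emoji_test` are made up for illustration.

```python
import re
from dataclasses import dataclass

# Assumed entry shape: "<code points> ; <status> # <emoji> [E<version>] <name>"
_ENTRY = re.compile(
    r"^(?P<cps>[0-9A-F ]+?)\s*;\s*(?P<status>[a-z-]+)\s*#\s*"
    r"(?P<emoji>\S+)\s+(?:E\d+\.\d+\s+)?(?P<name>.+)$"
)


@dataclass(frozen=True)
class EmojiEntry:  # hypothetical record type
    sequence: str
    status: str
    name: str
    group: str
    subgroup: str


def parse_emoji_test(text):
    """Return one EmojiEntry per data line, tagged with the enclosing group/subgroup."""
    group = subgroup = ""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# group:"):
            group = line[len("# group:"):].strip()
        elif line.startswith("# subgroup:"):
            subgroup = line[len("# subgroup:"):].strip()
        elif line and not line.startswith("#"):
            m = _ENTRY.match(line)
            if m:
                # Hex code points -> the actual string (may span several code points)
                seq = "".join(chr(int(cp, 16)) for cp in m["cps"].split())
                entries.append(
                    EmojiEntry(seq, m["status"], m["name"].strip(), group, subgroup)
                )
    return entries
```

Lines that do not match the assumed shape are skipped silently here; a real importer would probably want to report them.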

mqudsi commented 5 years ago

That's not a bad idea, but it's going to be a bit more complicated than that because the purpose of the Basic list is to provide emoji that can be displayed in an emoji picker, without variants, and more importantly, filtering for native support (currently under Windows 10).

.NET Core is of course cross-platform, so the choice of Segoe UI as the font that determines whether or not a glyph is supported is no longer a no-brainer. It should probably be factored out to either a property of the font/platform or a separate table specifically for the font.

In all cases, we definitely should extend emoji-importer.html rather than starting from scratch as it already checks for font support and, more importantly, it can be run without any installed dependencies (aside from a web browser) -- although it isn't fully automated or CI-ready as it currently requires user intervention.
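
One way to picture the "separate table specifically for the font" option is a lookup keyed by font name, sketched below. The table contents are placeholder values, not real support data, and `FONT_SUPPORT` and `filter_supported` are hypothetical names rather than anything in unicode.net.

```python
# Placeholder data only: which sequences each font is assumed to render
# as a single glyph. Real tables would come from a font-probing step.
FONT_SUPPORT = {
    "Segoe UI": frozenset({"\U0001F600", "\U0001F525"}),
    "some-other-font": frozenset({"\U0001F600", "\U0001F525", "\U0001F9A6"}),
}


def filter_supported(emoji, font):
    """Keep only the sequences the given font is known to render."""
    supported = FONT_SUPPORT.get(font)
    if supported is None:
        # Assumption: with no table for this font, return everything unfiltered.
        return list(emoji)
    return [e for e in emoji if e in supported]
```

Keeping the table outside the emoji definitions means adding a platform is a data change rather than a change to every emoji entry.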

COM8 commented 5 years ago

I don't really get why the step of filtering emoji is necessary. People are able to define their own fonts with symbols for, for example, Unicode 13 before MS releases a new version of its font supporting all the new emoji.

Or if you run a newer version of Windows (Insider Previews/...), you wouldn't have access to all emoji since they were not supported on your PC.

It also makes the collection incomplete. I'm a fan of providing all possibilities to the user, who should then decide in their app whether to keep/show those unsupported emoji or not.

Regarding emoji-importer.html: sure, we should still extend it, but I'm probably not the right guy for this. I was tinkering around with it, and since I'm not a fan of web stuff I decided to write my own parser myself, supporting all the features required for more (sub)groups.

mqudsi commented 5 years ago

I'm not sure if you realized, but the filtered emoji list is only in addition to the full emoji list.

Think about when it would be necessary to show a list of all emoji (vs looking up an emoji by its Unicode sequence or vice versa). The context is a native application providing an interface for a user to enter an emoji into an input by presenting a list of emoji. You would never want to show an emoji that does not render, displays as tofu, or displays broken as two separate emoji rather than the intended single emoji.

The reason why this is precompiled into the application is that it is resource intensive to determine whether or not an emoji can be correctly rendered as a single glyph in a particular font, and there's no native way of figuring that out at runtime in a cross-platform manner without introducing some serious (unmanaged!) dependencies.

Of course no one is required or even asked to use the filtered list of emoji rather than the full list when developing their application - but the list is there for those that need such a feature.


I have some updates for the importer locally committed that I need to push out. The importer itself doesn't need a lot of work, and updating to a newer version of the Unicode spec is as simple as replacing the emoji-test file with the latest (presuming there aren't any lexical changes needed).

There's also significant logic in the importer to create the list of keywords from the names of emoji, to convert emoji names to useful and friendly symbol names, etc. all of which was only developed because it was necessary and isn't there just for show.

I'm more than happy to adapt the importer to include the subgroup info; I'm just debating whether or not to introduce separate lists for each subgroup or to include the subgroup as a property of the emoji.
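
The trade-off can be sketched as follows: if the subgroup is stored as a property of each emoji, per-subgroup lists can be derived on demand instead of being maintained as separate hand-written lists. This is an illustration only; the `Emoji` type and `by_subgroup` helper are hypothetical and do not mirror the library's C# types.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Emoji:  # hypothetical stand-in for an emoji record
    sequence: str
    name: str
    group: str
    subgroup: str


def by_subgroup(emoji):
    """Bucket emoji by (group, subgroup), preserving input order within each bucket."""
    buckets = defaultdict(list)
    for e in emoji:
        buckets[(e.group, e.subgroup)].append(e)
    return dict(buckets)
```

With this shape, the roughly 97 subgroup lists never need to exist as named members; callers ask for the bucket they want.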

COM8 commented 5 years ago

Ok thanks! I get it.

Over the day I was working on extending my parser and added group, subgroup and skinTone support to my fork of unicode.net.

I also added 10 new lists for all groups. Adding lists for every subgroup (~97) would be a little bit too much, I think.

COM8 commented 5 years ago

Btw, I think you should remove the seguiemj.ttf file from the repo, since (correct me if I'm wrong 😉) you probably do not have the rights to publish it here; you have to buy it if you are not on Windows.

mqudsi commented 5 years ago

Nice. I just pushed some commits with major updates to the parser, including added support for group and subgroup. I haven't updated the C# assets yet.

I have a few concerns regarding keeping these lists in memory at all times. I've actually been wondering whether it would be better to change the SingleEmoji instances from fields to properties so that they have ~zero memory overhead until invoked. That leaves the question of whether lists that can be determined at runtime without resorting to reflection (i.e. not the font-filtered list or Emoji.All) should be generated dynamically via Linq (which the .NET Core team in general has now taken to eschewing for performance reasons).

Most likely the best compromise is for them to be dynamically generated on first access and then cached thereafter.
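
The "generate on first access, then cache" compromise looks roughly like this. The sketch uses Python's `functools.cached_property` purely to show the pattern; in C# the same idea could be expressed with `Lazy<T>` or a null-checked backing field. `EmojiCatalog` and its `builds` counter are hypothetical and exist only to make the caching visible.

```python
from functools import cached_property


class EmojiCatalog:  # hypothetical container, not a unicode.net type
    def __init__(self, entries):
        # entries: iterable of (sequence, group) pairs
        self._entries = tuple(entries)
        self.builds = 0  # counts how often the grouped view is computed

    @cached_property
    def by_group(self):
        """Built on first access; later accesses return the cached dict."""
        self.builds += 1
        groups = {}
        for sequence, group in self._entries:
            groups.setdefault(group, []).append(sequence)
        return groups
```

Nothing is allocated for the grouped view until a caller touches `by_group`, and the grouping work is done at most once per catalog instance.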