neosmart / unicode.net

A Unicode library for .NET, supporting UTF8, UTF16, and UTF32. With an extra helping of emoji for good measure 🔥🌶️😁
MIT License

Split into 2 packages #16

Open igitur opened 3 years ago

igitur commented 3 years ago

Hi,

So the title says a lot and probably brings up some negative connotations, but hear me out.

I maintain ClosedXML and a user requested support for emoji, hence the other issue I logged about strong-naming the assembly.

Then we realised that the binaries are about 500kB per target, which is massive for a supposedly small little utility library. It turns out the list of enumerated emoji in Emoji-All.cs, Emoji-Basic.cs and Emoji-Emojis.cs is what makes the binary so big. If I omit those files, the binary is about 37kB, which is much more reasonable.

For our purposes, and maybe for others, we don't need the enumerated list of emoji. We just want to parse the individual "letters" in a string and check if they are emoji.

So my request / suggestion is you split this package into 2:

What is your feeling about this? This is obviously a big change and will affect all users.

On our side, the alternative is to add this repo as a git submodule and compile in only the Unicode.net files that we need, excluding the three emoji files listed above.
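
As a rough sketch of what that submodule approach might look like in an SDK-style project file (an assumption on my part, not something tested against this repo): the `external\unicode.net` checkout path, the `UnicodeNetDir` property name, and the target framework are all hypothetical, and the glob may need narrowing to the library's actual source folder so that tests or other projects in the repo are not pulled in. Only the three file names come from the sizes measured above.

```xml
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <!-- Target framework chosen only for illustration. -->
    <TargetFramework>netstandard2.0</TargetFramework>
    <!-- Hypothetical location of the git submodule; adjust to the real checkout path. -->
    <UnicodeNetDir>$(MSBuildThisFileDirectory)..\external\unicode.net\</UnicodeNetDir>
  </PropertyGroup>
  <ItemGroup>
    <!-- Compile the submodule's sources, but skip the three large enumerated-emoji files
         and any build output folders that may exist inside the checkout. -->
    <Compile Include="$(UnicodeNetDir)**\*.cs"
             Exclude="$(UnicodeNetDir)**\Emoji-All.cs;$(UnicodeNetDir)**\Emoji-Basic.cs;$(UnicodeNetDir)**\Emoji-Emojis.cs;$(UnicodeNetDir)**\bin\**;$(UnicodeNetDir)**\obj\**" />
  </ItemGroup>
</Project>
```

This would only build if nothing in the remaining files references the types defined in the excluded ones, which I have not verified beyond my own local experiment.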

Thanks

mqudsi commented 3 years ago

No negative reaction on my part. I have been mulling over splitting the library into two, with a "core Unicode" features project and a separate emoji project. The latter needs regular updates based on new TR bulletins (the repository contains a build script that converts the deliverables from the Unicode Consortium to CS files), while the core logic of converting Unicode, etc. is rather immutable.

I realize that is insufficient for your needs if you still need to interact with emoji: the data regarding the range of codepoints set aside for emoji is also derived from the TR updates, so it would need to be machine-generated and would therefore be part of the second package. The question becomes whether to ship three packages (for the sake of example: Core, Emoji, Emoji.All) or two (Core, Emoji) but with the drawback of both being machine-generated.

Let me sleep on this and get back to you. Breaking changes are OK if it's for a good reason (we haven't stabilized the API yet), but I want to make sure whatever approach we take ends up being the most sensible for future maintainability.

mqudsi commented 3 years ago

Ok, so the primary motivation for actually containing a list of emoji (filtering them by font support, identifying and coalescing duplicate entries, distinguishing between emoji with different compositions but the same glyph, etc.) was to implement an emoji picker/chooser dialog for Windows 10, before the Creators Update came out with exactly such a native control.

Revisiting the codebase now, it seems to me that this is the biggest difference between what the codebase looks like now and what typical usage would expect it to look like. I would like to restructure the project so that anything that requires a named entity goes in one package, while everything else goes in another. This does mean that some things will need to be updated in both packages on each TR from the Unicode Consortium, rather than only in one as I had originally hoped, but that's probably OK because I want to automate the entire build process (currently only the generation of the emoji lists and their properties is fully automated).