neosmart / unicode.net

A Unicode library for .NET, supporting UTF8, UTF16, and UTF32. With an extra helping of emoji for good measure πŸ”₯🌢️😁

Unicode.net - an emoji and text-processing library for .NET

Unicode.net is an easy-to-use Unicode text-processing library for .NET, designed to complement the BCL and the System.String class, usable on both .NET Framework and .NET Core/UWP (.NET Standard) targets. As an added bonus, Unicode.net includes an extra helping of emoji awesomeness πŸŽ‰ 😊 πŸ˜„.

Unicode.net is available on NuGet for all .NET platforms and versions, and is made open source by NeoSmart Technologies under the terms of the MIT License. Contributions are welcomed and appreciated.

.NET is not natively Unicode-aware. While the API has full support for internationalization by way of UTF-16 strings, and is capable of passing around Unicode-encoded text and carrying out operations involving non-English/non-ASCII text data, the interface is almost exclusively a black box, and the abstraction fails once attempts are made to actually access the underlying string data (i.e. indexing a Unicode string containing non-ASCII data returns individual 16-bit values rather than complete Unicode sequences referring to letters or symbols).
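
For example, here is a minimal, BCL-only illustration of that leaky abstraction (no Unicode.net involved): indexing a string that contains an emoji hands back the two halves of a UTF-16 surrogate pair rather than the symbol itself.

```csharp
using System;

class BlackBoxDemo
{
    static void Main()
    {
        var text = "aπŸ™ˆ"; // 'a' plus the see-no-evil monkey emoji (U+1F648)

        // Length counts UTF-16 code units, not visible symbols: prints 3, not 2.
        Console.WriteLine(text.Length);

        // Indexing past the ASCII 'a' yields the two halves of a surrogate pair,
        // neither of which is a meaningful character on its own.
        Console.WriteLine((int)text[1]); // 55357 (0xD83D, high surrogate)
        Console.WriteLine((int)text[2]); // 56904 (0xDE48, low surrogate)
    }
}
```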

The Unicode.net puzzle pieces

It's best to have a basic understanding of how the various components of the Unicode.net library fit together before diving into the code. Unicode.net was purposely designed with the .NET Framework and the C# language in mind: it is meant to complement, not supplant, the existing string manipulation functions. Most importantly, Unicode.net does not do away with the string class; instead, it embraces and extends it with more Unicode goodness (C#'s extension methods make this beautifully easy).

A quick Unicode primer

It is unfortunately impossible to use this library without having a very basic understanding of text encoding and Unicode in general. While this section may be extended in the future, for now, a basic understanding of how text must be encoded according to a particular specification to form strings out of a sequence of bytes is a necessary prerequisite. Unicode is one such specification, and it is the primary standard for representing non-English content in binary form.

Currently, any "letter" in any language can be expressed as a sequence of one or more Unicode codepoints. A codepoint is the basic building block of a Unicode string, somewhat like how an 8-bit character can be considered the basic building block of an ASCII string. A Unicode codepoint is 32 bits (4 bytes) in length. Unicode itself is not an encoding per se, but rather comprises three different encodings: UTF-8, UTF-16 (what .NET refers to everywhere as Unicode), and UTF-32. These encode a single Unicode codepoint as sequences of 8-bit, 16-bit, and 32-bit units respectively. Unicode codepoints "small enough" to fit in a single byte can be represented as just one UTF-8 "character," just as those "small enough" to fit in 2 bytes can be represented as just one UTF-16 "character," and (currently) any Unicode codepoint can be represented as a single UTF-32 value. But codepoints "too big" to fit in a single 8-bit unit must be "split up" into separate 8-bit components to be represented in UTF-8, and the same goes for those too big to fit in a 16-bit unit for UTF-16, etc.
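
As a quick, BCL-only illustration of those size differences, the standard System.Text.Encoding classes report how many bytes each encoding needs for a "small" codepoint versus a "big" one:

```csharp
using System;
using System.Text;

class EncodingSizes
{
    static void Main()
    {
        // 'A' is U+0041: small enough to fit in a single 8-bit unit.
        // πŸ™ˆ is U+1F648: too big for a single 8-bit or 16-bit unit.
        string a = "A";
        string monkey = char.ConvertFromUtf32(0x1F648);

        Console.WriteLine(Encoding.UTF8.GetByteCount(a));         // 1 (one 8-bit unit)
        Console.WriteLine(Encoding.UTF8.GetByteCount(monkey));    // 4 (four 8-bit units)

        Console.WriteLine(Encoding.Unicode.GetByteCount(a));      // 2 (one 16-bit unit)
        Console.WriteLine(Encoding.Unicode.GetByteCount(monkey)); // 4 (two 16-bit units: a surrogate pair)

        Console.WriteLine(Encoding.UTF32.GetByteCount(a));        // 4 (one 32-bit unit)
        Console.WriteLine(Encoding.UTF32.GetByteCount(monkey));   // 4 (still one 32-bit unit)
    }
}
```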

A Unicode sequence is a, well, sequence of one or more Unicode codepoints (in any of the three Unicode encodings mentioned above). Such a sequence can be used to represent just one symbol (such as the Arabic ﻉ or the see-no-evil πŸ™ˆ emoji), or it can represent multiple such letters forming a sentence (i.e. a string). There exists a direct mapping from UnicodeSequence to System.String, though it is important to note that this mapping is not unique, in that multiple, different UnicodeSequence values can map to a single string (here the concept of string normalization comes into play, where there is only one "canonical" Unicode representation for any given string, but that is outside the scope of this primer).
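
To see that non-uniqueness with nothing but the BCL: the letter Γ© can be written either as the single codepoint U+00E9 or as an ASCII e followed by the combining acute accent U+0301. Both render as the same symbol, but they only compare equal after normalization.

```csharp
using System;

class NormalizationDemo
{
    static void Main()
    {
        string precomposed = "\u00E9";  // Γ© as the single codepoint U+00E9
        string decomposed  = "e\u0301"; // 'e' followed by the combining acute accent U+0301

        // Both render as the same symbol, but the underlying sequences differ.
        Console.WriteLine(precomposed == decomposed);             // False
        Console.WriteLine(precomposed.Length);                    // 1
        Console.WriteLine(decomposed.Length);                     // 2

        // Normalizing to the canonical (NFC) form makes the two compare equal.
        Console.WriteLine(precomposed == decomposed.Normalize()); // True
    }
}
```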

The basic components of Unicode.net

The main classes of Unicode.net have actually already been covered: they are Codepoint and UnicodeSequence. A UnicodeSequence is made up of one or more Codepoint values, and it maps directly to (and from) a System.String.

A single, encoding-agnostic Codepoint can in turn be represented in any of the UTF-8, UTF-16, and UTF-32 encodings.

And that's all you really need to know to get started!

Quick Unicode.net function reference

The below is only a primer on the primary features of Unicode.net, covering the interfaces new developers are most likely to be interested in when first encountering this library. See the documentation for the complete API reference.

Extension methods for System.String:
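
As a rough usage sketch only (the namespace and the Codepoints()/Letters() names are assumptions here; the definitive list of extension methods is in the API reference), the idea is to go from a plain string to its Unicode building blocks:

```csharp
// Sketch only: the namespace and the Codepoints()/Letters() extension method names
// are assumptions; verify them against the Unicode.net API reference.
using System;
using NeoSmart.Unicode;

class ExtensionSketch
{
    static void Main()
    {
        string text = "naΓ―ve πŸ™ˆ";

        // Enumerate the Unicode codepoints behind the string, rather than raw UTF-16 chars.
        foreach (var codepoint in text.Codepoints())
        {
            Console.WriteLine(codepoint);
        }

        // Enumerate complete, user-visible "letters" (one entry per symbol, emoji included).
        foreach (var letter in text.Letters())
        {
            Console.WriteLine(letter);
        }
    }
}
```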

Class Codepoint

The Codepoint class, representing a single Unicode codepoint in an encoding-agnostic format:
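
A minimal sketch of how that might look in practice, assuming a Codepoint can be constructed from its 32-bit value and projected into the individual encodings (the exact constructors and member names are assumptions; consult the API reference):

```csharp
// Sketch only: the constructor and the AsUtf8()/AsUtf16()/AsString() members shown
// here are assumptions about the API shape, not a verified listing.
using System;
using NeoSmart.Unicode;

class CodepointSketch
{
    static void Main()
    {
        var monkey = new Codepoint(0x1F648); // πŸ™ˆ, constructed from its 32-bit value

        // An encoding-agnostic codepoint can be projected into any of the three encodings.
        Console.WriteLine(string.Join(" ", monkey.AsUtf8()));  // four 8-bit values
        Console.WriteLine(string.Join(" ", monkey.AsUtf16())); // two 16-bit values (a surrogate pair)
        Console.WriteLine(monkey.AsString());                  // back to a System.String
    }
}
```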

Class UnicodeSequence

The UnicodeSequence class, representing a Unicode-encoded string in an encoding-agnostic format that can be decomposed into its individual Codepoint values:
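
A usage sketch under the assumption that a UnicodeSequence can be built from a System.String, enumerated as Codepoint values, and mapped back to a string (the member names are assumptions; see the API reference):

```csharp
// Sketch only: the constructor, Codepoints, and AsString() shown here are assumptions
// about the API shape rather than a verified listing.
using System;
using NeoSmart.Unicode;

class SequenceSketch
{
    static void Main()
    {
        var sequence = new UnicodeSequence("ﻉπŸ™ˆ");

        // Walk the sequence codepoint by codepoint, independent of any particular encoding.
        foreach (Codepoint codepoint in sequence.Codepoints)
        {
            Console.WriteLine(codepoint);
        }

        // Map it back to a regular System.String when interoperating with the BCL.
        Console.WriteLine(sequence.AsString());
    }
}
```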

The Unicode.net Emoji API

What's the point of a Unicode text-processing library that does not provide an API for dealing with emoji? After all, emoji are probably the single-biggest driver behind Unicode adoption in recent years!

Class Emoji

The static Emoji class is the main entry point for dealing with emoji in Unicode.net.
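
A brief sketch of the sort of checks it exposes, assuming members along the lines of IsEmoji and the All collection (exact names and overloads should be confirmed against the emoji API reference):

```csharp
// Sketch only: IsEmoji and All are assumed member names; confirm against the emoji docs.
using System;
using System.Linq;
using NeoSmart.Unicode;

class EmojiSketch
{
    static void Main()
    {
        // Test whether a piece of text consists solely of emoji.
        Console.WriteLine(Emoji.IsEmoji("πŸ”₯πŸ™ˆ"));     // expected: True
        Console.WriteLine(Emoji.IsEmoji("hello πŸ”₯")); // expected: False

        // Enumerate a few of the emoji the library knows about.
        foreach (var emoji in Emoji.All.Take(5))
        {
            Console.WriteLine(emoji);
        }
    }
}
```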

Class SingleEmoji

The SingleEmoji class is a representation of a single "emoji," where "emoji" is any Unicode sequence made up of one or more basic emoji sequences that should be represented by a single glyph, per the UTR #51 spec. Again, depending on your platform and font and the emoji they support, a SingleEmoji may either have no representation or be represented as a sequence of one or more individual emoji.

Important note: this class is called SingleEmoji and the "master" emoji class is called Emoji because we firmly believe that "emoji" (a foreign word derived from the Japanese γˆγ‚‚γ˜) takes a zero plural marker, which is to say, it has no plural form distinct from its singular form. The plural of "emoji" is "emoji" and absolutely never "emojis," which is quite simply not a word at all.

That said, the SingleEmoji class contains all the information needed to represent a single glyph from the UTR spec, and to interact with its individual Unicode codepoints via the Unicode API described elsewhere in these docs:
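
As a final, rough sketch of what that looks like, assuming members such as Name and Sequence on SingleEmoji (the actual field list is in the API reference):

```csharp
// Sketch only: the Name and Sequence members (and Emoji.All) are assumptions about
// the API shape; check the SingleEmoji documentation for the definitive surface.
using System;
using System.Linq;
using NeoSmart.Unicode;

class SingleEmojiSketch
{
    static void Main()
    {
        foreach (SingleEmoji emoji in Emoji.All.Take(3))
        {
            // Each SingleEmoji carries a human-readable name and the underlying
            // UnicodeSequence, so the codepoint-level API is only one step away.
            Console.WriteLine($"{emoji} ({emoji.Name})");
            foreach (Codepoint codepoint in emoji.Sequence.Codepoints)
            {
                Console.WriteLine($"  {codepoint}");
            }
        }
    }
}
```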