hexawyz / NetUnicodeInfo

Unicode Character Inspector & Library providing a subset of the Unicode data for .NET clients.
https://www.nuget.org/packages/UnicodeInformation/
MIT License
59 stars 11 forks

Feature request: Searching by name #1

Open themobydisk opened 8 years ago

themobydisk commented 8 years ago

I want to make an app that searches Unicode characters by name. It looks like the UnicodeInfo class only lets me look up information by character. For example, if I want to find a Unicode music note, I want to call something like:

// Returns 2669 - 266C, 1D13B - 1D164, 1F3B5, etc.
IEnumerable<int> matchingCharacters = UnicodeInfo.FindByName("note");

hexawyz commented 8 years ago

Hello,

What you are requesting is a full-text index on name data. Providing a full-text search algorithm is well outside the scope of this library, but there are tools that can provide this feature, such as SQLite. If you wish, you can already build such an index yourself by requesting the name for every possible code point (0x0000 to 0x10FFFF).

I may implement a few helper methods for scenarios like this one, allowing you to enumerate things like valid code points or known names, but that would be it.
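The do-it-yourself index described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `NameWordIndex`, `Build`, and `Find` are hypothetical names that do not exist in the library, and the input is assumed to be a set of code point/name pairs collected beforehand by asking the library for the name of each code point.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of UnicodeInformation: maps each word of a
// character name to the code points whose names contain that word.
public static class NameWordIndex
{
    public static Dictionary<string, List<int>> Build(
        IEnumerable<KeyValuePair<int, string>> names)
    {
        var index = new Dictionary<string, List<int>>(StringComparer.OrdinalIgnoreCase);
        var separators = new[] { ' ', '-' };

        foreach (var pair in names)
        {
            // Character names are split into words on spaces and hyphens.
            foreach (var word in pair.Value.Split(separators, StringSplitOptions.RemoveEmptyEntries))
            {
                if (!index.TryGetValue(word, out var codePoints))
                {
                    codePoints = new List<int>();
                    index[word] = codePoints;
                }

                // Avoid a duplicate entry when a word repeats within one name.
                if (codePoints.Count == 0 || codePoints[codePoints.Count - 1] != pair.Key)
                {
                    codePoints.Add(pair.Key);
                }
            }
        }

        return index;
    }

    // Returns the code points whose name contains the given whole word.
    public static IReadOnlyList<int> Find(Dictionary<string, List<int>> index, string word)
    {
        return index.TryGetValue(word, out var hits) ? hits : new List<int>();
    }
}
```

With names such as "QUARTER NOTE" (U+2669) and "EIGHTH NOTE" (U+266A) in the input, `Find(index, "note")` would return both code points. Note that this matches whole words only, so "note" would not find "BEAMED EIGHTH NOTES"; substring or prefix matching needs a different structure.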

themobydisk commented 8 years ago

Okay, fair enough. Probably not a good fit for this project. Thanks for the reply!

KirillOsenkov commented 7 years ago

Actually you'll be surprised how simple it is, even without a full text search engine. Here's the sample code that works for me:

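        // Maps each code point to its character name.
        private Dictionary<int, string> descriptions = new Dictionary<int, string>();

        private void BuildUnicodeList()
        {
            var blocks = UnicodeInfo.GetBlocks();

            foreach (var block in blocks)
            {
                foreach (var codepoint in block.CodePointRange)
                {
                    // Skip the surrogate range (U+D800 to U+DFFF). Compare the int directly:
                    // casting to char would truncate code points above U+FFFF.
                    if (codepoint >= 0xD800 && codepoint <= 0xDFFF)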
                    {
                        continue;
                    }

                    var charInfo = UnicodeInfo.GetCharInfo(codepoint);
                    var displayText = charInfo.Name;
                    if (displayText != null)
                    {
                        descriptions[codepoint] = displayText;
                    }
                }
            }
        }

...
            var sb = new StringBuilder();
            int hitcount = 0;
            foreach (var d in descriptions)
            {
                if (hitcount > 20)
                {
                    return sb.ToString();
                }

                if (d.Value.IndexOf(input, StringComparison.OrdinalIgnoreCase) > -1)
                {
                    // AppendLine takes a string, so format the code point as hex (e.g. 266A).
                    sb.AppendLine(d.Key.ToString("X4"));
                    hitcount++;
                }
            }

            if (sb.Length > 0)
            {
                return sb.ToString();
            }

KirillOsenkov commented 7 years ago

The performance on my machine is about 70-80 ms per lookup. An in-memory trie or other index could speed this up significantly, but if you're OK with these numbers, it works great and is super simple.
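As one possible illustration of the "other index" idea, here is a minimal sketch of a sorted word table searched with binary search. It is an assumption-laden example, not the code from any project linked in this thread: `PrefixIndex` and `FindByPrefix` are hypothetical names, and the input is assumed to be the same code point/name pairs as in the `descriptions` dictionary above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: a sorted (word, code point) table searched by word
// prefix with binary search. Not taken from any project linked in this thread.
public sealed class PrefixIndex
{
    private readonly (string Word, int CodePoint)[] entries;

    public PrefixIndex(IEnumerable<KeyValuePair<int, string>> names)
    {
        var list = new List<(string Word, int CodePoint)>();
        foreach (var pair in names)
        {
            foreach (var word in pair.Value.ToUpperInvariant().Split(' ', '-'))
            {
                if (word.Length > 0)
                {
                    list.Add((word, pair.Key));
                }
            }
        }

        // Sort by word (ordinal), then by code point for a stable result order.
        list.Sort((a, b) =>
        {
            int c = string.CompareOrdinal(a.Word, b.Word);
            return c != 0 ? c : a.CodePoint.CompareTo(b.CodePoint);
        });
        entries = list.ToArray();
    }

    public List<int> FindByPrefix(string prefix, int maxHits = 20)
    {
        prefix = prefix.ToUpperInvariant();

        // Binary search for the first entry whose word is >= prefix.
        int lo = 0, hi = entries.Length;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (string.CompareOrdinal(entries[mid].Word, prefix) < 0) lo = mid + 1;
            else hi = mid;
        }

        // All words sharing the prefix are contiguous from that position.
        var hits = new List<int>();
        for (int i = lo; i < entries.Length && hits.Count < maxHits; i++)
        {
            if (!entries[i].Word.StartsWith(prefix, StringComparison.Ordinal)) break;
            if (!hits.Contains(entries[i].CodePoint)) hits.Add(entries[i].CodePoint);
        }
        return hits;
    }
}
```

The trade-off is that this finds only words that start with the query, whereas the `IndexOf` scan above matches any substring of the full name. Whether it is actually faster for a given data set would need to be measured.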

hexawyz commented 7 years ago

Nice solution with so little code. 👍 It would likely be enough for most scenarios.

I did write some code that you can use to create an index of Unicode characters, but it's not production ready. (@themobydisk, I apologize to you. I had totally forgotten about that issue… :( )

You can try it and/or benchmark it if you want: https://gist.github.com/GoldenCrystal/0071772cd111ac4b45b21470f1ac101f

It needs a bit of cleaning, but as far as I remember, the code was working. Once it is cleaned up a bit, I will include it as an example instead of letting it rot on my hard drive…

KirillOsenkov commented 7 years ago

BTW I'm using my naive lookup algorithm here: http://quickinfo.io/?char%20cherries

You can try searching for various emoji names, paste emoji to view their info, etc.

KirillOsenkov commented 6 years ago

FYI I've now implemented fast indexed lookup of Unicode characters here: https://github.com/KirillOsenkov/QuickInfo/blob/a1b9880c0beeaaa1472d14d78e2799399795c657/src/QuickInfo/Processors/Unicode.cs#L103-L113

It requires creating an index like this: https://github.com/KirillOsenkov/QuickInfo/blob/a1b9880c0beeaaa1472d14d78e2799399795c657/src/QuickInfo/Processors/Unicode.cs#L178

The helper code is here: https://github.com/KirillOsenkov/QuickInfo/blob/master/src/QuickInfo/Utilities/SortedSearch.cs

Hope this helps!

AndreasVolkmann commented 2 months ago

I really just want an easy Name -> Char lookup. No need for search.
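An exact Name -> Char lookup could be approximated with a plain dictionary, assuming the code point/name pairs have already been collected as in the `descriptions` dictionary earlier in this thread. The sketch below is hypothetical: `NameLookup`, `Build`, and `FindChar` are illustrative names, not library APIs.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of UnicodeInformation: exact,
// case-insensitive lookup from a character name to the character.
public static class NameLookup
{
    public static Dictionary<string, int> Build(IEnumerable<KeyValuePair<int, string>> names)
    {
        var byName = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        foreach (var pair in names)
        {
            // Keep the first code point seen if a name ever repeats in the input.
            if (!byName.ContainsKey(pair.Value))
            {
                byName[pair.Value] = pair.Key;
            }
        }
        return byName;
    }

    // Returns the character as a string (code points above U+FFFF need two
    // UTF-16 chars), or null when the name is unknown.
    public static string FindChar(Dictionary<string, int> byName, string name)
    {
        return byName.TryGetValue(name, out var codePoint)
            ? char.ConvertFromUtf32(codePoint)
            : null;
    }
}
```

The result is a `string` rather than a `char` because a single `char` cannot hold characters outside the Basic Multilingual Plane, such as most emoji. The input must not contain surrogate code points, since `char.ConvertFromUtf32` rejects them.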