gmamaladze / trienet

.NET Implementations of Trie Data Structures for Substring Search, Auto-completion and Intelli-sense. Includes: patricia trie, suffix trie and a trie implementation using Ukkonen's algorithm.
MIT License

Some issue with Unicode characters, maybe #5

Open Siderite opened 6 years ago

Siderite commented 6 years ago

Got an exception

    System.ArgumentOutOfRangeException: startIndex cannot be larger than length of string.
    Parameter name: startIndex
       at System.String.Substring(Int32 startIndex, Int32 length)
       at Gma.DataStructures.StringSearch.UkkonenTrie`1.TestAndSplit(Node`1 inputs, String stringPart, Char t, String remainder, T value)
       at Gma.DataStructures.StringSearch.UkkonenTrie`1.Update(Node`1 inputNode, String stringPart, String rest, T value)
       at Gma.DataStructures.StringSearch.UkkonenTrie`1.Add(String key, T value)
       at TPB.Business.PirateBayDumpProcessor.Process(FileInfo file) in D:_Projects\TPB\TPB.Business\PirateBayDumpProcessor.cs:line 57
       at TPB.ConsoleTester.Program.Main(String[] args) in D:_Projects\TPB\TPB.ConsoleTester\Program.cs:line 12

when trying (pun not intended): trie.Add(entry.Name, entry); where entry.Name was "Tjockare än vatten (Thicker Than Water) - S02 E08 - 720p x265 H"

af-mst commented 5 years ago

Hi,

we have a similar issue. The Add was working all the time until someone added a word with an "ss" to the database. After some data research we found out that we normally add those words with the character "ß" (we are from Germany, where "ss" and "ß" are interchangeable ;)). So we pulled the code and debugged it.

The issue appeared here:

            var newEdge = new Edge<T>(remainder, newNode);
            e.Label = e.Label.Substring(remainder.Length);
            newNode.AddEdge(e.Label[0], e); // !!! HERE !!!
            s.AddEdge(t, newEdge);

(UkkonenTrie.cs -> Line: 207)

Word: "walross". The remainder at that point is "oss" and e.Label is "oße", so "e.Label = e.Label.Substring(remainder.Length);" results in an empty string instead of "e". The next line, "newNode.AddEdge(e.Label[0], e);", then fails with an index-out-of-range exception.

I guess that you internally transform the "ß" to "ss"? Or that the code is interpreting the "ss" as "ß"? Either way, the code wants to use the "oss" node for the "oß" value :(
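A minimal sketch of one way to guard the split described above, assuming the prefix match in front of it is culture-sensitive (the single-argument string.StartsWith(string) overload in .NET uses the current culture). The class and method names are hypothetical and not part of trienet. The idea is to accept the split only when the remainder is an exact, ordinal prefix of the label, so that Substring(remainder.Length) cannot cut at the wrong position.

```csharp
using System;

// Hypothetical helper, not part of trienet.
public static class LabelSplit
{
    // Tries to split label into remainder + suffix using an ordinal
    // (char-by-char) prefix check. Returns false when remainder is not
    // an exact prefix of label; suffix is then set to the empty string.
    public static bool TryGetSuffix(string label, string remainder, out string suffix)
    {
        if (label.StartsWith(remainder, StringComparison.Ordinal))
        {
            suffix = label.Substring(remainder.Length);
            return true;
        }

        suffix = string.Empty;
        return false;
    }
}
```

With the values from the debugging session, LabelSplit.TryGetSuffix("oße", "oss", out var suffix) returns false, because "ß" and "s" are different chars under an ordinal comparison. The caller then never reaches e.Label[0] on an empty label.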

Our current workaround is to replace all "ß" with "ss" and that's it, but it has annoying implications.

Thank you

Kind Regards

a7744hsc commented 3 years ago

This issue seems to be caused by globalization. I solved it by adding the following runtime option:

    {
      "runtimeOptions": {
        "configProperties": {
          "System.Globalization.Invariant": true
        }
      }
    }

UPDATED

In my case, the root cause of this issue is that, at least in the "en-US" and "中文(中国)" cultures, "ANYSTR".StartsWith("ANYSTR\u200B") returns True. This happens on Linux (in my case Ubuntu 18.04) but not on Windows 10.
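To check whether a given key is affected on a particular machine, a small probe along these lines may help. It is only a sketch: the PrefixProbe class and its method names are made up for illustration. It compares the culture-sensitive result with the ordinal one for the same inputs.

```csharp
using System;

// Hypothetical diagnostic helper, not part of trienet.
public static class PrefixProbe
{
    // Char-by-char comparison; independent of culture and platform.
    public static bool OrdinalStartsWith(string text, string prefix)
    {
        return text.StartsWith(prefix, StringComparison.Ordinal);
    }

    // Linguistic comparison using the current culture; the result can
    // differ between platforms and globalization settings.
    public static bool CultureStartsWith(string text, string prefix)
    {
        return text.StartsWith(prefix, StringComparison.CurrentCulture);
    }

    // True when the two comparison modes give different answers.
    public static bool Disagree(string text, string prefix)
    {
        return OrdinalStartsWith(text, prefix) != CultureStartsWith(text, prefix);
    }
}
```

PrefixProbe.OrdinalStartsWith("ANYSTR", "ANYSTR\u200B") is always false, since the prefix is one char longer than the text. If PrefixProbe.Disagree("ANYSTR", "ANYSTR\u200B") returns true on a machine, that machine shows the behaviour described above.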