takawitter / trie4j

PATRICIA, Double Array, LOUDS Trie implementations for Java
Apache License 2.0
174 stars 31 forks source link

Question: Substring search possible? #34

Closed sanjeevgour closed 7 years ago

sanjeevgour commented 7 years ago

Is is possible to search a word contained within some text. For example, search the word "world" in the string "theworldisaniceplace"?

takawitter commented 7 years ago

try this:

        Trie t = new PatriciaTrie("world");
        String target = "theworldisaniceplace";
        StringBuilder ret = new StringBuilder();
        t.findLongestWord(target, 0, target.length(), ret);
        System.out.println("found: " + ret);
sanjeevgour commented 7 years ago

Thanks, that works great, appreciate your response. Few more questions- -Can I say stop and return at first match whether it is shortest or longest doesn't matter? -Can I get all matching words from the dictionary I gave as input. For example, if my earlier dictionary also contained the word "nice", I would like to get a list containing "world" and "nice". -Is it possible to return the index of the words found in the input text?

These are all use cases I have for my application.

takawitter commented 7 years ago

-Can I say stop and return at first match whether it is shortest or longest doesn't matter?

Try findShortestWord method. That stops search when the first shortest word found while findLongestWord method keeps searching if longer word exists in a dictionary.

-Can I get all matching words from the dictionary I gave as input. For example, if my earlier dictionary also contained the word "nice", I would like to get a list containing "world" and "nice".

You have to write a loop using find***Word method like this:

        Trie t = new PatriciaTrie("world", "nice");
        String target = "theworldisaniceplace";
        List<CharSequence> found = new ArrayList<>();
        int pos = 0;
        StringBuilder ret = new StringBuilder();
        while((pos = t.findLongestWord(target, pos, target.length(), ret)) != -1){
            found.add(ret);
            pos += ret.length();
            ret = new StringBuilder();
        }
        System.out.println(found);

-Is it possible to return the index of the words found in the input text?

Use the returned value from find***Word method. That indicates the index of found word in input text.

sanjeevgour commented 7 years ago

That solved most of my use cases, thanks for quick response and help.

sanjeevgour commented 7 years ago

Can I make the search case insensitive?

takawitter commented 7 years ago

You're very welcome:)

Can I make the search case insensitive?

No. To do case-insensitive search, you have to store capitalized string and use capitalized query string, or implement a new method for Trie4J in saving original case.

sanjeevgour commented 7 years ago

Ok, case-change is what I am doing currently. Thanks.

sanjeevgour commented 7 years ago

One more questions. Among the different tree implementations you have, which one would be the least expensive processing wise, in other words which one would take least CPU cycles?

takawitter commented 7 years ago

Depends on the function. DoubleArray takes the least computation power for contains method. But it takes more power for predictiveSearch method than PatriciaTrie.

sanjeevgour commented 7 years ago

I am using InlinedTailLOUDSTrie and mostly use findShortestWord and find the findLongestWord methods. It appears that this combination takes significant CPU cycles. I am wondering if I could tune it.

takawitter commented 7 years ago

DoubleArrays use less CPU time than LOUDSTries in find????Word method. According to the current source code, both DoubleArrays and LOUDSTries seem to have little room to optimize speed by reducing object creation.

takawitter commented 7 years ago

I'm benchmarking and optimizing the findShortestWord methods of some tries.

1942.908 ms in 100000call: org.trie4j.patricia.TailPatriciaTrie
2408.908 ms in 100000call: org.trie4j.doublearray.DoubleArray
 178.622 ms in 100000call: org.trie4j.doublearray.OptDoubleArray
2716.288 ms in 100000call: org.trie4j.louds.InlinedTailLOUDSTrie
 619.349 ms in 100000call: org.trie4j.louds.OptInlinedTailLOUDSTrie

The result shows that the findShortestWord methods of DoubleArray and InlinedTailLOUDSTrie can be significantly improved.

takawitter commented 7 years ago

Trie4J 0.9.6 with improvements above is out. Here is the latest bench results: https://github.com/takawitter/trie4j/blob/master/trie4j/benchmarks/20170720-2.6GHzCorei7-OSX10.11.6-jdk1.8.0_111.findWord.txt All improvements for Opt*** classes were integrated to non-Opt classes of 0.9.6 (i.e. improvements for OptDoubleArray were integrated to DoubleArray).

sanjeevgour commented 7 years ago

These are great improvements, almost 5 time. :+1: