npgall / concurrent-trees

Concurrent Radix and Suffix Trees for Java
Apache License 2.0
511 stars 82 forks source link

New types of tree: ConcurrentPermutermTree, ConcurrentWildcardTree for wildcard queries #3

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
It would be useful to support wildcard queries.

Two approaches to be investigated (both of which will be tracked in this issue):

(1) A permuterm index on top of the ConcurrentRadixTree. This would support 
queries such as "<prefix>*<suffix>" on a single tree. It may be more memory 
efficient than a hash-dictionary approach. See: 
http://nlp.stanford.edu/IR-book/html/htmledition/permuterm-indexes-1.html

(2) A composite of a ConcurrentRadixTree and a ConcurrentReversedRadixTree. One 
tree would support prefix lookup, the other suffix lookup. Query 
"prefix*suffix" may return the intersection of the results from both trees, 
after some post-filtering. This second approach however, is near the territory 
of a query engine on top of multiple indexes, so if implemented would not 
belong in this project, but in http://code.google.com/p/cqengine/

Example usage for (1) would be:

public static void main(String[] args) {
    PermutermTree<Integer> tree = new ConcurrentPermutermTree<Integer>(new DefaultCharArrayNodeFactory());

    tree.put("TEST", 1);
    tree.put("TOAST", 2);
    tree.put("TEAM", 3);

    System.out.println("Keys matching 'T*T': " + Iterables.toString(tree.getKeysMatching("T", "T"))); // prefix, suffix
}

Output would be:
    Keys matching 'T*T': [TOAST, TEST]

Original issue reported on code.google.com by ni...@npgall.com on 24 Mar 2013 at 10:19

GoogleCodeExporter commented 9 years ago
Can a ConcurrentWildcardTree be used for url routing with parameters? like 

/home/products/*/tab/*?color=* ?

Original comment by respriet...@gmail.com on 16 Jan 2015 at 9:19

GoogleCodeExporter commented 9 years ago
I'm planning to take a different/better approach than discussed above to 
support multiple wildcard queries, when I eventually get time for this.

Basically a potentially better way to support multiple wildcard lookup, is to 
set up a tree of InvertedRadixTrees.
Each InvertedRadixTree has a getKeysPrefixing method, which returns keys in the 
tree which match an input string with a trailing wildcard.

If you set up a tree of these InvertedRadixTree, then you can accelerate 
lookups where your url patterns contain any number of wildcards.

Original comment by ni...@npgall.com on 19 Jan 2015 at 1:19

w4nderlust commented 8 years ago

This feature would be extremely useful for many use cases, misspell detection being one of them.

gembin commented 8 years ago

+1, any plan to implement this feature?

npgall commented 8 years ago

I have not had time to work on this recently, but I'll try to take a look at this again in the next few weeks.

gembin commented 7 years ago

any progress on wildcard tree?

npgall commented 7 years ago

Not really I'm adraid. I have been distracted with other ideas (In CQEngine mainly) so this is on the back burner for now. Hopefully I will get time to revisit this.

jbduncan commented 3 years ago

@npgall I'm interested in this feature, as I'm developing a period-of-concept for an IntelliJ-style auto-completion feature in the app my team is working on, which I'm hoping to do without hacking things too much, or resorting to copying code from IntelliJ's codebase or deploying Elasticsearch or Apache Solr.

The way I imagine this auto-completion feature working is, if the app has this list of possible completion suggestions built in:

apple
boat
cat

Then when the customer enters one of the following search queries into the app's search bar, then the respective completion suggestions are shown below the search bar:

  1. Search query: "b" -> Suggestions: ["boat"]
  2. Search query: "pl" -> Suggestions: ["apple"]
  3. Search query: "ct" -> Suggestions: ["cat"]
  4. Search query: "t" -> Suggestions: ["boat", "cat"]
  5. Search query: "a" -> Suggestions: ["apple", "boat", "cat"]

My app would translate queries like "pl" into "*p*l*", so as to be compatible with the proposed ConcurrentWildcardTree.

I've tried ConcurrentRadixTree, but it doesn't work as expected for queries (2)-(5). I've also tried ConcurrentSuffixTree's suffix and "contains" capabilities, but again it doesn't work for all queries.

I've even tried Apache Lucene, but it doesn't work because my app is Android-based and Lucene isn't compatible with Android.