neer1304 / concurrent-trees

Automatically exported from code.google.com/p/concurrent-trees

Expose API to get longest prefix match / "closest" keys #2

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago

Please add support for querying the longest prefix match (this is not currently 
exposed in the public API). A variation which only matches if the key is in fact 
truly a prefix of the query would also be useful; otherwise the caller would have 
to run an additional prefix.startsWith(key) check. For example, if the tree only 
contains "foo", it's the longest prefix match for anything, but in the use case I 
have, I would only want to match queries of the form "foo*".

Original issue reported on code.google.com by phraktle on 19 Nov 2012 at 10:33

GoogleCodeExporter commented 9 years ago
As we discussed separately, this would be a great addition to the API, thanks 
for the suggestion.

I guess this would be an enhancement for auto-complete in search boxes. If the 
user narrowed their search too far, such that it eliminated all suggestions, 
this would fall back to the "closest" matches.

The requirement is basically to provide a public API to this currently 
non-public functionality: 
http://code.google.com/p/concurrent-trees/source/browse/concurrent-trees/trunk/src/main/java/com/googlecode/concurrenttrees/radix/ConcurrentRadixTree.java#727

To clarify, is this the behaviour you want?

    RadixTree<Integer> tree = new ConcurrentRadixTree<Integer>(new DefaultCharArrayNodeFactory());

    tree.put("Ford Focus", 1);
    tree.put("Ford Mondeo", 2);
    tree.put("BMW M3", 3);

    System.out.println(tree.getClosestKeys("Ford Fiesta")); // prints [Ford Focus, Ford Mondeo]

Could you also provide a code example like that above, to clarify the variation 
you mention? Thanks.

Original comment by ni...@npgall.com on 19 Nov 2012 at 10:48

GoogleCodeExporter commented 9 years ago
Sticking with the example above, for my particular use case I would want 
something like:

tree.getPrefixMatch("Jaguar")   => null

... whereas there might be some legitimate use cases where you really need the 
closest match, which is always non-null unless the tree is empty. If you design 
the API that way (which is the approach this Patricia Trie implementation chose: 
http://code.google.com/p/patricia-trie/wiki/Examples), I would need to perform 
additional string matching on the results.

Not sure what's an elegant way to do all this without bloating the API too much.

Original comment by phraktle on 19 Nov 2012 at 10:58

GoogleCodeExporter commented 9 years ago
Is the requirement that the "closest" matching key must match at least the 
first character of the key provided? That would seem sensible default behaviour 
(otherwise keys which really don't match at all would be returned).

Original comment by ni...@npgall.com on 19 Nov 2012 at 11:03

GoogleCodeExporter commented 9 years ago
For my particular use-case, for:

String matchingKey = tree.getPrefixMatchKey(prefix);

I only need matches where:

assert prefix.startsWith(matchingKey)

So, for a tree containing the single item "foo", I need matches like:

"foo"  => "foo"
"foo bar" => "foo"
"bar" => null

Of course this can be done as post-filtering, but that can be relatively costly. 
There might also be different use cases for other users, where a different 
strategy would be appropriate.
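
For illustration, a minimal sketch of that post-filtering (the helper name is 
invented here, and it assumes a getClosestKeys(...) method along the lines 
discussed above):

    import com.googlecode.concurrenttrees.radix.RadixTree;

    public class PrefixMatchSketch {

        // Hypothetical helper, not part of the library: post-filters the
        // proposed getClosestKeys(...) results so that only a key which is
        // truly a prefix of the query is returned.
        static CharSequence getPrefixMatchKey(RadixTree<Integer> tree, String query) {
            for (CharSequence key : tree.getClosestKeys(query)) {
                if (query.startsWith(key.toString())) {
                    return key;   // e.g. "foo bar" => "foo"
                }
            }
            return null;          // e.g. "bar" => null
        }
    }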

Original comment by phraktle on 19 Nov 2012 at 11:15

GoogleCodeExporter commented 9 years ago
Hmm yes I see. That seems a slightly unusual use case. There might be a further 
use case, to restrict the matches returned to only the closest matches, but not 
their descendants. Or even to limit the number of matches returned.

I'll bear these use cases in mind when designing it. I am worried about API 
bloat, though. For example, due to concurrency this will require 3 methods - 
getClosestMatchingKeys, getValuesForClosestMatchingKeys, 
getKeyValuePairsForClosestMatchingKeys - because it's not enough to make two 
separate calls, first to retrieve the keys and then to look up the values of 
those keys, as the contents of the tree might have changed in the meantime. 
Nonetheless, I think this would be worthwhile.

I'll avoid the word "prefix" in the method names to avoid confusion: the 
existing methods retrieve keys from the tree by prefix, whereas this feature 
would additionally match using prefixes of the supplied key. "Closest" seems 
intuitive enough.

The methods might accept an enum parameter to specify the "closest match 
algorithm" required, to satisfy the various use cases.
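
As a purely illustrative sketch of that shape (none of these names are final, 
and the KeyValuePair holder is defined inline just to keep the sketch 
self-contained):

    import java.util.Set;

    // Illustrative only: an enum selecting the "closest match algorithm",
    // plus the three query variants. Three methods are needed because keys
    // and their values must be retrieved in a single atomic traversal; two
    // separate calls could observe different versions of the tree.
    enum ClosestMatchStrategy { ALL_CLOSEST_KEYS, KEYS_THAT_ARE_PREFIXES_OF_INPUT }

    interface KeyValuePair<O> {
        CharSequence getKey();
        O getValue();
    }

    interface ClosestMatchQueries<O> {
        Set<CharSequence> getClosestMatchingKeys(CharSequence key, ClosestMatchStrategy strategy);
        Set<O> getValuesForClosestMatchingKeys(CharSequence key, ClosestMatchStrategy strategy);
        Set<KeyValuePair<O>> getKeyValuePairsForClosestMatchingKeys(CharSequence key, ClosestMatchStrategy strategy);
    }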

Original comment by ni...@npgall.com on 19 Nov 2012 at 11:36

GoogleCodeExporter commented 9 years ago

It's indeed difficult to accommodate all use-cases in the wild with a concise 
API. I think this is the reason why Google's Guava still does not have a Trie 
implementation. See this thread: 
http://code.google.com/p/guava-libraries/issues/detail?id=10

It might be worth considering passing in a matching strategy as a parameter, 
which could reduce the number of methods and provide some extensibility.

e.g. tree.matchKeys( PrefixTreeMatchers.CLOSEST, key )
or tree.matchKeys( new MyLittleMatcher(...), key )

But as I'm not familiar with the implementation, I'm not sure whether this would 
really help. In the end it could be left to the caller as post-filtering, but 
performing these matches directly in the trie structure would be optimal.
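
A hypothetical shape for such a pluggable matcher might be (all names here are 
invented for illustration; this is not an existing API):

    // Illustrative only: a strategy object deciding which keys in the tree
    // count as a "match" for a given query.
    interface KeyMatcher {
        boolean matches(CharSequence candidateKey, CharSequence query);
    }

    final class PrefixTreeMatchers {

        // Matches only keys that are true prefixes of the query, per the use case above.
        static final KeyMatcher PREFIX_OF_QUERY = new KeyMatcher() {
            public boolean matches(CharSequence candidateKey, CharSequence query) {
                return query.toString().startsWith(candidateKey.toString());
            }
        };

        private PrefixTreeMatchers() {}
    }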

Original comment by phraktle on 19 Nov 2012 at 11:51

GoogleCodeExporter commented 9 years ago
ConcurrentRadixTree currently uses a similar pattern internally for traversal, 
but it's a couple of steps away from the public API methods.

It would be better to suit the common use cases well than to have a complex API 
that tries to suit them all. I'll think about this; hopefully there's a middle 
ground. I hope to get to this in the next week or two.

Original comment by ni...@npgall.com on 19 Nov 2012 at 4:28

GoogleCodeExporter commented 9 years ago
Added some groundwork for getClosestKeys() to trunk. It works, but the 
behaviour is not final, as some of the required functionality still needs to be 
pinned down.

This unit test demonstrates the current behaviour and some of the edge cases:

    @Test
    public void testGetClosestKeys() {
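        // nodeFactory is a field of the original test class; presumably a
        // DefaultCharArrayNodeFactory, as in the earlier example in this thread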
        ConcurrentRadixTree<Integer> tree = new ConcurrentRadixTree<Integer>(nodeFactory);
        tree.put("COD", 1);
        tree.put("CODFISH", 2);
        tree.put("COFFEE", 3);

        //    ○
        //    └── ○ CO
        //        ├── ○ D (1)
        //        │   └── ○ FISH (2)
        //        └── ○ FFEE (3)

        assertEquals("[COD, CODFISH, COFFEE]", tree.getClosestKeys("COW").toString());
        assertEquals("[COD, CODFISH, COFFEE]", tree.getClosestKeys("CX").toString());
        assertEquals("[COD, CODFISH]", tree.getClosestKeys("COD").toString());
        assertEquals("[COFFEE]", tree.getClosestKeys("COF").toString());
        assertEquals("[]", tree.getClosestKeys("DO").toString());
        assertEquals("[CODFISH]", tree.getClosestKeys("CODFISHES").toString());
    }

Points:
* For input "COW", perhaps only COD and COFFEE could be returned, i.e. the two (equally distant) closest matches to COW, and not additionally the descendants of those keys.
* For input "COW", the implementation does not satisfy the use case discussed above: "COW".startsWith("COD") == false
* For input "CODFISHES", the implementation satisfies the use case discussed above: "CODFISHES".startsWith("CODFISH") == true

Given the time of year, next month is a more likely timeframe for the remaining 
items :)

Original comment by ni...@npgall.com on 19 Dec 2012 at 11:40

GoogleCodeExporter commented 9 years ago
This is now fully implemented in trunk. It was implemented as part of some other 
major refactorings to improve performance, and those refactorings incidentally 
allowed a fairly elegant solution to the problem of supporting the various 
matching patterns discussed above.

I have reimplemented all queries on the trees to use lazy evaluation. This 
changes the public APIs of all trees slightly, so I've bumped the major version 
number of the project, and this feature will now be in version 2.0.

The behaviour of getClosestKeys() is the same as in December, above. However, 
instead of returning a Set of closest keys, the API now returns an Iterable of 
closest keys.

The difference is that, unlike a Set, the Iterable does not actually contain any 
elements. It is only when the application starts iterating through the Iterable 
that each next element in turn is computed. The act of iterating through results 
drives each step of the traversal algorithms to locate the next result to 
return. This is more memory efficient, because no "results" are actually stored; 
they are returned as soon as they are found. It also reduces latency, because 
the first result is returned as soon as it is found, rather than after all 
results from 1 to N have been computed. If a lot of keys would previously have 
been matched, this could reduce latency significantly. And if the application 
only requires the first 5 results and then stops iterating, then only 5 elements 
will have been computed.

For this particular feature of computing the closest keys/longest prefix match, 
it also means that the application can apply arbitrary filtering logic to the 
returned keys inside its iteration loop, and that approach will be as efficient 
as if the traversal algorithm itself were implementing the filtering.

So getClosestKeys() will still return keys as discussed in December, but with 
this approach it will be perfectly efficient to perform any additional filtering 
inside the iteration loop.
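
For example, here is a minimal sketch of such filtering against the 2.0-style 
lazy API (import paths are assumed; the tree contents are borrowed from the 
December example above):

    import com.googlecode.concurrenttrees.radix.ConcurrentRadixTree;
    import com.googlecode.concurrenttrees.radix.RadixTree;
    import com.googlecode.concurrenttrees.radix.node.concrete.DefaultCharArrayNodeFactory;

    public class LazyClosestKeysExample {
        public static void main(String[] args) {
            RadixTree<Integer> tree = new ConcurrentRadixTree<Integer>(new DefaultCharArrayNodeFactory());
            tree.put("COD", 1);
            tree.put("CODFISH", 2);
            tree.put("COFFEE", 3);

            String query = "CODFISHES";
            // getClosestKeys() returns a lazy Iterable: each key is computed only
            // as the loop requests it, so breaking out early avoids computing the rest.
            for (CharSequence key : tree.getClosestKeys(query)) {
                // Application-level filtering: accept only keys which are true prefixes of the query
                if (query.startsWith(key.toString())) {
                    System.out.println(key);   // prints: CODFISH
                    break;                      // stop after the first acceptable match
                }
            }
        }
    }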

2.0 should be officially released in the next week or two.

Original comment by ni...@npgall.com on 26 Feb 2013 at 4:24

GoogleCodeExporter commented 9 years ago
Interesting! How does the iterator interact with concurrent changes? For 
example, is there a chance of getting the same match twice?

Original comment by phraktle on 26 Feb 2013 at 12:37

GoogleCodeExporter commented 9 years ago
2.0 is now officially released. The non-maven jar is available on the Download 
page, and the maven jars should sync to maven central in the next few hours.

Regarding concurrent changes: changes are applied to the tree atomically, and 
the traversal algorithms are oblivious to those changes; they always traverse a 
consistent version of the tree. So no, concurrent changes should not cause the 
same key to be returned more than once. (Unless there is a bug; the changes have 
good test coverage, but please report it if you find that happening.)

There is an exception to the above: in the *suffix* tree (note: not the radix 
trees), I found that with the changes it was possible for the same key to be 
returned more than once, but it wasn't due to concurrency. Basically, if the 
tree contains the keyword 'BANANA' and you call tree.getKeysContaining("AN"), 
previously the key 'BANANA' would have been returned once, because it was 
returned in a Set. The Iterable, however, would return it twice - because 
'BANANA' actually contains the substring 'AN' twice.
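
For illustration, the 'BANANA' case above looks roughly like this (class and 
import names assumed from the 2.0 API):

    import com.googlecode.concurrenttrees.radix.node.concrete.DefaultCharArrayNodeFactory;
    import com.googlecode.concurrenttrees.suffix.ConcurrentSuffixTree;

    public class SuffixTreeDuplicatesExample {
        public static void main(String[] args) {
            ConcurrentSuffixTree<Integer> tree = new ConcurrentSuffixTree<Integer>(new DefaultCharArrayNodeFactory());
            tree.put("BANANA", 1);

            // 'BANANA' contains the fragment "AN" twice, so a naive lazy traversal
            // could yield it twice; the internal deduplication suppresses that.
            for (CharSequence key : tree.getKeysContaining("AN")) {
                System.out.println(key);   // prints BANANA once, as in 1.0
            }
        }
    }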

I added some lightweight internal filtering in the suffix tree (based on an 
algorithm in CQEngine [1]) to suppress duplicates like this, so the results 
returned should be the same as in concurrent-trees 1.0.

I've added a ReleaseNotes page and documented the changes in 
FrequentlyAskedQuestions in the wiki tab.

[1] MATERIALIZE deduplication algorithm used in CQEngine: http://code.google.com/p/cqengine/wiki/DeduplicationStrategies#Materialize_Strategy

Original comment by ni...@npgall.com on 26 Feb 2013 at 11:07