Closed GoogleCodeExporter closed 9 years ago
As we discussed separately, this would be a great addition to the API, thanks
for the suggestion.
I guess this would be an enhancement for auto-complete in search boxes. If the
user narrowed their search too far, such that it eliminated all suggestions,
this would fall back to the "closest" matches.
The requirement is basically to provide a public API to this currently
non-public functionality:
http://code.google.com/p/concurrent-trees/source/browse/concurrent-trees/trunk/s
rc/main/java/com/googlecode/concurrenttrees/radix/ConcurrentRadixTree.java#727
To clarify, is this the behaviour you want?
RadixTree<Integer> tree = new ConcurrentRadixTree<Integer>(new DefaultCharArrayNodeFactory());
tree.put("Ford Focus", 1);
tree.put("Ford Mondeo", 2);
tree.put("BMW M3", 3);
System.out.println(tree.getClosestKeys("Ford Fiesta")); // prints [Ford Focus, Ford Mondeo]
Could you also provide a code example like that above, to clarify the variation
you mention? Thanks.
Original comment by ni...@npgall.com
on 19 Nov 2012 at 10:48
Sticking with the example above, for my particular use case I would want
something like:
tree.getPrefixMatch("Jaguar") => null
... whereas there might be some legitimate use cases where you really need the
closest match, which is always non-null, unless the tree is empty. If you
design the API that way (which is how this Patricia Trie implementation chose
to: http://code.google.com/p/patricia-trie/wiki/Examples), I would need to
perform additional string matching on the results.
Not sure what's an elegant way to do all this without bloating the API too much.
Original comment by phraktle
on 19 Nov 2012 at 10:58
Is the requirement that the "closest" matching key must match at least the
first character of the key provided? That would seem sensible default behaviour
(otherwise keys which really don't match at all would be returned).
Original comment by ni...@npgall.com
on 19 Nov 2012 at 11:03
For my particular use-case, for:
String matchingKey = tree.getPrefixMatchKey(prefix);
I only need matches where:
assert prefix.startsWith(matchingKey)
So, for a tree containing the single item "foo", I need matches like:
"foo" => "foo"
"foo bar" => "foo"
"bar" => null
Of course this can be done as post-filtering, but that can be relatively
costly. Of course there might be different use-cases for other users, where a
different strategy would be appropriate.
Original comment by phraktle
on 19 Nov 2012 at 11:15
Hmm yes I see. That seems a slightly unusual use case. There might be a further
use case, to restrict the matches returned to only the closest matches, but not
their descendants. Or even to limit the number of matches returned.
I'll bear these use cases in mind when designing it. I am worried about API
bloat. For example, due of concurrency, this will require 3 methods -
getClosestMatchingKeys, getValuesForClosestMatchingKeys,
getKeyValuePairsForClosestMatchingKeys - because it's not enough to make two
separate calls, to retrieve the keys, and then to look up the values of those
keys, as the contents of the tree might have changed in the meantime.
Nonetheless, I think this would be worthwhile.
I'll avoid the word "prefix" in the method names to avoid confusion, because
the existing methods do prefix retrieval from the tree, whereas in this case
that would apply, but also using prefixes from the key. "Closest" seems
intuitive enough.
The methods might accept an enum parameter to specify the "closest match
algorithm" required, to satisfy the various use cases.
Original comment by ni...@npgall.com
on 19 Nov 2012 at 11:36
It's indeed difficult to accommodate all use-cases in the wild with a concise
API. I think this is the reason why Google's Guava still does not have a Trie
implementation. See this thread:
http://code.google.com/p/guava-libraries/issues/detail?id=10
It might be worth considering passing in a matching strategy as a parameter,
which could reduce the number of methods and provide some extensibility.
eg. tree.matchKeys( PrefixTreeMatchers.CLOSEST, key )
or tree.matchKeys( new MyLittleMatcher(...), key ).
But as I'm not familiar with the implementation, not sure if this would really
help. In the end it could be left as post-filtering to the caller, but
performing these matches directly in the trie-structure would be optimal.
Original comment by phraktle
on 19 Nov 2012 at 11:51
ConcurrentRadixTree currently uses a similar pattern internally for traversal,
but it's a couple of steps away from the public API methods.
It would be better to suit well the common use cases than to have a complex API
to suit them all. I'll think about this, hopefully there's a middle ground. I
hope to get to this in the next week or two.
Original comment by ni...@npgall.com
on 19 Nov 2012 at 4:28
Added some groundwork for getClosestKeys() to trunk. It works, but the
behaviour is not final as some of the required functionality needs to be pinned
down.
This unit tests demonstrates the current behaviour and some of the edge cases:
@Test
public void testGetClosestKeys() {
ConcurrentRadixTree<Integer> tree = new ConcurrentRadixTree<Integer>(nodeFactory);
tree.put("COD", 1);
tree.put("CODFISH", 2);
tree.put("COFFEE", 3);
// ○
// └── ○ CO
// ├── ○ D (1)
// │ └── ○ FISH (2)
// └── ○ FFEE (3)
assertEquals("[COD, CODFISH, COFFEE]", tree.getClosestKeys("COW").toString());
assertEquals("[COD, CODFISH, COFFEE]", tree.getClosestKeys("CX").toString());
assertEquals("[COD, CODFISH]", tree.getClosestKeys("COD").toString());
assertEquals("[COFFEE]", tree.getClosestKeys("COF").toString());
assertEquals("[]", tree.getClosestKeys("DO").toString());
assertEquals("[CODFISH]", tree.getClosestKeys("CODFISHES").toString());
}
Points:
* for input "COW", perhaps only COD and COFFEE could be returned, i.e. the two
(equally distant) closest matches to COW, and not additionally descendants of
those keys.
* for input "COW", the implementation does not satisfy the use case discussed
above: "COW".startsWith("COD") == false
* for input "CODFISHES", the implementation satisfies the use case discussed
above: "CODFISHES".startsWith("CODFISH") == true
Given the time of year, next month is a more likely timeframe for the remaining
items :)
Original comment by ni...@npgall.com
on 19 Dec 2012 at 11:40
This is now fully implemented in trunk. It is implemented as part of some other
major refactorings to improve performance, but those refactorings incidentally
allowed a fairly elegant solution to this problem of allowing various matching
patterns above.
I have reimplemented all queries on the trees, to use lazy evaluation. This
changes the public APIs of all trees slightly, so I've bumped the major version
number of the project, and this feature will now be in version 2.0.
The behaviour of getClosestKeys(), is the same as in December, above. However,
instead of returning a Set of closest keys, the API now returns an Iterable of
closest keys.
The difference is that unlike a Set, the Iterable does not actually contain any
elements. It is only when the application starts iterating through the
iterable, that each next element in turn is computed. The act of iterating
through results, drives each step of the traversal algorithms, to locate the
next result to return. This is more memory efficient, because no "results" are
actually stored; they are returned as soon as they are found. It also reduces
latency, because the first result is returned as soon as it is found, rather
than after all results from 1 to N have been computed. If a lot of keys would
previously have been matched, this could reduce latency significantly. If the
application only requires the first 5 results and then stops iterating, then
only 5 elements would have been computed.
For this particular feature, of computing the closest keys/longest prefix
match, it also means that the application can apply arbitrary filtering logic
to the keys returned inside its iteration loop, and that approach will be as
efficient as if the traversal algorithm itself was implementing the filtering.
So getClosestKeys() will still return keys as discussed in December, but with
this approach it will be perfectly efficient to perform any additional
filtering inside the iteration loop.
2.0 should be officially released in the next week or two.
Original comment by ni...@npgall.com
on 26 Feb 2013 at 4:24
Interesting! How does the iterator relate to concurrent changes? For example,
is there a chance to get the same match twice?
Original comment by phraktle
on 26 Feb 2013 at 12:37
2.0 is now officially released. The non-maven jar is available on the Download
page, and the maven jars should sync to maven central in the next few hours.
Regarding concurrent changes, changes are applied to the tree atomically, and
the traversal algorithms are oblivious to those changes, they always traverse a
consistent version of the tree. So no, these changes should not cause the same
key to be returned more than once. (Unless there is a bug; however the changes
have good test coverage - but please report it if you find that happening.)
There is an exception to the above: in the *suffix* tree (note: not the radix
trees), I found that with the changes, it was possible to have the same key
returned more than once, but it wasn't due to concurrency. Basically, if the
tree contains the keyword 'BANANA', and you call tree.getKeysContaining("AN"),
previously, key 'BANANA' would have been returned once because it was returned
in a Set. However the Iterable, would return it twice. But it was because
'BANANA' actually contains the substring 'AN' twice.
I added some lightweight internal filtering in the suffix tree (based on an
algorithm in CQEngine [1]), to suppress duplicates like this, so the results
returned should be the same as in concurrent-trees 1.0.
I've added a ReleaseNotes page and documented the changes in
FrequentlyAskedQuestions in the wiki tab.
[1] MATERIALIZE deduplication algorithm used in CQEngine:
http://code.google.com/p/cqengine/wiki/DeduplicationStrategies#Materialize_Strat
egy
Original comment by ni...@npgall.com
on 26 Feb 2013 at 11:07
Original issue reported on code.google.com by
phraktle
on 19 Nov 2012 at 10:33