npgall / concurrent-trees

Concurrent Radix and Suffix Trees for Java
Apache License 2.0
510 stars 82 forks source link

InvertedSuffixTree? #23

Open jacobg opened 8 years ago

jacobg commented 8 years ago

Hi,

I'm seeing a need for an InvertedSuffixTree. The scenario is where I have a list of rule suffix strings that I would insert into an InvertedSuffixTree, and then I receive input on a string where I'd like to know whether that input string ends in any of the rule suffixes.

I was wondering if not having an InvertedSuffixTree is because there's an alternative approach to this problem, or just that it's an uncommon problem that could be implemented some day?

Thanks!

jacobg commented 8 years ago

I think the following code implements it using InvertedRadixTree:

class InvertedSuffixTree {

    private static final Object VALUE = new Object();

    private final InvertedRadixTree<Object> _invertedRadixTree;

    InvertedSuffixTree(Collection<String> suffixes) {

        _invertedRadixTree = new ConcurrentInvertedRadixTree<>(new DefaultCharArrayNodeFactory());
        suffixes.forEach(suffix -> _invertedRadixTree.put(reverseLowerCase(suffix), VALUE));
    }

    boolean containsSuffixFor(String string) {

        return _invertedRadixTree.getKeysPrefixing(reverseLowerCase(string)).iterator().hasNext();

    }

    private String reverseLowerCase(String string) {

        return new StringBuilder(string.toLowerCase()).reverse().toString();
    }
}
npgall commented 8 years ago

I just don't think this question has come up before!

So I guess a tree like that, would be a combination of the ReversedRadixTree, and the InvertedRadixTree? So - an InvertedReversedRadixTree?

Although it would therefore deal with suffixes, we probably couldn't call it an InvertedSuffixTree though, because the SuffixTree is a different data structure. (It generates the suffixes of each input string internally; and I don't think we need to do that here?)

I'm not 100% certain I understand the use case though. What exactly are these "rule suffix"es?

If the problem can be solved by combining those two trees though, then yes indeed we could look at adding such a tree.

jacobg commented 8 years ago

Thanks Niall. I think I'm satisfied with the solution I pasted above.

The "rule suffixes" go like this: 1) Our software has a list of file path suffixes, e.g., ".exe", ".txt", or even "notepad.exe" without the absolute path starting with c:\ 2) We then receive many absolute file paths as input strings, and for each absolute file path, we want to see if it matches any of the suffixes. The result should be a list of absolute file paths that match any of the suffixes (although we don't care which suffixes, we do care which absolute paths match).

Because the suffixes are relatively fixed, while the absolute paths are variable inputs at high volume, I'd like to construct my tree from the suffixes rather than the input strings. Hope that makes sense.