npgall / concurrent-trees

Concurrent Radix and Suffix Trees for Java
Apache License 2.0
503 stars 82 forks source link

Build Status Maven Central

"A tree is a tree. How many more do you have to look at?"

―Ronald Reagan

Concurrent Trees

This project provides concurrent Radix Trees and concurrent Suffix Trees for Java.

Overview

A Radix Tree (also known as patricia trie, radix trie or compact prefix tree) is a space-optimized tree data structure which allows keys (and optionally values associated with those keys) to be inserted for subsequent lookup using only a prefix of the key rather than the whole key. Radix trees have applications in string or document indexing and scanning, where they can allow faster scanning and lookup than brute force approaches. Some applications of Radix Trees:

A Suffix Tree (also known as PAT tree or position tree) is an extension of a radix tree which allows the suffixes of keys to be inserted into the tree. This allows subsequent lookup using any suffix or fragment of the key rather than the whole key, and in turn this can support fast string operations or analysis of documents. Some applications of Suffix Trees:

The implementation in this project is actually a Generalized Suffix Tree.

Concurrency Support

All of the trees (data structures and algorithms) in this project are optimized for high-concurrency and high performance reads, and low-concurrency or background writes:

As such reading threads should never encounter latency due to ongoing writes or other concurrent readers.

Tree Design

The trees in this project support lock-free reads while allowing concurrent writes, by treating the tree as a mostly-immutable structure, and assembling the changes to be made to the tree into a patch, which is then applied to the tree in a single atomic operation.

Inserting an entry into Concurrent Radix Tree which requires an existing node within the tree to be split: tree-apply-patch.png

Tree Implementations

Feature matrix for tree implementations provided in this project, and lookup operations supported.

Tree Interface Implementation Key Equals (exact match) Key Starts With Key Ends With Key Contains Find Keywords In External Documents [1]
RadixTree ConcurrentRadixTree
ReversedRadixTree ConcurrentReversedRadixTree
InvertedRadixTree ConcurrentInvertedRadixTree
SuffixTree ConcurrentSuffixTree

[1] Scanning for Keywords in External Documents

ConcurrentInvertedRadixTree allows unseen documents to be scanned efficiently for keywords contained in the tree, and performance does not degrade as additional keywords are added.

Let d = number of characters in document, n = number of keywords, k = average keyword length

Keyword scanning approach Time Complexity (Number of character comparisons) Example: 10000 10-character keywords, 10000 character document
Naive document.contains(keyword) for every keyword O(d n k) 1,000,000,000 character comparisons
ConcurrentInvertedRadixTree O(d log(k)) 10,000 character comparisons (≤100,000 times faster)

Solver Utilities

Utilities included which solve problems using the included trees.

Solver Solves
LCSubstringSolver Longest common substring problem

Documentation and Example Usage

General Documentation

For more documentation see the documentation directory.

Example Usage

Usage in Maven and Non-Maven Projects

Concurrent-Trees is in Maven Central. See Downloads.

Related Projects

Project Status

As of writing (July 2019), version 2.6.1 of concurrent-trees is the latest release.

See Release Notes and Frequently Asked Questions for details.

concurrent-trees-downloads-june-2019.png

Report any bugs/feature requests in the Issues tab. For support please use the Discussion Group, not direct email to the developers.