robert-bor / aho-corasick

Java implementation of the Aho-Corasick algorithm for efficient string matching
Apache License 2.0
890 stars 348 forks source link

Improve the performance of org.ahocorasick.trie.Trie #65

Closed LukeButters closed 5 years ago

LukeButters commented 5 years ago

Hi

My profiler seems to show that a bunch of time is being spent in:

java.util.concurrent.LinkedBlockingDeque#add

inside of

inside of org.ahocorasick.trie.Trie.constructFailureStates();

LinkedList is a faster implementation of Queue than LinkedBlockingDeque, I am measuring it to be about 3-4 times faster. Could:

private void constructFailureStates() {
        final Queue<State> queue = new LinkedBlockingDeque<>();

be changed to:

private void constructFailureStates() {
        final Queue<State> queue = new LinkedList<>();

Screenshot_20190330_103233

YuyuZha0 commented 5 years ago

LinkedBlockingDeque is thread-safe while LinkedList is not, it is important to make sure whether thread-safety is required. @LukeButters

ghost commented 5 years ago

It would be preferable to change it from:

private void constructFailureStates() {
        final Queue<State> queue = new LinkedList<>();

to:

private void constructFailureStates() {
  final Queue<State> queue = createFailureStateQueue();
  // ...
}

protected Queue<State> createFailureStateQueue() {
  return new LinkedList<>();
}

However, that's all moot since Trie has been refactored extensively.

If performance is still an issue, please open another ticket showing the bottleneck.

LukeButters commented 5 years ago

Hi @YuyuZha0 thanks for replying! I think that in this case it was never possible for two threads to be sharing that data structure although it was so long ago I hardly remember.

@DaveJarvis I will try out the new trie, thanks!

YuyuZha0 commented 5 years ago

In this case, I would say ArrayDeque may be a good choice, it relies on array, should be more CPU-Cache friendlly. @LukeButters