robert-bor / aho-corasick

Java implementation of the Aho-Corasick algorithm for efficient string matching
Apache License 2.0
890 stars 348 forks source link

Performance improvement by using FastUtil Char2ObjectMap (concept PR) #90

Open omarshibli opened 3 years ago

omarshibli commented 3 years ago

Description

Suggestion for performance improvement by using FastUtil Char2ObjectMap, specialized char map implementation.

Changes

Observed results

Benchmark result raw files bench-post.2.txt bench-post.csv.txt bench-post.txt bench-pre.2.txt bench-pre.csv.txt bench-pre.txt

Down sides

fastutil dependecy is pretty big, benchmark fat jar is 21M after adding it, prior to that it was about 2.4M

Reference

https://github.com/robert-bor/aho-corasick/issues/31

codecov-io commented 3 years ago

Codecov Report

Merging #90 (45524fb) into master (012ad80) will decrease coverage by 25.34%. The diff coverage is 3.55%.

Impacted file tree graph

@@              Coverage Diff              @@
##             master      #90       +/-   ##
=============================================
- Coverage     89.56%   64.22%   -25.35%     
  Complexity      209      209               
=============================================
  Files            26       28        +2     
  Lines           479      668      +189     
  Branches         66       89       +23     
=============================================
  Hits            429      429               
- Misses           34      223      +189     
  Partials         16       16               
Impacted Files Coverage Δ Complexity Δ
src/main/java/com/intendia/jmh/GcProfiler.java 0.00% <0.00%> (ø) 0.00 <0.00> (?)
...rg/ahocorasick/benchmark/BenchmarkPayloadTrie.java 0.00% <0.00%> (ø) 0.00 <0.00> (?)
src/main/java/org/ahocorasick/trie/State.java 86.48% <66.66%> (ø) 20.00 <3.00> (ø)
...c/main/java/org/ahocorasick/trie/PayloadState.java 96.96% <100.00%> (ø) 21.00 <4.00> (ø)
...rc/main/java/org/ahocorasick/trie/PayloadTrie.java 95.83% <100.00%> (ø) 60.00 <0.00> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 012ad80...45524fb. Read the comment docs.

ghost commented 3 years ago

The increase is size may have unforeseen consequences with applications depending on this library, such as: embedded systems, increased download times, and so forth. Fortunately, we can employ a similar strategy to how JDBC works, with the downside that it'll be a fair amount more effort.

For example:

import static CollectionFactory.createHashMap;

public class C {
  final Map<String, String> map = createHashMap( ... );
}

The default implementation would use Java's standard libraries. If an implementation exists for a drop-in replacement---fastutil, HPPC, Koloboke, or HFTC---then that library should be used through dynamic detection.

In this way, if developers decide to add the dependency, then it will be used. If they don't import the dependency, then the standard JDK is used and the resulting JAR file won't explode by an extra 20MB. (ProGuard can reduce the number of bundled classes to a bare minimum, but its usage not something we can impose upon developers.)

For cases where there isn't a direct 1:1 correlation (e.g., char primitive instead of Character class), then perhaps the strategy pattern or a functional interface may be appropriate to abstract away the implementation detail.

omarshibli commented 3 years ago

Thanks for the input. I'm not sure if JDBC-style is the optimal approach here, the usage pattern of this library is very well defined.

ghost commented 3 years ago

Giving that the direction that I've in mind is to find the optimal implementation and add it the repo.

What other ways can you see to get these performance gains that don't require developers who import this software to get a surprise 20MB increase to their own applications?

Note: A PR that incurs a 10-fold increase to the final JAR size will not be accepted.

omarshibli commented 3 years ago

I'm totally with you. As I mentioned giving that the usage pattern is well defined, I want to find optimal implementation and copy it into the repo, so we will just have an increase of few classes.

omarshibli commented 3 years ago

I'm also checking other libraries: https://github.com/thomasmueller/minperf https://github.com/nicholassm/perfect-hashmaps