Open fbacchella opened 6 years ago
I improved the grok test:
@Benchmark
public Match grokSpeed() {
    Match gm = grok.match("<1>totor");
    gm.captures();
    Map<String, Object> mapped = gm.toMap();
    assert mapped.get("syslog_pri") != null;
    assert mapped.get("message") != null;
    return gm;
}
But the results are still bad:
Benchmark                   Mode  Cnt  Score   Error  Units
GrokSpeed.grokSpeed         avgt    5  1.975 ± 0.012  us/op
GrokSpeed.javaRegexSpeed    avgt    5  0.255 ± 0.019  us/op
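For readers wondering what the plain-regex side of such a comparison involves, here is a minimal sketch that extracts the same two fields with java.util.regex. The pattern and the class name `RegexBaseline` are assumptions made for illustration; this is not the code from the grokspeed project. Note that Java group names cannot contain an underscore, so the group is called `syslogpri` and mapped to the `syslog_pri` key afterwards.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexBaseline {
    // Hypothetical pattern capturing a syslog priority and the rest of the line.
    // Compiled once, as a benchmark would do outside the measured method.
    private static final Pattern SYSLOG =
            Pattern.compile("<(?<syslogpri>\\d+)>(?<message>.*)");

    /** Returns the captured fields, or an empty map when the line does not match. */
    public static Map<String, Object> parse(String line) {
        Map<String, Object> fields = new LinkedHashMap<>();
        Matcher m = SYSLOG.matcher(line);
        if (m.matches()) {
            fields.put("syslog_pri", m.group("syslogpri"));
            fields.put("message", m.group("message"));
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parse("<1>totor")); // prints {syslog_pri=1, message=totor}
    }
}
```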
@fbacchella Thanks for this benchmark tool.
We ran it with @keitaf's PR (https://github.com/paulwellnerbou/java-grok/pull/1) using our fork: https://github.com/dashbase/grokspeed/commit/7d0c7d25aaec426d2f9dd47edc4916140f22bc2a
And we see significant gains:
before:
Benchmark                     Mode  Cnt  Score   Error  Units
GrokSpeed.grokSpeed           avgt    5  1.724 ± 0.092  us/op
GrokSpeed.javaRegexSpeed      avgt    5  0.132 ± 0.007  us/op
GrokSpeed.notrealgrokSpeed    avgt    5  1.565 ± 0.094  us/op
after:
Benchmark                     Mode  Cnt  Score   Error  Units
GrokSpeed.grokSpeed           avgt    5  0.352 ± 0.012  us/op
GrokSpeed.javaRegexSpeed      avgt    5  0.137 ± 0.006  us/op
GrokSpeed.notrealgrokSpeed    avgt    5  0.341 ± 0.010  us/op
Some API changes are necessary for the optimization to work.
Very nice. The next step for Grok's performance would be the ability to change the regex engine used. Not all engines are created equal, as numerous benchmarks, including mine, have shown: https://github.com/fbacchella/RegexPerf. But the most important work has been done.
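One way a swappable engine could look is sketched below. Everything here is hypothetical: `RegexEngine`, `CompiledRegex` and `JavaRegexEngine` are names invented for this example and are not part of java-grok. The group names are passed in explicitly because, on Java 8, java.util.regex offers no public way to list a pattern's named groups.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PluggableRegex {
    /** Hypothetical engine abstraction; another engine would provide its own implementation. */
    public interface RegexEngine {
        CompiledRegex compile(String pattern, Set<String> groupNames);
    }

    /** Hypothetical compiled pattern: returns captures by name, or null on no match. */
    public interface CompiledRegex {
        Map<String, String> match(String input);
    }

    /** Default implementation backed by java.util.regex. */
    public static final class JavaRegexEngine implements RegexEngine {
        @Override
        public CompiledRegex compile(String pattern, Set<String> groupNames) {
            Pattern p = Pattern.compile(pattern); // compiled once, reused for every match
            return input -> {
                Matcher m = p.matcher(input);
                if (!m.matches()) {
                    return null;
                }
                Map<String, String> captures = new HashMap<>();
                for (String name : groupNames) {
                    captures.put(name, m.group(name));
                }
                return captures;
            };
        }
    }

    public static void main(String[] args) {
        RegexEngine engine = new JavaRegexEngine();
        Set<String> names = new HashSet<>(Arrays.asList("pri", "message"));
        CompiledRegex regex = engine.compile("<(?<pri>\\d+)>(?<message>.*)", names);
        System.out.println(regex.match("<1>totor").get("message")); // prints totor
    }
}
```

With this shape, the caller only ever sees `RegexEngine`, so an engine that works on byte[] could do its own encoding behind the same `match(String)` call.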
Thanks @fbacchella Any reason to choose joni over jregex? Jregex seems to be faster. Is the main reason that joni is operating on byte[] instead of String?
jregex was unable to handle big regex and the last release is from 2002.
I put joni into my performance improvement branch and ran the performance test. https://github.com/dashbase/java-grok/tree/oniguruma
java.regex
Benchmark                     Mode  Cnt  Score   Error  Units
GrokSpeed.grokSpeed           avgt    5  0.332 ± 0.010  us/op
GrokSpeed.javaRegexSpeed      avgt    5  0.125 ± 0.009  us/op
GrokSpeed.notrealgrokSpeed    avgt    5  0.325 ± 0.014  us/op
joni
Benchmark                     Mode  Cnt  Score   Error  Units
GrokSpeed.grokSpeed           avgt    5  0.525 ± 0.016  us/op
GrokSpeed.javaRegexSpeed      avgt    5  0.124 ± 0.016  us/op
GrokSpeed.notrealgrokSpeed    avgt    5  0.475 ± 0.017  us/op
Looks like joni doesn't give us a performance boost, mostly because of the String/UTF-16 <-> byte[]/UTF-8 conversion cost.
Strange, because my RegexPerf.org_joni test includes the conversion, so I should get similar results. But anyway, that's why I also tested the fastest way to extract bytes from a String in RegexPerf. Working with pure ASCII when you can is much faster. I work on log parsing, and logs are mostly ASCII, so it's interesting in this case; see https://github.com/fbacchella/LogHub/blob/master/src/main/java/loghub/processors/OnigurumaRegex.java. It tries ASCII first and falls back to UTF-8 if that fails. OnigurumaRegex also handles big searches much better. That's why I think a generic interface, with a default implementation using Java's regex, would be helpful for people who process pure ASCII or big strings.
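The "ASCII first, UTF-8 otherwise" idea can be sketched roughly as follows. This is not the LogHub code, only an illustration of the approach; `AsciiFirst` and `toBytes` are made-up names. For a pure ASCII string, the result is byte-for-byte identical to its UTF-8 encoding, so a byte-oriented engine can consume either output the same way.

```java
import java.nio.charset.StandardCharsets;

public class AsciiFirst {
    /**
     * Copies chars straight into a byte[] while they are all 7-bit ASCII,
     * and falls back to a full UTF-8 encode at the first non-ASCII char.
     */
    public static byte[] toBytes(String s) {
        int len = s.length();
        byte[] ascii = new byte[len];
        for (int i = 0; i < len; i++) {
            char c = s.charAt(i);
            if (c > 0x7F) {
                // Not pure ASCII: discard the partial copy and encode properly.
                return s.getBytes(StandardCharsets.UTF_8);
            }
            ascii[i] = (byte) c;
        }
        return ascii;
    }

    public static void main(String[] args) {
        System.out.println(toBytes("<1>totor").length); // prints 8
        System.out.println(toBytes("\u00e9").length);   // prints 2 (UTF-8 fallback)
    }
}
```

Whether this actually beats a plain `getBytes` call depends on the JVM version and the input, so it is worth measuring with JMH before adopting it.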
And after the updates:
# JMH version: 1.19
# VM version: JDK 1.8.0_162, VM 25.162-b12
...
Benchmark                     Mode  Cnt  Score   Error  Units
GrokSpeed.grokSpeed           avgt    5  0.357 ± 0.005  us/op
GrokSpeed.javaRegexSpeed      avgt    5  0.143 ± 0.004  us/op
GrokSpeed.notrealgrokSpeed    avgt    5  0.356 ± 0.004  us/op
Tested with Java 10:
# JMH version: 1.19
# VM version: JDK 10, VM 10+46
...
Benchmark                     Mode  Cnt  Score   Error  Units
GrokSpeed.grokSpeed           avgt    5  0.441 ± 0.008  us/op
GrokSpeed.javaRegexSpeed      avgt    5  0.130 ± 0.002  us/op
GrokSpeed.notrealgrokSpeed    avgt    5  0.428 ± 0.020  us/op
Java's regex got faster, but grok got slower!
I'm using JMH (from OpenJDK Code Tools) to benchmark grok against Java's regex.
The result for the following simple code:
returns, on an Intel Xeon E312xx:
That's 11 times slower!
The full Maven project for running the tests is fbacchella/grokspeed. It's run with:
mvn clean package && java -jar target/grokspeed.jar