Grok is slow - Githubissues

fbacchella commented 6 years ago

I'm using Code Tools: jmh to bench grok against java's regex.

The result for the following simple code:

    private io.thekraken.grok.api.Grok grok;
    private Pattern syslog;

    @Setup
    public void prepare() throws GrokException {
        grok = new io.thekraken.grok.api.Grok();
        grok.addPattern("NONNEGINT", "\\b(?:[0-9]+)\\b");
        grok.addPattern("GREEDYDATA", ".*");
        grok.compile("^<%{NONNEGINT:syslog_pri}>%{GREEDYDATA:message}", false);
        syslog = Pattern.compile("<(?<syslogpri>\\b(?:[0-9]+)\\b)>(?<message>.*)");
    }

    @Benchmark
    public Match grokSpeed() {
        Match gm = grok.match("<1>totor");
        gm.captures();
        assert gm.toMap().get("syslog_pri") != null;
        assert gm.toMap().get("message") != null;
        return gm;
    }

    @Benchmark
    public Matcher javaRegexSpeed() {
        Matcher m = syslog.matcher("<1>totor");
        m.matches();
        assert m.group("syslogpri") != null;
        assert m.group("message") != null;
        return m;
    }

returns, on a Intel Xeon E312xx:

GrokSpeed.grokSpeed       avgt    5  1.952 ± 0.053  us/op
GrokSpeed.javaRegexSpeed  avgt    5  0.165 ± 0.005  us/op

That's 11 time slower !

The full maven project for running tests is: fbacchella/grokspeed. It's run with mvn clean package && java -jar target/grokspeed.jar

fbacchella commented 6 years ago

I improve the grok test:

    @Benchmark
    public Match grokSpeed() {
        Match gm = grok.match("<1>totor");
        gm.captures();
        Map<String, Object> mapped = gm.toMap();
        assert mapped.get("syslog_pri") != null;
        assert mapped.get("message") != null;
        return gm;
    }

But the results are still bad:

Benchmark                 Mode  Cnt  Score   Error  Units
GrokSpeed.grokSpeed       avgt    5  1.975 ± 0.012  us/op
GrokSpeed.javaRegexSpeed  avgt    5  0.255 ± 0.019  us/op

javasoze commented 6 years ago

@fbacchella Thanks for this benchmark tool.

We ran this with @keitaf's PR: https://github.com/paulwellnerbou/java-grok/pull/1 with our fork: https://github.com/dashbase/grokspeed/commit/7d0c7d25aaec426d2f9dd47edc4916140f22bc2a

And we see significant gains:

before:

Benchmark                   Mode  Cnt  Score   Error  Units
GrokSpeed.grokSpeed         avgt    5  1.724 ± 0.092  us/op
GrokSpeed.javaRegexSpeed    avgt    5  0.132 ± 0.007  us/op
GrokSpeed.notrealgrokSpeed  avgt    5  1.565 ± 0.094  us/op

after:

Benchmark                   Mode  Cnt  Score   Error  Units
GrokSpeed.grokSpeed         avgt    5  0.352 ± 0.012  us/op
GrokSpeed.javaRegexSpeed    avgt    5  0.137 ± 0.006  us/op
GrokSpeed.notrealgrokSpeed  avgt    5  0.341 ± 0.010  us/op

There are some API changes necc. for the optimization to work.

fbacchella commented 6 years ago

Very nice. The next step in Grok's performance would be able to change the regex engine used. Not all of them are created equals, as numerous benches, included mine, has shown that: https://github.com/fbacchella/RegexPerf. But the most import works has been done.

javasoze commented 6 years ago

Thanks @fbacchella Any reason to choose joni over jregex? Jregex seems to be faster. Is the main reason that joni is operating on byte[] instead of String?

fbacchella commented 6 years ago

jregex was unable to handle big regex and the last release is from 2002.

keitaf commented 6 years ago

I put joni into my performance improvement branch and ran the performance test. https://github.com/dashbase/java-grok/tree/oniguruma

java.regex

Benchmark                   Mode  Cnt  Score   Error  Units
GrokSpeed.grokSpeed         avgt    5  0.332 ± 0.010  us/op
GrokSpeed.javaRegexSpeed    avgt    5  0.125 ± 0.009  us/op
GrokSpeed.notrealgrokSpeed  avgt    5  0.325 ± 0.014  us/op

joni

Benchmark                   Mode  Cnt  Score   Error  Units
GrokSpeed.grokSpeed         avgt    5  0.525 ± 0.016  us/op
GrokSpeed.javaRegexSpeed    avgt    5  0.124 ± 0.016  us/op
GrokSpeed.notrealgrokSpeed  avgt    5  0.475 ± 0.017  us/op

Looks like joni doesn't give us performance boost, mostly because of String/UTF-16 <-> byte[]/UTF-8 conversion cost.

fbacchella commented 6 years ago

Strange because my RegexPerf.org_joni test includes the conversion, so I should bet results similar. But anyway, that's why I tested that too in RegexPerf: the fastest way to extract bytes from a String. Working with pure ascii if you can is much faster. I'm working with log parsing, which are mainly made from ascii, so it's intersting in this case, see https://github.com/fbacchella/LogHub/blob/master/src/main/java/loghub/processors/OnigurumaRegex.java. It tries using ascii but if it can't it tries UTF-8. OnigurumaRegex also handle big search much better. That's why I think a generic interface with a default implementation using java's regex would be helpful for peoples that process pure ascii or big strings.

fbacchella commented 6 years ago

And after the updates:

# JMH version: 1.19
# VM version: JDK 1.8.0_162, VM 25.162-b12
...
Benchmark                   Mode  Cnt  Score   Error  Units
GrokSpeed.grokSpeed         avgt    5  0.357 ± 0.005  us/op
GrokSpeed.javaRegexSpeed    avgt    5  0.143 ± 0.004  us/op
GrokSpeed.notrealgrokSpeed  avgt    5  0.356 ± 0.004  us/op

Tested with Java 10:

# JMH version: 1.19
# VM version: JDK 10, VM 10+46
...
Benchmark                   Mode  Cnt  Score   Error  Units
GrokSpeed.grokSpeed         avgt    5  0.441 ± 0.008  us/op
GrokSpeed.javaRegexSpeed    avgt    5  0.130 ± 0.002  us/op
GrokSpeed.notrealgrokSpeed  avgt    5  0.428 ± 0.020  us/op

Java's regex improved, but grok decreased !

thekrakken / java-grok

Grok is slow #75