google / re2j

linear time regular expression matching in Java
Other
1.19k stars 160 forks source link

Re2j runs 5x slower than java.util.regex #162

Open YuyuZha0 opened 1 year ago

YuyuZha0 commented 1 year ago

Here is my benchmark:


import org.apache.commons.lang3.RandomStringUtils;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Threads;
import org.openjdk.jmh.annotations.Warmup;

import java.util.Iterator;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

@Warmup(iterations = 2, time = 3)
@Measurement(iterations = 2, time = 3)
@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Threads(3)
@Fork(1)
/*
 * <pre>
 *     Benchmark                Mode  Cnt     Score   Error  Units
 * Re2jBenchmark.javaMatch  avgt    2   175.926          ns/op
 * Re2jBenchmark.re2jMatch  avgt    2  1067.591          ns/op
 *     <pre/>
 */
public class Re2jBenchmark {

  private static final String PATTERN = "[0-5][0-6]{1,3}[1-7]{1,2}[2580]{1,3}[127]{0,8}";
  private static final Pattern p1 = Pattern.compile(PATTERN);
  private static final com.google.re2j.Pattern p2 = com.google.re2j.Pattern.compile(PATTERN);

  private List<String> inputs;

  private Iterator<String> iterator;

  @Setup
  public void setup() {
    inputs =
        IntStream.rangeClosed(0, 999)
            .mapToObj(n -> RandomStringUtils.randomNumeric(32))
            .collect(Collectors.toList());
    iterator = inputs.listIterator();
  }

  private String nextInput() {
    if (!iterator.hasNext()) {
      iterator = inputs.listIterator();
    }
    return iterator.next();
  }

  @Benchmark
  public boolean javaMatch() {
    return p1.matcher(nextInput()).find();
  }

  @Benchmark
  public boolean re2jMatch() {
    return p2.matcher(nextInput()).find();
  }
}

here is the result:

Benchmark                Mode  Cnt     Score   Error  Units
Re2jBenchmark.javaMatch  avgt    2   175.926          ns/op
Re2jBenchmark.re2jMatch  avgt    2  1067.591          ns/op
RG9 commented 1 year ago

I get similar results (RE2 is 3x slower):

  Benchmark                                 Mode  Cnt       Score      Error  Units
  Re2Benchmark.wholeWordsEndingWithNn_jdk  thrpt    5       4.496 ±    0.292  ops/s
  Re2Benchmark.wholeWordsEndingWithNn_re2  thrpt    5       1.582 ±    0.124  ops/s

Note: I tested on sample (16MB book) from https://github.com/rust-leipzig/regex-performance and pattern \b\w+nn\b for which RE2 excels in their benchmark.

On the other hand, RE2 is tremendously faster for regex with 100 alternations (first names) on the same sample, which is consistent with https://github.com/google/re2j#why-should-i-switch:

  Benchmark                                 Mode  Cnt       Score      Error  Units
  Re2Benchmark.manyAlternation_jdk         thrpt    5       0.278 ±    0.002  ops/s
  Re2Benchmark.manyAlternations_re2        thrpt    5     111.601 ±    0.324  ops/s

Note: length of input text also matters

Code: https://github.com/RG9/rg-regex-demo/commit/209d9b4eb2c883587f4e4d8b584e6f17f62d41ee#diff-fe662066dca41f899d163250973f0db8e950953eb9b03a7d08c71f5fdb6a0d70