nitely / nim-regex

Pure Nim regex engine. Guarantees linear time matching
https://nitely.github.io/nim-regex/
MIT License
225 stars 20 forks source link

Slow performance with large amount of groups #138

Open bobbbob98 opened 3 months ago

bobbbob98 commented 3 months ago

Essentially I'm reworking a php bot detection library in nim. It has a list of crawlers and combines them with groups into one big regex, like (bot1|bot2|bot3|....). There's around a thousand crawlers in the list. It takes around half an hour to compile with the release flag and on a dataset of around 40000 user agents it took a few minutes to check them all. With std/re it compiles more or less instantly and checks that dataset in a few seconds. Here's an example code, files are attached: crawlers.txt useragents.txt

import regex, strutils
const crawlerRegex = re2('(' & join(splitLines(staticRead("crawlers.txt")), "|") & ')')
let uas = splitLines(readFile("useragents.txt"))
for ua in uas:
    echo ua, ", ", contains(ua, crawlerRegex)
nitely commented 3 months ago

there is more than one issue it seems.

the slow compile time seems to be some Nim VM bottleneck. It should not take that long, even for a 17K length regex. The runtime version below takes less than a second to compile the regex.

disabling the capturing (?:...) makes the matching much faster for some reason.

PCRE indeed takes a few seconds to run (~8 in my machine), while nim-regex takes ~80s. 10x slower is far too slow.

import std/strutils
import ../src/regex
#import re

proc main() =
  let crawlers = readFile("./tests/crawlers.txt")
  let crawlerRegex = re2("(?:" & join(splitLines(crawlers), "|") & ')')   # note the (?:
  echo "compiled"
  var useragents = readFile("./tests/useragents.txt")
  let uas = splitLines(useragents)
  var i = 0
  for ua in uas:
    i += int(contains(ua, crawlerRegex))
  echo i
main()
echo "ok"
$ time nim c -r -d:release tests/testbug.nim
[...]
Hint: mm: orc; threads: on; opt: speed; options: -d:release
61922 lines; 1.360s; 110.156MiB peakmem; [...]
compiled
20902
ok

real    1m9.897s
user    1m9.735s
sys     0m0.157s
nitely commented 3 months ago

The slow compile time seems to be a C/C++ backend issue. If you compile for js it only takes 15s. I've tried adding a debugEcho after the regex compilation and it's printed right away, so it looks like most time is spent emitting the regex data structure in C/C++.

const crawlers = staticread("./crawlers.txt")
echo crawlers.len
const reg = "(?:" & join(splitLines(crawlers), "|") & ')'
echo reg.len
const crawlerRegex = re2(reg)
static: echo "COMPILE DONE"
static: echo crawlerRegex.isInitialized
echo crawlerRegex.isInitialized
nitely commented 3 months ago

reported upstream https://github.com/nim-lang/Nim/issues/23480