Open bobbbob98 opened 8 months ago
There is more than one issue here, it seems.
The slow compile time looks like a Nim VM bottleneck. It should not take that long, even for a ~17K-character regex; the runtime version below compiles the same regex in under a second.
Using a non-capturing group (?:...) instead of a capturing one makes the matching much faster, for some reason.
PCRE indeed takes a few seconds to run (~8s on my machine), while nim-regex takes ~80s. Being 10x slower is far too slow.
import std/strutils
import ../src/regex
#import re

proc main() =
  let crawlers = readFile("./tests/crawlers.txt")
  # one big alternation over all crawler patterns; note the (?:
  let crawlerRegex = re2("(?:" & join(splitLines(crawlers), "|") & ')')
  echo "compiled"
  let useragents = readFile("./tests/useragents.txt")
  let uas = splitLines(useragents)
  # count user agents matched by any crawler pattern
  var i = 0
  for ua in uas:
    i += int(contains(ua, crawlerRegex))
  echo i

main()
echo "ok"
$ time nim c -r -d:release tests/testbug.nim
[...]
Hint: mm: orc; threads: on; opt: speed; options: -d:release
61922 lines; 1.360s; 110.156MiB peakmem; [...]
compiled
20902
ok
real 1m9.897s
user 1m9.735s
sys 0m0.157s
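To isolate the capturing vs non-capturing difference at runtime, a minimal timing sketch along these lines could be used (the file paths match the test above; the timeIt helper and the millisecond reporting are just for illustration, not part of the library):

import std/[monotimes, times, strutils]
import ../src/regex

let alts = join(splitLines(readFile("./tests/crawlers.txt")), "|")
let uas = splitLines(readFile("./tests/useragents.txt"))

proc timeIt(label, pattern: string) =
  let rx = re2(pattern)  # runtime compilation; well under a second here
  let start = getMonoTime()
  var matched = 0
  for ua in uas:
    matched += int(contains(ua, rx))
  echo label, ": ", matched, " matches in ",
    (getMonoTime() - start).inMilliseconds, " ms"

timeIt("capturing (...)", "(" & alts & ")")
timeIt("non-capturing (?:...)", "(?:" & alts & ")")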
The slow compile time seems to be a C/C++ backend issue: compiling for the JS backend only takes ~15s. I've tried adding a debugEcho right after the regex compilation and it's printed right away, so most of the time appears to be spent emitting the compiled regex data structure into the generated C/C++ code.
import std/strutils
import ../src/regex

const crawlers = staticRead("./crawlers.txt")
echo crawlers.len
const reg = "(?:" & join(splitLines(crawlers), "|") & ')'
echo reg.len
const crawlerRegex = re2(reg)
static: echo "COMPILE DONE"  # printed right away, before the backend stage
static: echo crawlerRegex.isInitialized
echo crawlerRegex.isInitialized
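One possible workaround (not from the thread, just a sketch): keep staticRead so the pattern string is built at compile time, but call re2 at runtime, so the large compiled-regex structure is never emitted into the generated C code. Per the earlier comment, runtime compilation of this regex takes under a second.

import std/strutils
import ../src/regex

# pattern string assembled in the VM; only the string ends up in the C output
const pattern = "(?:" & join(splitLines(staticRead("./crawlers.txt")), "|") & ')'

# regex compiled at program start instead of at compile time
let crawlerRegex = re2(pattern)
echo crawlerRegex.isInitialized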
Reported upstream: https://github.com/nim-lang/Nim/issues/23480
Essentially, I'm reworking a PHP bot detection library in Nim. It has a list of crawlers and combines them with groups into one big regex, like (bot1|bot2|bot3|...). There are around a thousand crawlers in the list. It takes around half an hour to compile with the release flag, and on a dataset of around 40,000 user agents it took a few minutes to check them all. With std/re it compiles more or less instantly and checks the same dataset in a few seconds. Here's some example code; the files are attached: crawlers.txt, useragents.txt
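(The attached example isn't reproduced here; for reference, the std/re version of the same check could be sketched roughly as follows, with file paths assumed to match the attachments.)

import std/[re, strutils]

proc main() =
  let crawlers = readFile("./crawlers.txt")
  # std/re (PCRE): one big alternation, same shape as the nim-regex version above
  let crawlerRegex = re("(?:" & join(splitLines(crawlers), "|") & ')')
  var matched = 0
  for ua in splitLines(readFile("./useragents.txt")):
    if ua.contains(crawlerRegex):
      inc matched
  echo matched

main()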