nitely / nim-regex

Pure Nim regex engine. Guarantees linear time matching
https://nitely.github.io/nim-regex/
MIT License
227 stars 20 forks source link

NFA macro + better NFA matcher #58

Closed nitely closed 4 years ago

nitely commented 4 years ago

Changes:

Needs benchmarking to check for regressions.

closes #57 closes #28

nitely commented 4 years ago

Quick benchmark:

============================================================================
GlobalBenchmark                                 relative  time/iter  iters/s
============================================================================
GlobalBenchmark                                            294.86ps    3.39G
============================================================================
somere.nim                                      relative  time/iter  iters/s
============================================================================
re_sol                                                       1.10ms   910.91
regex_sol                                         64.79%     1.69ms   590.20
re_nums                                                    171.08ns    5.85M
regex_nums2                                      88.46%   193.39ns    5.17M
re_nums                                                    176.39ns    5.67M
regex_nums                                       95.51%   184.69ns    5.41M

Nim's re VS nim-regex macro-NFA.

nitely commented 4 years ago

Almost done. I need to implement the find version that takes O(N*M) time, and benchmark it against the new O(N^2) version. It should be faster when N > M assuming it does not add much overhead.

nitely commented 4 years ago

benchmarks against nim-regex 0.13:

============================================================================
GlobalBenchmark                                 relative  time/iter  iters/s
============================================================================
GlobalBenchmark                                            294.87ps    3.39G
============================================================================
bench.nim                                       relative  time/iter  iters/s
============================================================================
regex_old_sol                                               42.60ms    23.48
regex_sol                                                    1.85ms   541.34
regex_old_nums                                               2.43us  410.78K
regex_nums                                                 206.64ns    4.84M
regex_old_nums2                                              2.92us  342.48K
regex_nums2                                                231.74ns    4.32M
regex_old_lits_find                                         41.71ms    23.97
regex_lits_find                                              1.03ms   968.61
email_old_find_all                                            1.27s  787.08m
email_find_all                                             472.56ms     2.12
dummy                                                        0.00fs      inf
timotheecour commented 4 years ago

this sounds amazing!

a few points:

Remove epsilon-transitions

nitely commented 4 years ago

1st benchmark shows:re_nums instead of re_nums2; not sure whether both algos are comparing the same

They are. I've fixed the name.

maybe benchmark could add comparison against nre as well

IIRC, Araq said re is faster than nre, but I can add it, no problem.

-d:danger should be used for benchmarking; perhaps also --passc:-flto;

I did set -d:danger. -d:release is only 10% slower, IIRC.

maybe add nim 1.2 to CI now that it was released

Yeah, I'm using nim's docker. I'm sure it will be added in a few days.

haven't looked at the code change in detail but does that come with any loss of functionality?

No, it's just an optimization that avoids recursion in the matching algorithm.

thomastay commented 4 years ago

FWIW, I came across this PR when trying to benchmark mariomka for nim. Seems like most of the discussion has already been had in #57, but I wanted to just add in the benchmarks that I measured on my machine. Code is here in case I did something wrong; I am rather new to this repo.

Flags: -d:danger
Nim version: 1.3.1
nim-regex: 0.14.1

Benching with default GC, taking mean of 3 iterations
std/re: 28.0 - 92
std/re: 28.0 - 5301
std/re: 9.0 - 5
nim-regex: 371.0 - 92
nim-regex: 344.3333333333333 - 5301
nim-regex: 438.6666666666667 - 5

Also, I don't think a hybrid approach for nim-regex and PCRE would be a good idea, since the selling point of nim-regex is that it's a pure Nim implementation.

nitely commented 4 years ago

find and findAll can be ~10x slower on some regexes due to lack of literal optimization (i.e: a quick memchr is done to find potential matches). I'm aware of it (see #59) and I'll optimize it as soon as I've some free time.

timotheecour commented 4 years ago

Also, I don't think a hybrid approach for nim-regex and PCRE would be a good idea, since the selling point of nim-regex is that it's a pure Nim implementation.

I'm not buying that argument.

At the end of the day, we should do what's practical and useful. Certainly nim-regex can have a pure nim implementation that keeps improving, and, all else being equal (eg performance), using a pure nim implementation is better, but nothing prevents it from benefiting from optimizations which could be controlled by some flags or library options.

This allows user code to stay identical (offering same interface) regardless of whether PCRE or other library is used (implementation detail).

Very concretely, the main point of pure nim implementation is making it available on all backends/environments (eg nim vm, maybe nim js or nimscript, or in cases where PCRE is not available); but, where performance matters and PCRE is available and higher performance than the nim engine, there is no practical reason not to benefit from it (other than implementing that bridge, which should be a lot easier than making improvements on the engine itself)

+1 for providing a benchmark though

nitely commented 4 years ago

@thomastay @timotheecour good news, my regex literal optimization is 34x faster than re for the email regex, and about the same speed for the rest. It's not even the macro version, that one would be faster, but I don't think I'll implement it.

============================================================================
GlobalBenchmark                                 relative  time/iter  iters/s
============================================================================
GlobalBenchmark                                            294.86ps    3.39G
============================================================================
bench.nim                                       relative  time/iter  iters/s
============================================================================
re_email_find_all                                           24.19ms    41.33
email_find_all                                  3389.08%   713.91us    1.40K
re_uri_find_all                                             24.75ms    40.40
uri_find_all                                      92.92%    26.64ms    37.54
re_ip_find_all                                               6.41ms   156.06
ip_find_all                                      101.27%     6.33ms   158.04
runes                                                        5.99ms   167.04
dummy                                                        0.00fs      inf