Upgrade to pcre2 - Githubissues

Mygod commented 6 years ago

https://android.googlesource.com/platform/external/pcre/+/android-8.0.0_r34/dist2/

madeye commented 6 years ago

As pcre2 is a very new library, the concern is it would get shadowsocks-libev not working on Ubuntu 14.04 and related distributions.

wongsyrone commented 6 years ago

PCRE2, was released in 2015

2 years later, not quite new. I checked new API before, got confused and gave up.

wongsyrone commented 6 years ago

Please check https://github.com/shadowsocks/shadowsocks-libev/pull/1792

madeye commented 6 years ago

Let's wait another two years after 14.04 LTS reaches its end of life.

wongsyrone commented 6 years ago

I prefer to ask Travis to upgrade their building environments instead of waiting.

librehat commented 6 years ago

It's not just a problem about Travis. There are many users on those old but still supported Linux distributions.

wongsyrone commented 6 years ago

users on those old but still supported

Not the reason for sticking to outdated things.

wongsyrone commented 6 years ago

OK. Issue solved. The tarball doesn't contain autogen.sh, the shipped libtool seems to cause CI problem.

Travis CI can compile this indeed.

linusyang commented 6 years ago

@Mygod @madeye @wongsyrone

Hi everyone! Besides upgrading to PCRE2, I recommend another regular expression library: sregex.

Reasons:

Pure C library, easy to embed
Perl/PCRE (mostly) compatible (except backtracking, which is not often used in ACL rules)
Modern NFA-based algorithm. No backtracking. Linear time complexity in the worst case.
Support compiling and matching multiple REs at once, similarly to RE2::Set (i.e. REs concatenated by |). Suitable for matching a bunch of ACL rules efficiently.
Optional JIT on x86-64 platform

Issues of current implementation in acl.c:

PCRE engine may have exponential time complexity in the worst case
Use a libcork list to match REs: O(n) time complexity

Sregex library is similar to modern linear-time RE engines used in Golang and Rust. They are all based on the idea of RE2 by Russ Cox. They trade backtracking with efficiency and simplicity. Other possible library choices, but I do not recommend:

RE2: written in C++ with C bindings, need to set max_mem parameter beforehand (otherwise the engine crashes)
Rust regex: written in Rust with C bindings, SIMD optimized, hard to embed
Hyperscan: written in C++ and Intel assembly, not portable
Gnu Regex: Exponential time complexity
PCRE/PCRE2: Exponential time complexity

I have tried to use sregex which only needs a small change of acl.c. The API is easy to use. An example can be found here.

wongsyrone commented 6 years ago

I would stick to the most widely used library to gain enhancements and security fixes.

linusyang commented 6 years ago

@wongsyrone Of course. My point is that we have alternative choices other than PCRE2. And there is no need to stick to PCRE if we do not need those backtracking features when there are other more efficient choices.

More and more modern languages are choosing RE2-like linear-time engines. They do not use PCRE because the Perl-style RE requires backtracking algorithm and causes unnecessary complexity and security issues. And I do not believe PCRE will switch to an NFA-based linear-time algorithm because it needs to maintain the compatibility of Perl REs.

madeye commented 6 years ago

@linusyang Could you submit a pull request? I think it's a time to switch to a RE2-like regex engine.

linusyang commented 6 years ago

@madeye No problem. I will first make a patch using sregex. And we can later discuss there to decide which RE2-like library to use.

madeye commented 6 years ago

Thanks!

But be aware that we may not achieve any notable performance improvement, as most of our rules are wildcard-like and the matching strings are short. Also, I expect PCRE has done a lot of optimizations, even though it's using exponential time algorithm.

linusyang commented 6 years ago

@madeye Our use case of ACL is that we match a single string against a bunch of REs. In such scenario, NFA-based engine outperforms the backtracking engine.

The reason is that, without backtracking, we can efficiently match a large RE concatenated with| operator. Thus, most RE2-based engines can provide a regex set data structure (which is essentially a large RE that combines all small REs by |) to match multiple REs at once efficiently, while PCRE cannot. If using PCRE, we have to stick to the current implementation in acl.c that matches a string against a list of REs.

I would like to quote the explanation from http://lh3lh3.users.sourceforge.net/reb.shtml:

Regex libraries are typically implemented with backtracking (BT) or finite state machine (FSA). The former approach is more flexible but can be exponential in time given some regex. The latter guarantees polynomial time in searching, but usually less flexible. RE2 is believed to be the only library that uses FSA and supports Perl-like syntax at the same time.

In practice, implementations based on BT and FSA seem to have similar performance on short regex. However, FSA may greatly outperform BT given the `|' operator in regex. This means for BT-based implementations, matching with multiple regex is preferred.

Mygod commented 6 years ago

I wouldn't worry too much about worst case exponential time complexity since (I believe) that would almost never happen. In fact I expect it to run in linear time on average for this use case. However it would be great if you can replace the entire libcork list with a single DFA/NFA.

P.S. I wouldn't call something invented in 1950s modern.

linusyang commented 6 years ago

@Mygod Simply ignore that word. It is just promotion. Just like people call lambda syntax modern, which was invented in 1930s.

Edit: For the worst case problem, the user can always write a malicious RE to make the ACL matching process get stuck. If we switch to a linear-time engine, this would never happen.

Mygod commented 6 years ago

OK cool.

Also I suspect preprocessing regexp into DFA/NFA is going to take a lot of computational resources (based on my past experiences). Considering our regexp could be up to 200KB long, would it take too long just to compile the regexp (especially on mobile devices), or is there any way to "compile" and cache the results?

EDIT: Wait I just realized you're talking about NFAs. In that case, I doubt there would be any improvements in performance. Are there any DFA-based engines? Considering our ACL files could be as large as 200KB and it is rarely updated, maybe a precompiled DFA would be much more beneficial.

linusyang commented 6 years ago

@Mygod For sregex library, it compiles the RE string into bytecode and can be stored efficiently. The bytecode only needs to be created once initially.

I think sregex is lightweight enough for mobile users. I have used sregex-based client for several weeks on my own Android phone. The bytecode compilation finishes almost immediately (from the logcat timestamp) for the current China ACL list.

For RE matching, a Thompson VM needs to be created to execute the bytecode. The official API requires to create a new VM for each match because it was designed for stream processing. I find this quite memory inefficient for our use case. I found a way to reset the state of VM without the need to re-create the VM.

Mygod commented 6 years ago

I took a look at sregex and I found this in its TODO:

add a bytecode optimizer to the regex VM (which also generates minimized DFAs for the Thompson VM).

Therefore I don't think there will be a big difference to use this library since Thompson's construction and NFA simulation would also take O(m) preprocessing time and O(mn) matching time. So that would be less ideal for our use case.

Is it possible to use DFA in Google's re2 instead?

linusyang commented 6 years ago

@Mygod Yes, changing to RE2 is even simpler. I actually tried RE2 before sregex, and it works well on mobile phones. But I do not recommend to use as it is a C++ library which brings another dependency. You also need to calculate the max_mem option, according to the input ACL rule size.

For sergex, we can implement the DFA minimization in our own fork. I think the upstream has not been updated for a while. Or we can find other alternative C libraries (though I could not find many).

By the way, PCRE also contains a DFA engine, but it does not perform well.

Mygod commented 6 years ago

I'm not sure about the details but I believe it's pretty computationally intensive to do DFA minimization (also I suspect it would be complicated to implement as well). And I don't see the necessity of replacing backtracking with NFA.

Performance-wise, I suspect the page you're referring to is doing the benchmark wrong. For our use case, we should really be benchmarking matching time instead of compiling time + matching time since our rules wouldn't change throughout the lifecycle of the application. At a high level, DFA matching is putting a lot of computations at preprocessing/compiling phase thus it wouldn't perform well for one-time regexps.

linusyang commented 6 years ago

DFA minimization is just an example of optimization you mentioned that missing in current sregex implementation. If DFA minimization is time consuming, we can do what RE2 optimizes in sregex as well. As I mentioned, RE2 works well and does not consume a lot of pre-processing time, for the current ACL lists used in Android clients.

If you think sregex is not fast enough, then we just stick to PCRE. Or maybe later, you can port ACL engine to RE2. RE2 may have other problems because of its DFA implementation, e.g., performance degradation for certain inputs, etc., which is another story.

The point of showing that benchmark is that even calculating C + M time, RE2 is significantly faster than DFA backend of PCRE. There is no need to change to DFA-based PCRE backend, if we can use RE2.

Edit: BTW, sregex has an unreleased DFA backend prototype, but it is not quite usable in a normal project.

Mygod commented 6 years ago

Hmm I just think it might not be worth the trouble to make sregex work if we can just use RE2.

linusyang commented 6 years ago

No problem. I just point out an alternative C library that might be useful.

BTW, what RE2 does is called lazy DFA construction, but not DFA minimization (which is ahead-of-time, and costs O(2^m) time). The method is to lazily construct DFA during NFA simulation. This algorithm requires O(m) more space for DFA cache. Some implementation details can be found at rust-lang/regex, issue 66.

Edit: It seems GNU grep uses the same algorithm (lazy DFA) as RE2 and it is written in C. Maybe this could be potentially used as an efficient (and also mature) pure C RE engine.

Mygod commented 6 years ago

Lazy DFA sounds well suited to our purpose. Maybe we should do that instead. 👍

linusyang commented 6 years ago

@Mygod I think it is not really hard to implement lazy DFA on top of sregex's NFA implementation. I can do a quick hack on that to see if this could improve the performance a little bit. (Though RE2 has more optimizations other than lazy DFA 😄) On the mean time, we may just consider switching to RE2 if necessary (or just stick to PCRE/PCRE2).

linusyang commented 6 years ago

@Mygod @madeye @wongsyrone

I made some benchmarks to test different RE engines. 19 tests are run against bypass-china.acl from Android client for 500 times for each RE engine. The result is as follows (sorted by total time):

Engine	Compile (ms)	Avg. per run (ms)	Total (C+R*500)
Google RE2	86.596	0.075	124.119
GNU regex (512 par.)	117.862	0.053	144.387
GNU regex (all par.)	370.657	0.028	384.431
PCRE	48.829	26.369	13233.336
PCRE-DFA	51.140	29.904	15003.102
OpenResty sregex	60.155	67.551	33835.438

The source code can be found in the acl-bench branch. I actually ported different regex engines for acl.c and then added a benchmark in place. One can add --with-acl-backend= when doing the configure. The current available backends are re2,sregex,grep,dfa and default(pcre).

Some conclusions from the benchmark:

Sregex: So, let's do not consider sregex any more until its DFA engine is released. It is very slow without JIT. One lesson I got is that a naïve NFA engine without proper optimization should never be considered, even it is non-backtracking.
Google RE2: It is the fastest. But I still do not recommend it for two reasons:
1. It is a C++ library.
2. It requires to set max_mem in advance. If the user input ACL rules exceed the max_mem, the engine will silently crash. I once experienced this issue on Android and spent a whole night to debug.
GNU grep's regex engine: I believe this is the best alternative library. I made a mistake about its algorithm. It actually uses the same lazy DFA algorithm as RE2. Thus, it is non-backtracking and also runs faster than NFA engines.

It is also widely used in other GNU projects, e.g., gawk, gnulib, etc. The library is very small and easy to embeded, only contains 6 files. I took those files directly from the Git project (compat/regex) without changing a single line.

The only issue that I met is the DFA pre-compilation time. If we concatenate all ACL rules into a single and large RE, the per match time is even shorter that RE2, ~~but the compilation time will be very long, about 2s on my test machine~~ (the "all par." case in the table). The fewer rules we combine, the longer per match time. We need to make a balance between the two. So I set a constant RE_PAR_SIZE to control how many rules to match on parallel at once. For the testing ACL file, I find it is a good balance to set it to 512.

Edit: I changed the code a bit and the compilation time becomes significantly shorter (2s to 300ms). Namely, I add RE_NO_SUB option as we do not need subgroup matching. And I also add fastmap option to speed up searching.
PCRE's original and DFA engine: Not quite good, 100x slower than RE2. I think we should better consider switching to another engine. For the PCRE DFA engine, I find an interesting issue. When I tried to concatenate rules using the union operator |, the more rules I combined, the worse performane I will get, e.g., if combining 3 rules, the total time will be 3x longer. This is really surprising. I think the issue might be that pcre_dfa_exec also does some compilation internally. But it is an atomic API, that I cannot lift the operation out of the loop.

wongsyrone commented 6 years ago

I changed the code a bit and the compilation time becomes significantly shorter (2s to 300ms). Namely, I add RE_NO_SUB option as we do not need subgroup matching. And I also add fastmap option to speed up searching.

I recommend send changes upstream and easier for the future upgrade instead of keeping local changes.

madeye commented 6 years ago

@linusyang Thanks for this comprehensive benchmark!

Please submit a pull request for GNU regex. GPL licensed library is always preferred here.

madeye commented 6 years ago

@linusyang BTW, any interest in working for NVIDIA? It would be great if you can join us. 😄

linusyang commented 6 years ago

@wongsyrone I only change the client code, not the library itself. 😂 I was not familiar with its API and caused the slowdown. We can just use the stock regex library.

@madeye No problem. I will clean up the code and submit a patch. Also, thanks for the job invitation. I will submit a CV when I start the job seeking.

Mygod commented 6 years ago

Very nice! Based on the data, I guess the overhead for PCRE isn't too big but I can foresee that this can save a considerable amount of battery for mobile devices on heavy load. A compiling time of 300ms is also very acceptable considering it's one-time.

Sregex: So, let's do not consider sregex any more until its DFA engine is released. It is very slow without JIT. One lesson I got is that a naïve NFA engine without proper optimization should never be considered, even it is non-backtracking.

That's exactly what I was suggesting yesterday. 😉 If you had read about Thompson's construction and NFA emulation, you would find they are too straightforward to be efficient. Therefore, DFAs are definitely preferred here.

It also would be interesting to figure out why PCRE-DFA performs just like PCRE. The compiling time is suspiciously short for DFA compared to similar implementations. I'm suspecting that they didn't do a full DFA implementation.

Mygod commented 4 years ago

@linuxyang Not sure about the time when you tested. According to re2.h:

    // The RE2 memory budget is statically divided between the two
    // Progs and then the DFAs: two thirds to the forward Prog
    // and one third to the reverse Prog.  The forward Prog gives half
    // of what it has left over to each of its DFAs.  The reverse Prog
    // gives it all to its longest-match DFA.
    //
    // Once a DFA fills its budget, it flushes its cache and starts over.
    // If this happens too often, RE2 falls back on the NFA implementation.

hosiet commented 1 year ago

This may have come too late, but Debian decided to kill old pcre3 that shadowsocks-libev currently uses. Any porting effort out of pcre3 would be appreciated, otherwise ss-libev will get removed from Debian (and later its derivatives).

Ref: https://lists.debian.org/debian-devel/2023/06/msg00383.html

shadowsocks / shadowsocks-libev

Upgrade to pcre2 #1788