gnosygnu / xowa

xowa offline wiki application
Other
379 stars 40 forks source link

Scribunto: Optimize regex calls for long-running functions #749

Open gnosygnu opened 4 years ago

gnosygnu commented 4 years ago

This issue is related to #737.

At the moment, I can only see 3 ways to "fix" this. Listing from most ideal, to least ideal:

The last option looks like the best "least worst" way to go. Unfortunately, my JNI experience is limited and my C++ even more so. This could take a while...

desb42 commented 4 years ago

I have been doing some experimentation on the website https://regex101.com/ This contains a version (not the latest) of PCRE

my test regex is

^[a-z][a-z]?[a-z]?[a-z]?[a-z]?[a-z]?[a-z]?[a-z]?[a-z]?[a-z]?[a-z]?[a-z]?[a-z]?[a-z]?[a-z]?[a-z]?[a-z]?$

That's the equivalent of ^[a-z]{1,17}$

If I start with a test string that is less than 18 characters, everything is fine

However on the 18th character, the regex engine reports an error catastrophic backtracking

Another way to demonstrate this is to start with the regex ^[a-z][a-z]?$ and a test string of 18 characters

With this starting example, the number of step shown is 5

Adding [a-z]? to the regex so it looks like ^[a-z][a-z]?[a-z]?$ gives 9 steps

Progressively adding [a-z]? to the regex, the step count shows as 17, 33, 65, 129, 257, 513, 1025, 2049, 4097, 8193, 16385, 32769, 65537

Note a doubling of steps for each addition.

This regex engine stops (or gives an error) above 65537

I think that the PHP implementation of PCRE follows in this vain - in other words it counts the steps and when it reaches too many, fails (so no everlasting loop)

Perhaps some sort of counting could be introduced to the java implementation?

gnosygnu commented 4 years ago

Interesting! Thanks for the site link!

However on the 18th character, the regex engine reports an error catastrophic backtracking

I gave it a try now, and see the catastrophic backtracking error as well

Note a doubling of steps for each addition.

Yup. Nice way to confirm the theory

Perhaps some sort of counting could be introduced to the java implementation?

Let me think about it. The change would have to occur in the LuaJ pattern matching. There may be a way to handle simple cases like ^[a-z]{1,17}$ but any variation would be difficult (like ^[a-z]{1,10}[a-m][a-p][a-z]{1,5}$

Thanks!