mrabarnett / mrab-regex

Other
437 stars 49 forks source link

Speeds of repeat searching #215

Open mrabarnett opened 8 years ago

mrabarnett commented 8 years ago

Original report by animalize (Bitbucket: animalize, GitHub: animalize).


I don't know whether these speeds are good.

Simple repeat, 0.00015 second:

#!python

regex.search(r'AA{20000}', 'A' * 20000)

Capture group repeat, 9.1 second:

#!python

regex.search(r'(A)\1{20000}', 'A' * 20000)

Recursive repeat, 56.7 second:

#!python

regex.search(r'(A)(?1){20000}', 'A' * 20000)

Tested on regex 2016.06.19, Python 3.5 64bit, Windows 10.

mrabarnett commented 8 years ago

Original comment by Matthew Barnett (Bitbucket: mrabarnett, GitHub: mrabarnett).


Sometimes you're going to get long runtimes due to backtracking.

Sometimes it's easy to see that a regex won't match in a certain position, so it can short-circuit, which is what's happening in the first example, but other times it's not so easy.

It's all a question of how much effort should be expended to look for a shortcut.

You might say that it wouldn't take much more effort to spot a shortcut in the second example, or the third example, but at each step, someone will be able to provide another example that would need only a little bit more difficult than the last, and this could go on indefinitely!

And there's already a lot of code in the module...

mrabarnett commented 8 years ago

Original comment by animalize (Bitbucket: animalize, GitHub: animalize).


I see. I didn't know there is a short-circuit, so I felt a little strange about the speeds.

Just tried, re module has a short-circuit in the second example. But this is not a vital problem in my opinion.