python / cpython

The Python programming language
https://www.python.org/
Other
60.02k stars 29.05k forks source link

re match becomes exponentially slow on larger inputs #32972

Closed cb96e7a7-b820-4fd0-be0b-8e3c8b0553ee closed 23 years ago

cb96e7a7-b820-4fd0-be0b-8e3c8b0553ee commented 23 years ago
BPO 212521
Nosy @tim-one, @chmarr

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

Show more details

GitHub fields: ```python assignee = None closed_at = created_at = labels = ['extension-modules', 'invalid'] title = 're match becomes exponentially slow on larger inputs' updated_at = user = 'https://github.com/chmarr' ``` bugs.python.org fields: ```python activity = actor = 'chriscog' assignee = 'none' closed = True closed_date = None closer = None components = ['Extension Modules'] creation = creator = 'chriscog' dependencies = [] files = [] hgrepos = [] issue_num = 212521 keywords = [] message_count = 4.0 messages = ['1082', '1083', '1084', '1085'] nosy_count = 2.0 nosy_names = ['tim.peters', 'chriscog'] pr_nums = [] priority = 'normal' resolution = 'not a bug' stage = None status = 'closed' superseder = None type = None url = 'https://bugs.python.org/issue212521' versions = [] ```

cb96e7a7-b820-4fd0-be0b-8e3c8b0553ee commented 23 years ago

The following code will run very slowly, using either python 1.5.2 or 1.6b1:

----8\<----

import re

s = """

arp ARP commands class Classifier commands dns DNS commands email Email commands event User event commands frame Frame Relay commands group Group configuration commands help On-Line help facility hl Host list configuration commands hostdb Host database commands image Image commands links Link commands look Withdraw touch access (go back to look-only access) measure Measurement commands mib MIB commands net Network statistics commands partition Bandwidth partition commands policy Policy commands reset Reset the PacketShaper

"""

r = re.compile ( r"\n\n(\s+.+\n)+\n\n$" )

mo = r.match ( s )

if mo:
        print "groups = %s" % mo.groups()
else:
        print "no match"
----8<

The above program takes 11 seconds to run on my system. FOr each line that I add in 's', the time DOUBLES. (And halves if I remove lines)

tim-one commented 23 years ago

By its very nature, this regular expression will match quickly when it *does match, but consume enormous amounts of time when it doesn't match. Sorry, but "tough luck" is the only answer here. O'Reilly publishes a very good book, "Mastering Regular Expressions", by Jeffrey Friedl, that explains why in detail. Don't have time to write a book here \<wink>, but, as a general hint, whenever you have nested repetition ("+" inside a "+" group, etc), unless you know exactly what you're doing you're going to cause *any "NFA" regexp engine to take exponential time in the cases the regexp doesn't match.

Bring it up on comp.lang.python! I'm sure someone will take the time to show you how to write the regexp in a way that works better. Or buy the book. Or don't use regexps for this problem to begin with.

tim-one commented 23 years ago

BTW, try this:

r = re.compile (r"\n\n(^[ \t][^\n]+\n)+\n\n$",
                re.MULTILINE)

If you want to know why I did that, buy the book!

cb96e7a7-b820-4fd0-be0b-8e3c8b0553ee commented 23 years ago

Ah yes, I understand what's going on, here. For those that are just tuning in, the part of the re that is taking so much time is this:

\s+.+

The problem is that \s+ can reduce one or more spaces, and the .+ reduces the rest. Of course, the RE engine doesnt know how many spaces the \s should consume, and tries every combination exhaustively. This doesnt take much time in itself, but when coupled with the (...)+ group, its instantly of the order n^2

I can solve this problem by changing the \s+ to simply \s, since I dont really care if I match one or more \s, as long as there's at least one. tim_one suggested using [ \t] instead, since it makes sure I dont gobble a \n either.

As an ancillary, I think there's been recent optimisations in the pcre code (external to python), as I cannot reproduce the problem on my box at home, which also uses python 1.5.2. I can only guess that it was compiled with a later version of PCRE.