samth / test-bugs

2 stars 0 forks source link

extremely slow regexp matching #79

Open samth opened 12 years ago

samth commented 12 years ago
Originally submitted on: Wed Feb 29 13:40:01 -0500 2012

This program seems to loop forever.

lang racket

(define h ">Number: 105\n>Category: mcmicmac\n>Synopsis: language levels\n>Confidential: no\n>Severity: serious\n>Priority: medium\n>Responsible: shriram\n>State: closed\n>Class: sw-bug\n>Submitter-Id: unknown\n>Arrival-Date: Mon May 19 14:35:02 -0400 1997\n>Last-Modified: Wed Apr 28 11:02:13 -0400 2004\n>Originator: Mark W. Krentel\n>Release: unknown\n\n") (define r

px">Number:\s+(.)\n>Category:\s+(.)\n>Synopsis:\s+(.)\n>Confidential:\s+(.)\n>Severity:\s+(.)\n>Priority:\s+(.)\n>Responsible:\s+(.)\n>State:\s+(.)\n>Class:\s+(.)\n>Submitter-Id:\s+(.)\n>Arrival-Date:\s+(.)\n>Closed-Date:\s+(.)\n>Last-Modified:\s+(.)\n>Originator:\s+(.)\n>Release:")

(regexp-match r h)

Release:

5.2.1.7--2012-02-29(3267738/g)

Environment:

unix "Linux loki 3.0.0-15-generic #26-Ubuntu SMP Fri Jan 20 17:23:00 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux" (x86_64-linux/3m) (get-display-depth) = 32 Human Language: english (current-memory-use) 427125968 Links: (links) = ("gcstat" "raco-git" "assignments" "var" "disassemble" "book" "class"); (links #:user? #f) = (); (links #:root? #t) = (); (links #:user? #f #:root? #t) = ()

Collections: ("/home/samth/sw/plt/add-on/5.2.1.7/collects" ("info-domain")) ("/home/samth/sw/plt/collects" ("datalog" "scriblib" "html" "racklog" "drracket" "sqr.png" "dynext" "honu" "2htdp" "graphics" "make" "eopl" "test-box-recovery" "hierlist" "racket" "rackunit" "openssl" "r5rs" "algol60" "mzcom" "redex" "texpict" "swindle" "defaults" "trace" "combinator-parser" "mysterx" "info-domain" "meta" "teachpack" "setup" "repo-time-stamp" "games" "r6rs" "trig.png" "icons" "typed-scheme" "test-engine" "lazy" "stepper" "at-exp" "planet" "tests" "rnrs" "db" "web-server" "framework" "net" "scribblings" "string-constants" "help" "browser" "tex2page" "mzlib" "parser-tools" "errortrace" "data" "sirmail" "plot" "launcher" "handin-client" "syntax" "profile" "s-exp" ".gitignore" "slideshow" "plai" "htdp" "typed-racket" "scheme" "syntax-color" "guibuilder" "drscheme" "raco" "srfi" "reader" "preprocessor" "compiler" "config" "mred" "handin-server" "schemeunit" "typed" "macro-debugger" "deinprogramm" "afm" "ffi" "gui-debugger" "readline" "scribble" "unstable" "picturing-programs" "file" "sgl" "im! ages" "wxme" "xrepl" "lang" "xml" "mrlib" "version" "frtime" "mzscheme" "embedded-gui" "slatex"))

Computer Language: (("Determine language from source") (#(#t print mixed-fraction-e #f #t test-coverage) (default) #() "#lang racket\n" #f #t))

This bug was converted from Gnats bug 12609.
racket-bug-submit commented 12 years ago

On Wed, 29 Feb 2012 14:44:23 -0500, eli@barzilay.org wrote:

An hour ago, samth@ccs.neu.edu wrote:

This program seems to loop forever. [...]

  1. It's not. If you'd try to minimize it, you'd see that it's just very slow.
(let loop ([i 0]
           [h ">Release:        unknown\n\n"]
           [r ">Closed-Date:\s+(._)\n>Release:"])
  (define px (pregexp r))
  (printf "~a. " i)
  (time (regexp-match px h))
  (loop (add1 i)
        (string-append ">Number:         105\n" h)
        (string-append ">Number:\s+(._)\n" r)))
  1. This is some pathological case where it works hard to eventually fail (note the lack of `Closed-Date' in the string, which is in your example).
  2. Unrelated, it's a bad way to parse bug files. I don't think that there's any guarantee about their order, and certainly not about the existence of all fields. `Closed-Date' is a good hint for this too.

      ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:
                http://barzilay.org/                   Maze is Life!
samth commented 12 years ago

On Wed, 29 Feb 2012 14:53:26 -0500, samth@ccs.neu.edu wrote:

On Wed, Feb 29, 2012 at 2:44 PM, Eli Barzilay eli@barzilay.org wrote:

An hour ago, samth@ccs.neu.edu wrote:

This program seems to loop forever. [...]

  1. It's not.  If you'd try to minimize it, you'd see that it's just   very slow.

    (let loop ([i 0]                [h ">Release:        unknown\n\n"]                [r ">Closed-Date:\s+(.)\n>Release:"])       (define px (pregexp r))       (printf "~a. " i)       (time (regexp-match px h))       (loop (add1 i)             (string-append ">Number:         105\n" h)             (string-append ">Number:\s+(.)\n" r)))

Ah, I see. Thanks. But, ...

  1. This is some pathological case where it works hard to eventually   fail (note the lack of `Closed-Date' in the string, which is in   your example).

Why is it so slow on this? This is a fairly short string, it

shouldn't take this long.

sam th samth@ccs.neu.edu

racket-bug-submit commented 12 years ago

On Wed, 29 Feb 2012 15:07:24 -0500, eli@barzilay.org wrote:

10 minutes ago, Sam Tobin-Hochstadt wrote:

On Wed, Feb 29, 2012 at 2:44 PM, Eli Barzilay eli@barzilay.org wrote:

  1. This is some pathological case where it works hard to eventually fail (note the lack of `Closed-Date' in the string, which is in your example).

Why is it so slow on this? This is a fairly short string, it shouldn't take this long.

(I think that it's the inefficient way in which .* fails, and each level multiplies that work.)

      ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:
                http://barzilay.org/                   Maze is Life!
samth commented 12 years ago

On Wed, 29 Feb 2012 15:14:18 -0500, samth@ccs.neu.edu wrote:

On Wed, Feb 29, 2012 at 3:07 PM, Eli Barzilay eli@barzilay.org wrote:

10 minutes ago, Sam Tobin-Hochstadt wrote:

On Wed, Feb 29, 2012 at 2:44 PM, Eli Barzilay eli@barzilay.org wrote:

  1. This is some pathological case where it works hard to    eventually fail (note the lack of `Closed-Date' in the string,    which is in your example).

Why is it so slow on this?  This is a fairly short string, it shouldn't take this long.

(I think that it's the inefficient way in which .* fails, and each level multiplies that work.)

It looks like [^\n]* is still slow. Is there something I can write

here that isn't quite so slow?

sam th samth@ccs.neu.edu

racket-bug-submit commented 12 years ago

On Wed, 29 Feb 2012 17:18:06 -0500, eli@barzilay.org wrote:

Two hours ago, Sam Tobin-Hochstadt wrote:

On Wed, Feb 29, 2012 at 3:07 PM, Eli Barzilay eli@barzilay.org wrote:

10 minutes ago, Sam Tobin-Hochstadt wrote:

On Wed, Feb 29, 2012 at 2:44 PM, Eli Barzilay eli@barzilay.org wrote:

  1. This is some pathological case where it works hard to    eventually fail (note the lack of `Closed-Date' in the string,    which is in your example).

Why is it so slow on this?  This is a fairly short string, it shouldn't take this long.

(I think that it's the inefficient way in which .* fails, and each level multiplies that work.)

It looks like [^\n]* is still slow. Is there something I can write here that isn't quite so slow?

In some cases you can get reasonable performance using non-greedy operators.

I still don't see how a monolithic regexp is better than simply splitting the text into fields first (trivial in this format), then parse each field (easy too: each field is either some text on a single line, or multiline fields that follow a header line with just the field name):

(define (parse-gnats pr-text) (map (λ (s) (cdr (or (regexp-match (if (regexp-match? #rx"\n" s)

rx"^([^:]+): \n(.)$"

                                #rx"^([^:]+): _([^\n]_)$")
                              s)
                (error "Unexpected input:" s))))
     (cdr (regexp-split #rx"\n>" pr-text))))

(If I add the extra keywords to `regexp-match*', then this could even be shorter.)

      ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:
                http://barzilay.org/                   Maze is Life!
racket-bug-submit commented 12 years ago

On Wed, 29 Feb 2012 16:28:41 -0700, ryan@cs.utah.edu wrote:

On 02/29/2012 01:14 PM, Sam Tobin-Hochstadt wrote:

On Wed, Feb 29, 2012 at 3:07 PM, Eli Barzilayeli@barzilay.org wrote:

10 minutes ago, Sam Tobin-Hochstadt wrote:

On Wed, Feb 29, 2012 at 2:44 PM, Eli Barzilayeli@barzilay.org wrote:

  1. This is some pathological case where it works hard to eventually fail (note the lack of `Closed-Date' in the string, which is in your example).

Why is it so slow on this? This is a fairly short string, it shouldn't take this long.

(I think that it's the inefficient way in which .* fails, and each level multiplies that work.)

It looks like [^\n]* is still slow. Is there something I can write here that isn't quite so slow?

Not a solution, but you might find parts of this article interesting:

http://swtch.com/~rsc/regexp/regexp1.html

Ryan