samth / test-bugs

2 stars 0 forks source link

extremely slow regexp matching #125

Open samth opened 12 years ago

samth commented 12 years ago
Originally submitted on: Wed Feb 29 13:40:01 -0500 2012

This program seems to loop forever.

#lang racket
(define h 
">Number:         105\n>Category:       mcmicmac\n>Synopsis:       language levels\n>Confidential:   no\n>Severity:       serious\n>Priority:       medium\n>Responsible:    shriram\n>State:          closed\n>Class:          sw-bug\n>Submitter-Id:   unknown\n>Arrival-Date:   Mon May 19 14:35:02 -0400 1997\n>Last-Modified:  Wed Apr 28 11:02:13 -0400 2004\n>Originator:     Mark W. Krentel\n>Release:        unknown\n\n")
(define r 
  #px">Number:\\s+(.*)\n>Category:\\s+(.*)\n>Synopsis:\\s+(.*)\n>Confidential:\\s+(.*)\n>Severity:\\s+(.*)\n>Priority:\\s+(.*)\n>Responsible:\\s+(.*)\n>State:\\s+(.*)\n>Class:\\s+(.*)\n>Submitter-Id:\\s+(.*)\n>Arrival-Date:\\s+(.*)\n>Closed-Date:\\s+(.*)\n>Last-Modified:\\s+(.*)\n>Originator:\\s+(.*)\n>Release:")
(regexp-match r h)
Release:

5.2.1.7--2012-02-29(3267738/g)

Environment:

unix "Linux loki 3.0.0-15-generic #26-Ubuntu SMP Fri Jan 20 17:23:00 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux" (x86_64-linux/3m) (get-display-depth) = 32 Human Language: english (current-memory-use) 427125968 Links: (links) = ("gcstat" "raco-git" "assignments" "var" "disassemble" "book" "class"); (links #:user? #f) = (); (links #:root? #t) = (); (links #:user? #f #:root? #t) = ()

Collections: ("/home/samth/sw/plt/add-on/5.2.1.7/collects" ("info-domain")) ("/home/samth/sw/plt/collects" ("datalog" "scriblib" "html" "racklog" "drracket" "sqr.png" "dynext" "honu" "2htdp" "graphics" "make" "eopl" "test-box-recovery" "hierlist" "racket" "rackunit" "openssl" "r5rs" "algol60" "mzcom" "redex" "texpict" "swindle" "defaults" "trace" "combinator-parser" "mysterx" "info-domain" "meta" "teachpack" "setup" "repo-time-stamp" "games" "r6rs" "trig.png" "icons" "typed-scheme" "test-engine" "lazy" "stepper" "at-exp" "planet" "tests" "rnrs" "db" "web-server" "framework" "net" "scribblings" "string-constants" "help" "browser" "tex2page" "mzlib" "parser-tools" "errortrace" "data" "sirmail" "plot" "launcher" "handin-client" "syntax" "profile" "s-exp" ".gitignore" "slideshow" "plai" "htdp" "typed-racket" "scheme" "syntax-color" "guibuilder" "drscheme" "raco" "srfi" "reader" "preprocessor" "compiler" "config" "mred" "handin-server" "schemeunit" "typed" "macro-debugger" "deinprogramm" "afm" "ffi" "gui-debugger" "readline" "scribble" "unstable" "picturing-programs" "file" "sgl" "im! ages" "wxme" "xrepl" "lang" "xml" "mrlib" "version" "frtime" "mzscheme" "embedded-gui" "slatex"))

Computer Language: (("Determine language from source") (#(#t print mixed-fraction-e #f #t test-coverage) (default) #() "#lang racket\n" #f #t))

This bug was converted from Gnats bug 12609
racket-bug-submit commented 12 years ago
On Wed, 29 Feb 2012 14:44:23 -0500, eli@barzilay.org wrote:

An hour ago, samth@ccs.neu.edu wrote:

This program seems to loop forever. [...]

  1. It's not. If you'd try to minimize it, you'd see that it's just very slow.
     (let loop ([i 0]
                [h ">Release:        unknown\n\n"]
                [r ">Closed-Date:\\s+(.*)\n>Release:"])
       (define px (pregexp r))
       (printf "~a. " i)
       (time (regexp-match px h))
       (loop (add1 i)
             (string-append ">Number:         105\n" h)
             (string-append ">Number:\\s+(.*)\n" r)))
  1. This is some pathological case where it works hard to eventually fail (note the lack of `Closed-Date' in the string, which is in your example).
  2. Unrelated, it's a bad way to parse bug files. I don't think that there's any guarantee about their order, and certainly not about the existence of all fields. `Closed-Date' is a good hint for this too.
samth commented 12 years ago
On Wed, 29 Feb 2012 14:53:26 -0500, samth@ccs.neu.edu wrote:
On Wed, Feb 29, 2012 at 2:44 PM, Eli Barzilay <eli@barzilay.org> wrote:
> An hour ago, samth@ccs.neu.edu wrote:
>> This program seems to loop forever.
>> [...]
>
> 1. It's not.  If you'd try to minimize it, you'd see that it's just
>   very slow.
>
>     (let loop ([i 0]
>                [h ">Release:        unknown\n\n"]
>                [r ">Closed-Date:\\s+(.*)\n>Release:"])
>       (define px (pregexp r))
>       (printf "~a. " i)
>       (time (regexp-match px h))
>       (loop (add1 i)
>             (string-append ">Number:         105\n" h)
>             (string-append ">Number:\\s+(.*)\n" r)))

Ah, I see. Thanks. But, ...

  1. This is some pathological case where it works hard to eventually   fail (note the lack of `Closed-Date' in the string, which is in   your example).

Why is it so slow on this? This is a fairly short string, it shouldn't take this long.

racket-bug-submit commented 12 years ago
On Wed, 29 Feb 2012 15:07:24 -0500, eli@barzilay.org wrote:

10 minutes ago, Sam Tobin-Hochstadt wrote:

On Wed, Feb 29, 2012 at 2:44 PM, Eli Barzilay eli@barzilay.org wrote:

  1. This is some pathological case where it works hard to eventually fail (note the lack of `Closed-Date' in the string, which is in your example).

Why is it so slow on this? This is a fairly short string, it shouldn't take this long.

(I think that it's the inefficient way in which .* fails, and each level multiplies that work.)

samth commented 12 years ago
On Wed, 29 Feb 2012 15:14:18 -0500, samth@ccs.neu.edu wrote:

On Wed, Feb 29, 2012 at 3:07 PM, Eli Barzilay eli@barzilay.org wrote:

10 minutes ago, Sam Tobin-Hochstadt wrote:

On Wed, Feb 29, 2012 at 2:44 PM, Eli Barzilay eli@barzilay.org wrote:

  1. This is some pathological case where it works hard to    eventually fail (note the lack of `Closed-Date' in the string,    which is in your example).

Why is it so slow on this?  This is a fairly short string, it shouldn't take this long.

(I think that it's the inefficient way in which .* fails, and each level multiplies that work.)

It looks like [^\n]* is still slow. Is there something I can write here that isn't quite so slow?

racket-bug-submit commented 12 years ago
On Wed, 29 Feb 2012 17:18:06 -0500, eli@barzilay.org wrote:

Two hours ago, Sam Tobin-Hochstadt wrote:

On Wed, Feb 29, 2012 at 3:07 PM, Eli Barzilay eli@barzilay.org wrote:

10 minutes ago, Sam Tobin-Hochstadt wrote:

On Wed, Feb 29, 2012 at 2:44 PM, Eli Barzilay eli@barzilay.org wrote:

  1. This is some pathological case where it works hard to    eventually fail (note the lack of `Closed-Date' in the string,    which is in your example).

Why is it so slow on this?  This is a fairly short string, it shouldn't take this long.

(I think that it's the inefficient way in which .* fails, and each level multiplies that work.)

It looks like [^\n]* is still slow. Is there something I can write here that isn't quite so slow?

In some cases you can get reasonable performance using non-greedy operators.

I still don't see how a monolithic regexp is better than simply splitting the text into fields first (trivial in this format), then parse each field (easy too: each field is either some text on a single line, or multiline fields that follow a header line with just the field name):

  (define (parse-gnats pr-text)
    (map (λ (s)
           (cdr (or (regexp-match (if (regexp-match? #rx"\n" s)
                                    #rx"^([^:]+): *\n(.*)$"
                                    #rx"^([^:]+): *([^\n]*)$")
                                  s)
                    (error "Unexpected input:" s))))
         (cdr (regexp-split #rx"\n>" pr-text))))

(If I add the extra keywords to `regexp-match*', then this could even be shorter.)

racket-bug-submit commented 12 years ago
On Wed, 29 Feb 2012 16:28:41 -0700, ryan@cs.utah.edu wrote:

On 02/29/2012 01:14 PM, Sam Tobin-Hochstadt wrote:

On Wed, Feb 29, 2012 at 3:07 PM, Eli Barzilayeli@barzilay.org wrote:

10 minutes ago, Sam Tobin-Hochstadt wrote:

On Wed, Feb 29, 2012 at 2:44 PM, Eli Barzilayeli@barzilay.org wrote:

  1. This is some pathological case where it works hard to eventually fail (note the lack of `Closed-Date' in the string, which is in your example).

Why is it so slow on this? This is a fairly short string, it shouldn't take this long.

(I think that it's the inefficient way in which .* fails, and each level multiplies that work.)

It looks like [^\n]* is still slow. Is there something I can write here that isn't quite so slow?

Not a solution, but you might find parts of this article interesting:

http://swtch.com/~rsc/regexp/regexp1.html

Ryan