haskell-hvr / regex-tdfa

Pure Haskell Tagged DFA Backend for "Text.Regex" (regex-base)
http://hackage.haskell.org/package/regex-tdfa
Other
36 stars 9 forks source link

POSIX submatch semantics is broken in two (rare) cases #2

Open neongreen opened 4 years ago

neongreen commented 4 years ago

https://github.com/ChrisKuklewicz/regex-tdfa/issues/12, originally reported by @skvadrik


As I've been working on the implementation of POSIX-compliant capturing groups in RE2C (http://re2c.org), I discovered a couple of bugs in Regex-TDFA. I found them when fuzz-testing my implementation against Regex-TDFA-1.2.2. The algorithm used in RE2C is described in detail in the following paper:

http://re2c.org/2017_trofimovich_tagged_deterministic_finite_automata_with_lookahead.pdf

It is a slightly modified version of Laurikari algorithm. POSIX submatch semantics is due to Kuklewicz: https://wiki.haskell.org/index.php?title=Regular_expressions/Bounded_space_proposal&oldid=11475 , but I made an attempt on formalizing Kuklewicz algorithm (also described in the paper). The reported bugs are rare (fuzzer found them approximately once in 50000 runs), so they are probably caused by some mis-optimization.

The first bug can be triggered by regular expression (((a*)|b)|b)+ and input string ab: Regex-TDFA returns incorrect submatch result for second capturing group ((a*)|b) (no match instead of b at offset 1). Some alternative regular expressions that cause the same error: (((a*)|b)|b){1,2}, ((b|(a*))|b)+.

$ ghci
GHCi, version 8.0.2: http://www.haskell.org/ghc/  :? for help
Prelude> import Text.Regex.TDFA as T

The error:

Prelude T> "ab"  T.=~ "^(((a*)|b)|b)+" :: [MatchArray]
[array (0,3) [(0,(0,2)),(1,(1,1)),(2,(-1,0)),(3,(-1,0))]]

But not with * (the example below works correctly!):

Prelude T> "ab"  T.=~ "^(((a*)|b)|b)*" :: [MatchArray]
[array (0,3) [(0,(0,2)),(1,(1,1)),(2,(1,1)),(3,(-1,0))]]

The same error:

Prelude T> "ab"  T.=~ "^((b|(a*))|b)+" :: [MatchArray]
[array (0,3) [(0,(0,2)),(1,(1,1)),(2,(-1,0)),(3,(-1,0))]]

Again, the same error:

Prelude T> "ab"  T.=~ "^(((a*)|b)|b){1,2}" :: [MatchArray]
[array (0,3) [(0,(0,2)),(1,(1,1)),(2,(-1,0)),(3,(-1,0))]]

The second bug can be triggered by regular expression ((a?)(())*|a)+ and input string aa. Incorrect result is for second group (a?) (no match instead of a at offset 1), third group (()) and fourth group () (no match instead of empty match at offset 2). Alternative variant that also fails: ((a?()?)|a)+.

The error:

Prelude T> "aa"  T.=~ "^((a?)(())*|a)+" :: [MatchArray]
[array (0,4) [(0,(0,2)),(1,(1,1)),(2,(-1,0)),(3,(-1,0)),(4,(-1,0))]]

But not with * (the example below works correctly!):

Prelude T> "aa"  T.=~ "^((a?)(())*|a)*" :: [MatchArray]
[array (0,4) [(0,(0,2)),(1,(1,1)),(2,(1,1)),(3,(2,0)),(4,(2,0))]]

The same error:

Prelude T> "aa"  T.=~ "^((a?)(())*|a){1,2}" :: [MatchArray]
[array (0,4) [(0,(0,2)),(1,(1,1)),(2,(1,1)),(3,(2,0)),(4,(2,0))]]

The same error:

Prelude T> "aa"  T.=~ "^((a?()?)|a)+" :: [MatchArray]
[array (0,3) [(0,(0,2)),(1,(1,1)),(2,(-1,0)),(3,(-1,0))]]
skvadrik commented 4 years ago

@neongreen, why did you close the bug?

skvadrik commented 4 years ago

@neongreen, Apologies, I looked at the wrong label --- you've just moved it here. Sorry! Feel free to delete these useless comments.