swiftlang / swift-experimental-string-processing

An early experimental general-purpose pattern matching engine for Swift.
Apache License 2.0

Regex with positive lookahead crashes at runtime when accessing match.output #713

Closed AndreasVerhoeven closed 7 months ago

AndreasVerhoeven commented 9 months ago

Description

Using a regex with a positive lookahead sometimes crashes at runtime. See the example in the reproduction below.

Reproduction

let regex = /(?=([1-9]|(a|b)))/
let input = "Something 9a"
let matches = input.matches(of: regex)
for match in matches {
    print(match.output) // accessing `.output` here crashes at runtime: Thread 1: EXC_BREAKPOINT (code=1, subcode=0x225246848)
}

Stack dump

Thread 1: EXC_BREAKPOINT (code=1, subcode=0x225246848)

Expected behavior

No crash

Environment

swift-driver version: 1.87.1
Apple Swift version 5.9 (swiftlang-5.9.0.128.108 clang-1500.0.40.1)
Target: arm64-apple-macosx14.0

natecook1000 commented 8 months ago

This also reproduces with just /(?=(9))/ for the regex.

Don't have a fix yet, but found the cause... The issue appears to be that a positive lookahead is implemented as:

      ...
0:    save(restoringAt: success)
1:    save(restoringAt: intercept)
2:    <sub-pattern>    // any failure restores at 'intercept'
3:    clearThrough(intercept) // remove intercept and any leftovers from <sub-pattern>
4:    fail             // -> 'success'
5:  intercept:
6:    clearSavePoint   // remove 'success' restore point 
7:    fail             // propagate failure
8:  success:
      ...

The fail at (4) is the path of success through the lookahead. That instruction pops the save point pushed at (0): it drops the position (and other state) back to where it was at the start of the lookahead pattern and moves the instruction pointer to success at (8), where pattern matching continues. Unfortunately, the state restoration in the fail also resets the capture group information, erasing any capture data that was saved while matching the lookahead pattern.

When you try to access the match output, the loss of that capture data causes a runtime failure, since any successful match must have both the overall range (empty in this case) and the capture formed during the lookahead (which is just the 9 in this simplified regex).
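
To make the mechanism concrete, here is a minimal toy model of the save/fail sequence above. It is a sketch, not the engine's actual code: `SavePoint`, `ToyMatcher`, `runLookahead`, and the `keepingCaptures` flag are all hypothetical names invented for illustration, and the flag exists only to contrast the two behaviors, not to suggest how the project fixed the bug.

```swift
// Toy model of the failure mode described above. These are NOT the
// engine's real types; every name here is hypothetical.
struct SavePoint {
    let resumeAt: String              // label to jump to when restored
    let position: Int                 // input position when pushed
    let captures: [Int: Range<Int>]   // capture state when pushed
}

struct ToyMatcher {
    var position = 0
    var captures: [Int: Range<Int>] = [:]
    var savePoints: [SavePoint] = []

    // Pushes a save point that snapshots the current position and captures.
    mutating func save(restoringAt label: String) {
        savePoints.append(
            SavePoint(resumeAt: label, position: position, captures: captures))
    }

    // Pops the most recent save point and restores state from it.
    // With `keepingCaptures` false, captures are rolled back too, which
    // mirrors the behavior described in the comment above.
    @discardableResult
    mutating func fail(keepingCaptures: Bool = false) -> String? {
        guard let savePoint = savePoints.popLast() else { return nil }
        position = savePoint.position
        if !keepingCaptures { captures = savePoint.captures }
        return savePoint.resumeAt
    }
}

// Walks steps (0) through (4) for /(?=(9))/ at the "9" in "Something 9a".
func runLookahead(keepingCaptures: Bool) -> ToyMatcher {
    var matcher = ToyMatcher()
    matcher.position = 10                        // offset of "9" in the input
    matcher.save(restoringAt: "success")         // (0)
    matcher.save(restoringAt: "intercept")       // (1)
    matcher.captures[1] = 10..<11                // (2) sub-pattern records capture 1
    matcher.position = 11                        //     and consumes the "9"
    matcher.savePoints.removeLast()              // (3) clearThrough(intercept)
    matcher.fail(keepingCaptures: keepingCaptures) // (4) restore 'success'
    return matcher
}

let lost = runLookahead(keepingCaptures: false)
print(lost.captures[1] as Any)   // capture 1 is gone after the restore

let kept = runLookahead(keepingCaptures: true)
print(kept.captures[1] as Any)   // capture 1 survives in the contrast case
```

In both runs the position returns to 10, which is what a lookahead needs. The only difference is whether the capture recorded at (2) survives the restore at (4); in the first run it does not, which corresponds to the missing capture data that `match.output` later trips over.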