swiftlang / swift-experimental-string-processing

An early experimental general-purpose pattern matching engine for Swift.
Apache License 2.0
278 stars 47 forks source link

[WIP] Regex lookbehind support #760

Open JacobHearst opened 2 months ago

JacobHearst commented 2 months ago

A mostly complete implementation of lookbehind support for Regex via reverse matching.

Forewarning

I'm a casual user of regex, not an expert. I wanted lookbehinds in Swift and was willing to dive into a project I knew nothing about. As a result, I may have made assumptions or choices that don't seem to make sense to more knowledgeable readers.

Some notes

natecook1000 commented 2 months ago

Thanks so much for this contribution, @JacobHearst!

Two additional notes outside of the implementation:

  1. We'll need to have a discussion about whether and how to limit the kinds of patterns that can be used inside a lookbehind. Some other regex engines only allow fixed-length patterns (Perl, Python, e.g. /(?<=aa)b/) or finite-length patterns (Java, e.g. /(?<=aa?)b/), while others are more open-ended (.NET, Javascript). If we do want to limit the kinds of patterns, we might need an additional protocol to constrain the kinds of patterns that can appear in a Lookbehind or NegativeLookbehind regex builder block. (more info)

  2. We'll also need to look at how to manage availability with literals, since older Swift runtimes won't have the necessary support for lookbehinds. (This is an issue that we would need to manage with support for any additional regex syntax — this is just the first one to raise the question.)

milseman commented 2 months ago
  1. I think we want .NET/JS style lookbehind, that is you have the full expressive power of regex inside. This PR includes the standard optimizations we have (such as a run of scalars, compile time character classes, or quantifications of scalars or CCs).

  2. That's a good point about availability, I'll find someone to follow up with.

I'm a little concerned about our bounds checking, as that's where we've introduced bugs in the past. I see that this checks for the start-of-string, but what about the start-of-substring for repeated searches? If we want to enumerate all the matches in a string, then lookbehind should be able to view previously-matched text, but we don't want to it to look outside of a Substring's bounds. This seems analogous to anchors, what do you think @natecook1000

milseman commented 2 months ago

Here's a couple examples of nesting (https://regex101.com/r/ZnXq00/1). We'll want to expand this testing to cover even deeper nesting, test nesting alternations, custom character classes, etc., to make sure we get full coverage of our complex constructs:

input = "abcdefg"

// firstMatch results:

/abcd(?<=c(?=d)d)/     // abcd
/abcd(?<=cd(?=d).)/    // no match (. consumed the d)
/abcd(?<=c(?=e)d)/     // no match (local position expecting a lookahead of d)
/abcd(?<=bc(?=de)d)/   // abcd
/abcd(?<=bc(?=de).)/   // abcd

I'd suggest switching to the extended syntax (https://github.com/swiftlang/swift-evolution/blob/main/proposals/0354-regex-literals.md#extended-delimiters-- ) or (https://github.com/swiftlang/swift-evolution/blob/main/proposals/0355-regex-syntax-run-time-construction.md#matching-options) for the more complex nesting at some point

milseman commented 2 months ago

Here's what we need to ultimately ship:

  1. Expand testing. The large amount of semi-duplicated code in the engine for handling reverse match variants is acceptable if we have really good testing.
  2. Reject custom regex consumers in the builder syntax from being embedded in a lookbehind, as these are not reversible.
  3. Limit availability. We need to error out at parse time for literals that use lookbehind when targeting an OS that may not support them.
milseman commented 2 months ago

@JacobHearst could you rebase this work on top of the current swift/main?

@natecook1000 when I try to build these tests, sometimes it works and sometimes I get the error:

Undefined symbol: static RegexBuilder.AlternationBuilder.buildExpression<A where A: _StringProcessing.RegexComponent>(A) -> A
... 31 more errors

have you seen this before?