Open JacobHearst opened 2 months ago
Thanks so much for this contribution, @JacobHearst!
Two additional notes outside of the implementation:
We'll need to have a discussion about whether and how to limit the kinds of patterns that can be used inside a lookbehind. Some other regex engines only allow fixed-length patterns (Perl, Python, e.g. /(?<=aa)b/
) or finite-length patterns (Java, e.g. /(?<=aa?)b/
), while others are more open-ended (.NET, Javascript). If we do want to limit the kinds of patterns, we might need an additional protocol to constrain the kinds of patterns that can appear in a Lookbehind
or NegativeLookbehind
regex builder block. (more info)
We'll also need to look at how to manage availability with literals, since older Swift runtimes won't have the necessary support for lookbehinds. (This is an issue that we would need to manage with support for any additional regex syntax — this is just the first one to raise the question.)
I think we want .NET/JS style lookbehind, that is you have the full expressive power of regex inside. This PR includes the standard optimizations we have (such as a run of scalars, compile time character classes, or quantifications of scalars or CCs).
That's a good point about availability, I'll find someone to follow up with.
I'm a little concerned about our bounds checking, as that's where we've introduced bugs in the past. I see that this checks for the start-of-string, but what about the start-of-substring for repeated searches? If we want to enumerate all the matches in a string, then lookbehind should be able to view previously-matched text, but we don't want to it to look outside of a Substring
's bounds. This seems analogous to anchors, what do you think @natecook1000
Here's a couple examples of nesting (https://regex101.com/r/ZnXq00/1). We'll want to expand this testing to cover even deeper nesting, test nesting alternations, custom character classes, etc., to make sure we get full coverage of our complex constructs:
input = "abcdefg"
// firstMatch results:
/abcd(?<=c(?=d)d)/ // abcd
/abcd(?<=cd(?=d).)/ // no match (. consumed the d)
/abcd(?<=c(?=e)d)/ // no match (local position expecting a lookahead of d)
/abcd(?<=bc(?=de)d)/ // abcd
/abcd(?<=bc(?=de).)/ // abcd
I'd suggest switching to the extended syntax (https://github.com/swiftlang/swift-evolution/blob/main/proposals/0354-regex-literals.md#extended-delimiters-- ) or (https://github.com/swiftlang/swift-evolution/blob/main/proposals/0355-regex-syntax-run-time-construction.md#matching-options) for the more complex nesting at some point
Here's what we need to ultimately ship:
@JacobHearst could you rebase this work on top of the current swift/main?
@natecook1000 when I try to build these tests, sometimes it works and sometimes I get the error:
Undefined symbol: static RegexBuilder.AlternationBuilder.buildExpression<A where A: _StringProcessing.RegexComponent>(A) -> A
... 31 more errors
have you seen this before?
A mostly complete implementation of lookbehind support for Regex via reverse matching.
Forewarning
I'm a casual user of regex, not an expert. I wanted lookbehinds in Swift and was willing to dive into a project I knew nothing about. As a result, I may have made assumptions or choices that don't seem to make sense to more knowledgeable readers.
Some notes
emitReverseXYZ
functions and theiremitXYZ
counterparts.String
'sendIndex
is still a valid index even though there's no character at that index. This means that the reverse variants of these matching functions need to take a different approach. I've tried to address this in some places (reverse quantifications), but others I haven't gotten to yet (^
assertions)