swiftlang / swift-experimental-string-processing

An early experimental general-purpose pattern matching engine for Swift.
Apache License 2.0
274 stars 45 forks source link

Regex parser should be wary of combining characters #303

Closed milseman closed 2 years ago

milseman commented 2 years ago

The regex parser is based over Character, which is fine, but means that some programs could put combining scalars following meta characters and those will not compare equal. We have a few options:

The latter seems simpler and there's an easy (and highly advisable!) fall back path of representing the combining scalar through an escape.

milseman commented 2 years ago

@hamishknight do you think you could throw a few tests together and implement the second alternative?

hamishknight commented 2 years ago

@milseman Sure, how do we want to classify a metacharacter scalar? Should we limit it to [](){}^$.|\ for now, or just any non-letter non-number non-\r\n ASCII? I'm inclined to say the latter, what do you think?

milseman commented 2 years ago

Hmm, we also have contextual meta-characters as well as escapes. If we have try Regex("\\d\u{301}") what should we do?

(where the \u{301} is not a regex literal scalar but a string literal scalar)

hamishknight commented 2 years ago

@milseman I believe that's already rejected due to being an unknown letter escape:

https://github.com/apple/swift-experimental-string-processing/blob/09a385bf4ffcc3d0d26e54a592ea87f9cc50c948/Tests/RegexTests/ParseTests.swift#L2340

milseman commented 2 years ago

Great! I don't know of an organic case for ASCII [^a-zA-Z0-9] with a combining scalar or anything like that. It seems like a good first approximation and we can relax it in the future. Alternatively, we could do it for all meta or contextually-meta characters.

There's also a canonical equivalent for ?, IIRC. @Azoy do you recall the details?

Azoy commented 2 years ago

There's also a canonical equivalent for ?, IIRC. @Azoy do you recall the details?

I believe there are scalars who have compatibility equivalents with ?, but I don't believe there are any canonical ones.