tweag / nickel

Better configuration for less
https://nickel-lang.org/
MIT License
2.43k stars 93 forks source link

`std.string.find` doesn't properly handle optional capture groups #2108

Closed jneem closed 2 days ago

jneem commented 3 days ago

The regex (a+)(b)? should match the string "aa", and std.string.is_match agrees. But:

nickel> std.string.find "(a+)(b)?" "aa"
{ groups = [  ], index = -1, matched = "", }
jneem commented 3 days ago

I looked into this a little, and the problem is that the grapheme-cluster-checking code filters out matches having any empty captures. (There's another place where we're filtering out unmatched capture groups, which is also a problem because it messes up indexing)

Overall, this shouldn't be too hard to fix. One potential issue is how to represent missing optional captures in the return value. Empty strings is the obvious choice that matches the current type signature, but it means there's no way to distinguish an optional capture that matched an empty string from an optional capture that didn't match.

yannham commented 3 days ago

Ah, this isn't great. This might mandate a patch version.

there's no way to distinguish an optional capture that matched an empty string from an optional capture that didn't match

My brain might be foggy, but I'm having a hard time seeing the difference between those two in practice. Do you have examples of both an optional capture that match an empty string and an optional capture that didn't match, to see the difference more clearly?

As a side note, we want to change the type of find anyway to use an optional-like ADT as a return value. Once we move to Nickel 2.0, we can do both changes at once (and have groups be an ADT as well if needed). In the meantime, I don't see an easy solution that doesn't break backward compatibility. While on paper adding a field in the output to the result should be more or less ok on the value side (we could have, I don't know, a list of indices of one of the two instances of the empty string cases to disambiguate), adding a field to the return type isn't backward compatible.

jneem commented 3 days ago

Do you have examples of both an optional capture that match an empty string and an optional capture that didn't match, to see the difference more clearly?

I can't think of an example where you'd care about the difference, but one artificial example is (b(a*)|c(a*)). This has 3 capture groups (not counting the implicit one for the whole match), and if you're matching the string "b" then capture group 1 matched "b", capture group 2 matched the empty string, and capture group 3 didn't match anything.

Once we move to Nickel 2.0, we can do both changes at once

Agreed, but I also want to solve this now so that I can parse semvers :smile:

yannham commented 3 days ago

Thanks for the example. I would vote for using empty strings, even if indeed this feels a bit wrong and should probably be Option String or Nullable String in the future, but it looks like it's the best we can do without breaking changes.

Agreed, but I also want to solve this now so that I can parse semvers 😄

Right, I'm talking about breaking changes here - of course we must fix the current issue in one way or another :slightly_smiling_face: