Closed jneem closed 2 days ago
I looked into this a little, and the problem is that the grapheme-cluster-checking code filters out matches having any empty captures. (There's another place where we're filtering out unmatched capture groups, which is also a problem because it messes up indexing)
Overall, this shouldn't be too hard to fix. One potential issue is how to represent missing optional captures in the return value. Empty strings is the obvious choice that matches the current type signature, but it means there's no way to distinguish an optional capture that matched an empty string from an optional capture that didn't match.
Ah, this isn't great. This might mandate a patch version.
there's no way to distinguish an optional capture that matched an empty string from an optional capture that didn't match
My brain might be foggy, but I'm having a hard time seeing the difference between those two in practice. Do you have examples of both an optional capture that match an empty string and an optional capture that didn't match, to see the difference more clearly?
As a side note, we want to change the type of find
anyway to use an optional-like ADT as a return value. Once we move to Nickel 2.0, we can do both changes at once (and have groups be an ADT as well if needed). In the meantime, I don't see an easy solution that doesn't break backward compatibility. While on paper adding a field in the output to the result should be more or less ok on the value side (we could have, I don't know, a list of indices of one of the two instances of the empty string cases to disambiguate), adding a field to the return type isn't backward compatible.
Do you have examples of both an optional capture that match an empty string and an optional capture that didn't match, to see the difference more clearly?
I can't think of an example where you'd care about the difference, but one artificial example is (b(a*)|c(a*))
. This has 3 capture groups (not counting the implicit one for the whole match), and if you're matching the string "b" then capture group 1 matched "b", capture group 2 matched the empty string, and capture group 3 didn't match anything.
Once we move to Nickel 2.0, we can do both changes at once
Agreed, but I also want to solve this now so that I can parse semvers :smile:
Thanks for the example. I would vote for using empty strings, even if indeed this feels a bit wrong and should probably be Option String
or Nullable String
in the future, but it looks like it's the best we can do without breaking changes.
Agreed, but I also want to solve this now so that I can parse semvers 😄
Right, I'm talking about breaking changes here - of course we must fix the current issue in one way or another :slightly_smiling_face:
The regex
(a+)(b)?
should match the string "aa", andstd.string.is_match
agrees. But: