Closed danmoseley closed 2 years ago
That API is doing a full match. If I test ^(a*?)(a*?)$
in https://regex101.com/ I get the same result. Just the capturing groups are returned, btw.
return a match if the whole string matches the regular expression.
moreover, the API that has the behavior you expect is find
which would return two empty groups and the first match would be 0 .. -1
which means there is an empty match at index 0 (there is a field .boundaries
that returns the whole match boundaries, instead of it being part of the groups).
Also, nim-regex returns all group repetitions, not just the last one in some of the APIs. This is a feature I'll remove at some point.
Ah, thanks for the explanation.
Also, nim-regex returns all group repetitions, not just the last one in some of the APIs. This is a feature I'll remove at some point.
If I understand correctly, this is what .NET calls Capture (Match being the whole thing, Group for each of the capturing groups in the match). I wasn't previously aware of any other engine that did this. I assume it's useful to some people, but haven't researched to see how commonly.
Oh, in the other difference in behavior I have seen so far, I believe in this case Nim is correct: https://github.com/dotnet/runtime/issues/62104
Do you have thoughts?
If I understand correctly, this is what .NET calls Capture (Match being the whole thing, Group for each of the capturing groups in the match).
capturing groups are supported by most regex engines. What I was referring to is capturing group repetitions. For example match("abc", re("(\w)+"))
will return @[@["a"], @["b"], @["c"]]
instead of just @["c"]
like in most regex engines. The match macro API will return the last match, though.
Oh, in the other difference in behavior I have seen so far, I believe in this case Nim is correct: dotnet/runtime#62104
I believe JS behavior is correct. If you test ^(\w*)*$
you would expect the group to match aaab
, why would lookbehind behavior be different?
oh, I tested the pattern in https://regex101.com/ and the group is empty, interesting.
If you test ^(\w)$ you would expect the group to match aaab
So, JS group match aaab
, while PCRE and Python group match is empty.
nim-regex follows PCRE behavior so the lookbehind behavior may be a bug. But it follows JS behavior for lookarounds, so not sure. I need to dig into it.
capturing groups are supported by most regex engines. What I was referring to is capturing group repetitions. For example match("abc", re("(\w)+")) will return @[@["a"], @["b"], @["c"]] instead of just @["c"] like in most regex engines.
yes, that's what I meant. It seems that Nim and .NET support this. @stephentoub just curious do we have data on how often it's used in practice? It might be difficult to say.
Oh, in the other difference in behavior I have seen so far, I believe in this case Nim is correct: dotnet/runtime#62104
Looking at this test, it's the expected behavior. The rational is lookbehind is matched in reverse, so greediness is reversed.
If you test ^(\w)$ you would expect the group to match aaab, why would lookbehind behavior be different?
This is what I would expect, but then again, non-fixed lookarounds are only supported by JS and DLang. JS has this behavor so that's what I followed. Even though I don't I agree with it right now.
yes, that's what I meant. It seems that Nim and .NET support this. @stephentoub just curious do we have data on how often it's used in practice? It might be difficult to say.
I think it's a terrible feature. It makes nim-regex take O(N*M) space where N is the length of the text/string to match and M the length of the regex. Removing it will make it take O(M*C) space, where M is the length of the regex and C the number of capture groups.
https://github.com/nitely/nim-regex/blob/eeefb4f51264ff3bc3b36caf55672a74f52f5ef5/tests/tests.nim#L193
I'm porting Nim tests to improve the test bed for .NET's regexes (thanks!) and I found a discrepancy in behavior in this one.
I believe the first match should be at index 0, and the two capturing groups, being lazy, should be empty.
This is the behavior of .NET, and also other engines I tested using https://regex101.com/. From this test, it looks like in Nim the second capturing group captures the whole string. This is the behavior I would expect from
(a*?)(a*)
I didn't debug further, so perhaps I'm misunderstanding the test.