nitely / nim-regex

Pure Nim regex engine. Guarantees linear time matching
https://nitely.github.io/nim-regex/
MIT License
225 stars 20 forks source link

Trailing lazy capturing group acting as greedy #109

Closed danmoseley closed 2 years ago

danmoseley commented 2 years ago

https://github.com/nitely/nim-regex/blob/eeefb4f51264ff3bc3b36caf55672a74f52f5ef5/tests/tests.nim#L193

I'm porting Nim tests to improve the test bed for .NET's regexes (thanks!) and I found a discrepancy in behavior in this one.

I believe the first match should be at index 0, and the two capturing groups, being lazy, should be empty.

This is the behavior of .NET, and also other engines I tested using https://regex101.com/. From this test, it looks like in Nim the second capturing group captures the whole string. This is the behavior I would expect from (a*?)(a*)

I didn't debug further, so perhaps I'm misunderstanding the test.

nitely commented 2 years ago

That API is doing a full match. If I test ^(a*?)(a*?)$ in https://regex101.com/ I get the same result. Just the capturing groups are returned, btw.

nitely commented 2 years ago

From the docs:

return a match if the whole string matches the regular expression.

nitely commented 2 years ago

moreover, the API that has the behavior you expect is find which would return two empty groups and the first match would be 0 .. -1 which means there is an empty match at index 0 (there is a field .boundaries that returns the whole match boundaries, instead of it being part of the groups).

Also, nim-regex returns all group repetitions, not just the last one in some of the APIs. This is a feature I'll remove at some point.

danmoseley commented 2 years ago

Ah, thanks for the explanation.

Also, nim-regex returns all group repetitions, not just the last one in some of the APIs. This is a feature I'll remove at some point.

If I understand correctly, this is what .NET calls Capture (Match being the whole thing, Group for each of the capturing groups in the match). I wasn't previously aware of any other engine that did this. I assume it's useful to some people, but haven't researched to see how commonly.

danmoseley commented 2 years ago

Oh, in the other difference in behavior I have seen so far, I believe in this case Nim is correct: https://github.com/dotnet/runtime/issues/62104

Do you have thoughts?

nitely commented 2 years ago

If I understand correctly, this is what .NET calls Capture (Match being the whole thing, Group for each of the capturing groups in the match).

capturing groups are supported by most regex engines. What I was referring to is capturing group repetitions. For example match("abc", re("(\w)+")) will return @[@["a"], @["b"], @["c"]] instead of just @["c"] like in most regex engines. The match macro API will return the last match, though.

Oh, in the other difference in behavior I have seen so far, I believe in this case Nim is correct: dotnet/runtime#62104

I believe JS behavior is correct. If you test ^(\w*)*$ you would expect the group to match aaab, why would lookbehind behavior be different?

nitely commented 2 years ago

oh, I tested the pattern in https://regex101.com/ and the group is empty, interesting.

nitely commented 2 years ago

If you test ^(\w)$ you would expect the group to match aaab

So, JS group match aaab, while PCRE and Python group match is empty.

nim-regex follows PCRE behavior so the lookbehind behavior may be a bug. But it follows JS behavior for lookarounds, so not sure. I need to dig into it.

danmoseley commented 2 years ago

capturing groups are supported by most regex engines. What I was referring to is capturing group repetitions. For example match("abc", re("(\w)+")) will return @[@["a"], @["b"], @["c"]] instead of just @["c"] like in most regex engines.

yes, that's what I meant. It seems that Nim and .NET support this. @stephentoub just curious do we have data on how often it's used in practice? It might be difficult to say.

nitely commented 2 years ago

Oh, in the other difference in behavior I have seen so far, I believe in this case Nim is correct: dotnet/runtime#62104

Looking at this test, it's the expected behavior. The rational is lookbehind is matched in reverse, so greediness is reversed.

If you test ^(\w)$ you would expect the group to match aaab, why would lookbehind behavior be different?

This is what I would expect, but then again, non-fixed lookarounds are only supported by JS and DLang. JS has this behavor so that's what I followed. Even though I don't I agree with it right now.

yes, that's what I meant. It seems that Nim and .NET support this. @stephentoub just curious do we have data on how often it's used in practice? It might be difficult to say.

I think it's a terrible feature. It makes nim-regex take O(N*M) space where N is the length of the text/string to match and M the length of the regex. Removing it will make it take O(M*C) space, where M is the length of the regex and C the number of capture groups.