nitely / regexy

:wavy_dash: Linear time regex matching supporting streams and other goodies
MIT License

Unexpected nested repeating sub-matches #8

Closed nitely closed 7 years ago

nitely commented 7 years ago

Nested capturing groups should return nested values when repeated. To illustrate:

match(r'(a(b(c)*)*)*', 'abbccabbcc')
# (('abbcc', 'abbcc'), ('b', 'bcc', 'b', 'bcc'), ('c', 'c', 'c', 'c'))

# expected:
# (('abbcc', 'abbcc'), (('b', 'bcc'), ('b', 'bcc')), ((None, ('c', 'c')), (None, ('c', 'c'))))

So, it's not that bad, everything gets captured at least. The fix should be fairly easy to do (I hope!).
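To make the difference between the flat and the nested result concrete, here is a rough sketch of how flat captures could be regrouped by span containment. It assumes every capture comes with a `(start, end)` span; the spans below are written by hand for this input and are not output from regexy, and `spans_inside` and `nest_captures` are hypothetical helper names, not part of the library.

```python
def spans_inside(spans, parent):
    """Return the spans that lie fully within the parent (start, end) span."""
    start, end = parent
    return [(s, e) for (s, e) in spans if start <= s and e <= end]


def nest_captures(text, spans_by_group):
    """Regroup flat per-group spans into one nested tuple per group.

    spans_by_group[i] holds the flat (start, end) spans of the i-th group,
    ordered from the outermost group to the innermost one.
    """
    def collect(level, target, parent):
        inside = spans_inside(spans_by_group[level], parent)
        if level == target:
            # Leaf: the captured substrings, or None if nothing repeated here.
            return tuple(text[s:e] for s, e in inside) or None
        # Otherwise produce one entry per repetition of the enclosing group.
        return tuple(collect(level + 1, target, span) for span in inside)

    whole = (0, len(text))
    return tuple(collect(0, target, whole) for target in range(len(spans_by_group)))


text = 'abbccabbcc'
# Hand-written spans for (a(b(c)*)*)*, one list per group (assumed, not from regexy).
spans = [
    [(0, 5), (5, 10)],                   # group 1: 'abbcc', 'abbcc'
    [(1, 2), (2, 5), (6, 7), (7, 10)],   # group 2: 'b', 'bcc', 'b', 'bcc'
    [(3, 4), (4, 5), (8, 9), (9, 10)],   # group 3: 'c', 'c', 'c', 'c'
]
nested = nest_captures(text, spans)
```

With these spans, `nested` has the shape of the expected result above: each inner group is split per repetition of its enclosing group, and a repetition with no inner capture shows up as `None`.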

nitely commented 7 years ago

However, I'm not sure about this. I would like to check how RE2 handles this.

Edit: It seems it actually works as intended. It does make sense why I would do it that way. But consider a file full of ducks, and say I want to know how many ducks there are per line. I'd write something like:

match(r'(?:(duck)*\n?)*', stream)  # This does not even work right now since \n is not supported

If the file has two lines with two ducks per line, then I'd (currently) get something like ((duck, duck, duck, duck),) instead of (((duck, duck), (duck, duck)),). I would very much like to know how many ducks there are per line :smile_cat:
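For comparison, the per-line grouping can be approximated today by splitting the input first and matching each line separately. This is only a workaround sketch using Python's standard `re` module, not regexy, and it assumes the whole input fits in memory as a string rather than arriving as a stream; `ducks_per_line` is a made-up helper name.

```python
import re


def ducks_per_line(text):
    """Return one tuple of 'duck' matches per line of the input text."""
    return [tuple(re.findall(r'duck', line)) for line in text.splitlines()]


# Two lines with two ducks each.
per_line = ducks_per_line('duckduck\nduckduck\n')
counts = [len(ducks) for ducks in per_line]
```

Here the line structure comes from `splitlines()` rather than from the pattern, which is exactly the information the flat capture result loses.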

nitely commented 7 years ago

The more I think about this, the more I think it's OK as it is. The initial goal was simply to capture all sub-matches instead of just the last one (which is what Python does). There are other issues with the behavior described here. For example, say the duck-filled file from the previous example has a bunch of empty lines. Then we would get something like (((duck, duck), None, None, None, (duck, duck)), None, None). That is required for consistency with other matches, so there is no way around it, and I don't know how useful it would actually be.
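As a point of reference for the "just the last one" behavior, this small sketch runs the pattern from the first comment through Python's standard `re` module (not regexy). Each repeated group reports only its final capture.

```python
import re

# Same pattern and input as in the first comment of this issue.
m = re.match(r'(a(b(c)*)*)*', 'abbccabbcc')

# One value per group: only the last repetition of each group survives.
last_only = m.groups()
```

`last_only` holds a single string per group, so the earlier repetitions that regexy returns are not available from `re` at all.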

So things stay the same. Closing.