Closed Zac-HD closed 6 years ago
Hi! I'm glad you found greenery
to have some entertainment value. I consider this to be a completed project at the moment, and it's been in "if it ain't broke, don't touch it" mode for several years.
If it does have bugs, then let's fix them. I can think of one case where x == parse(str(x))
would return False
but I would be unconcerned, for example when x
is charclass('a')
but parse(str(x))
is a full-blown pattern
object, pattern(conc(mult(charclass('a'), multiplier(bound(1), bound(1)))))
. I don't know if this is the kind of thing you're talking about. But if you have some examples which look more like bugs, we can take a look at fixing them.
I see in one case you've suggested changing a test where parsing "a.b()()"
would result in a lego
object where the empty subpatterns had been actively suppressed. In my opinion this introduces a x != str(parse(x))
bug because the empty subpatterns in the original have been removed after a round trip. In general I would prefer to keep parsing and reducing regular expressions as separate operations, and in fact not reduce
ing the returned lego
object at parsing time was an intentional decision.
Alright then! In that case I'll keep new features in my own copy, but bring tests and bugs upstream. Glad to set expectations up front :smile:
I'd encourage you to have a look at the property-tests
branch to reproduce this (and check that this isn't the only bug), but here's the test case:
a = conc(charclass('0'), pattern(charclass('1'), charclass('0')))
# str(a) == '00|1' # <- there's the bug - should be '0(0|1)'!
b = parse(str(a))
assert a.reduce() == b.reduce() # Fails
Note that str(parse('0(0|1)')) == '0(0|1)'
- it's not obvious to me that you can get this bug in a chunk of lego parsed from a string, but it's certainly possible from direct construction and I suspect from reduction too.
I would like to use the lego.equivalent
method for the assertion, but it's incredibly slow on non-trivial inputs - I didn't leave it running overnight, but otherwise the test wouldn't finish :disappointed: Possibly another issue to look into here?
(And I should note: happy to look into these myself if you have some tips about where to start!)
Right, I think the problem there is probably that pattern(charclass('1'), charclass('0'))
isn't throwing an exception. A pattern
should be a sequence of conc
objects, it shouldn't be permissible to use a charclass
there. I see that even pattern('a', 'b')
and pattern(782, 0)
seem to have no problems and they all str
ify just fine. It looks like there's almost no input validation going on in the lego
constructors. Which I think might have been intentional? Possibly I wasn't thinking about people building these objects by hand except when I do it internally for testing purposes? The constructors themselves are not documented, only lego.parse
is.
lego.equivalent
first converts the two lego
objects to finite state machines, which is potentially very time-consuming when large numbers of possible characters are involved. It then computes the symmetric difference of the two FSMs, also potentially time-consuming, and finally checks that the result is empty, which is effectively instantaneous. However, in this case, I find that parse('00|1').equivalent(parse('0(0|1)'))
returns False
pretty much instantly.
Sounds like usage errors on my end then! Thanks again for greenery
, and for your help here :smile:
I finally picked up
greenery
this afternoon, and have had a lot of fun with it. Before I spend much more time on it though I thought I'd check if you're interested in contributions:I wrote an alternative string-to-lego parser which leans heavily on the CPython
sre_parse
module - https://github.com/qntm/greenery/compare/master...Zac-HD:pyregex. It supports every construct that can bere.compile
d, but has somewhat worse errors at the moment (eg no context given when bailing on a groupref). Also needs more testing for eg repeats :smile:I also wrote some property-based tests with Hypothesis, https://github.com/qntm/greenery/compare/master...Zac-HD:property-tests. This has already turned up a few bugs of the form
x != parse(str(x))
for some lego object x, but there's little point looking for more if you don't consider this a bug worth fixing. (I originally started this on the parser branch, but similar problems seem to exist on master too)Either way, thanks for a great little library and a fun evening poking at it!