Closed MegaIng closed 1 year ago
So many checks, many of which are hard to execute locally tbh:
tox -e mypy
works, although tox needs to be installed first (well, as long as the dependencies are added in the correct places)tox -e test
doesn't work, fails with some mysterious git submodule error (didn't look into this) Luckily, one of the 5 other ways to execute tests does work.tox -e format
doesn't work with the same error message. I don't know how to manually execute this, so I have to just push and see what github says.Oh, and the codecov workflow always fails with "nothing to run". Do you have an idea what is going on there?
Sorry, I didn't look into the codecov yet. I'm okay with just disabling it in the meantime.
As for tox, makes sense that mypy would pass but regular tests won't. Mypy doesn't run the tests.
The code looks good, but I'm not crazy about the warning message -
Collision between Terminals 'B' and 'A': Example(prefix='', main_text='', postfix='a')
I'm sure we can do better.
Doesn't seem like prefix and main_text are ever used? At least, I couldn't find an example where they have a value.
The warning should explain the consequences of the collision. e.g. "The lexer will choose between them arbitrarily."
And ofc better not to use the repr version of Example.
Doesn't seem like prefix and main_text are ever used? At least, I couldn't find an example where they have a value.
Uhm... This is then a bug in interegular. OTOH, I could have sworn the eample for A
and B
put the text into main_text
where it belongs. (Edit: aha yeah, I know what the bug is: -0 == 0
)
The idea behind the three categories is that main_text
is what actually get's captured by the regex and prefix
and postfix
are what is needed by lookbehind/lookahead. For example, the lark grammar without the change would produce a conflict on Example(prefix='', main_text='?', postfix='_')
The warning should explain the consequences of the collision. e.g. "The lexer will choose between them arbitrarily."
Fair, although I worry that a grammar would many collision would product a lot of warning text.
And ofc better not to use the repr version of Example.
I am unsure how to best format this. What prefix
and postfix
mean is kinda hard to explain in a short text. I guess I can do it multiline and use ^
to underline the main part, similar to python syntax errors.
Ok, I am now relatively happy with the warning message. It's decently complex code, but the result is IMO very readable. The formatting should now also never break, not even for newlines or weird unicode characters. Especially newlines is the reason I was using repr before.
I can add them here. Where would you want them?
Hmm yeah there's no obvious spot. I'd say -
I'm open to ideas.
Anyway, once that's in, I wanted to add a strict
flag, so Lark(..., strict=True)
will throw an exception when there are collisions. (regex collision and also shift-reduce). So then users can also learn about it through the API docs.
how_to_use#debug is certainly a place it should belong, I wasn't even really aware that that section exists.
I assume strict
is a separate PR?
Btw, the default config parser="earley", lexer="dynamic"
doesn't benefit from this at all, right? How bad are regex collisions in that context?
I assume strict is a separate PR?
I think so.
the default config parser="earley", lexer="dynamic" doesn't benefit from this at all, right?
Well, colliding regexes are still slower, because Earley has to try all variations. But it won't cause an error. So maybe a milder warning in that case? (assuming it doesn't complicate things too much)
So maybe a milder warning in that case? (assuming it doesn't complicate things too much)
I would have to look through the earley code base to find a good place to put this check. They don't construct a Lexer object after all, and especially not a ContextLexer
object. So I don't know how to correctly do this.
Umm actually that's only true for dynamic_complete. The default dynamic can still get errors from colliding regexes. (if it chooses the wrong one first)
(I'm talking about my comment)
I rephrased the docs a little bit, let me know if you're okay with the changes: https://github.com/lark-parser/lark/pull/1260
Also, I just realized we're only showing shift/reduce warnings when debug=True. Maybe we should use the same convention for interegular?
Yeah, those changes are good. I am not that skilled with words :-)
Maybe we should use the same convention for interegular?
I was thinking about this as well. I decided against it, since regex collisions are from my understanding more severe than shift/reduce collisions. I.e. for shift/reduce the automatic resolving probably will do the correct thing, for regex collision, the automatic resolving will maybe do the correct thing.
I think I agree. But I was also thinking about the possibility of interegular throwing exceptions, which might interfere with normal use. For example certain regex features it doesn't know about, or perhaps some edge case it wasn't expecting. If that's a possibility, then it would be nice to have a way to disable/enable it, that isn't installing/uninstalling it.
interegular should catch every exception it can throw and only produce a warning on it's internal logger. We probably should make the debug
flag make these warnings visible. I don't believe that performance could ever be bogged down enough by these checks to be worth having a disable flag for that. If the performance is that important, create a standalone parser.
We probably should make the debug flag make these warnings visible
That's not a bad idea. Makes sense that it would set the appropriate logger levels to DEBUG by default.
Can you please show me where this exception gets caught? https://github.com/MegaIng/interegular/blob/master/interegular/fsm.py#L431
Maybe I'm missing something obvious.
It's used internally inside of .fsm
to correctly crawl the (potentially incomplete) FSM.
Ah, yes. Missed that it's a callback.
Thanks for the awesome new feature!
I also published a new version of interegular that should double the performance again. :-D
This is basically the same as #715 , with (I think) all requested changes being considered. I also added an Example to the log output that can show in what situation two terminals would collide.
I created a new PR after having a mess after a git rebase. This just seemed easier.
I guess it should be mentioned somewhere in the docs, not sure where.