lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License
4.77k stars 404 forks source link

Added interegular support #1258

Closed MegaIng closed 1 year ago

MegaIng commented 1 year ago

This is basically the same as #715 , with (I think) all requested changes being considered. I also added an Example to the log output that can show in what situation two terminals would collide.

I created a new PR after having a mess after a git rebase. This just seemed easier.

I guess it should be mentioned somewhere in the docs, not sure where.

MegaIng commented 1 year ago

So many checks, many of which are hard to execute locally tbh:

Oh, and the codecov workflow always fails with "nothing to run". Do you have an idea what is going on there?

erezsh commented 1 year ago

Sorry, I didn't look into the codecov yet. I'm okay with just disabling it in the meantime.

As for tox, makes sense that mypy would pass but regular tests won't. Mypy doesn't run the tests.

erezsh commented 1 year ago

The code looks good, but I'm not crazy about the warning message -

Collision between Terminals 'B' and 'A': Example(prefix='', main_text='', postfix='a')

I'm sure we can do better.

MegaIng commented 1 year ago

Doesn't seem like prefix and main_text are ever used? At least, I couldn't find an example where they have a value.

Uhm... This is then a bug in interegular. OTOH, I could have sworn the eample for A and B put the text into main_text where it belongs. (Edit: aha yeah, I know what the bug is: -0 == 0) The idea behind the three categories is that main_text is what actually get's captured by the regex and prefix and postfix are what is needed by lookbehind/lookahead. For example, the lark grammar without the change would produce a conflict on Example(prefix='', main_text='?', postfix='_')

The warning should explain the consequences of the collision. e.g. "The lexer will choose between them arbitrarily."

Fair, although I worry that a grammar would many collision would product a lot of warning text.

And ofc better not to use the repr version of Example.

I am unsure how to best format this. What prefix and postfix mean is kinda hard to explain in a short text. I guess I can do it multiline and use ^ to underline the main part, similar to python syntax errors.

MegaIng commented 1 year ago

Ok, I am now relatively happy with the warning message. It's decently complex code, but the result is IMO very readable. The formatting should now also never break, not even for newlines or weird unicode characters. Especially newlines is the reason I was using repr before.

MegaIng commented 1 year ago

I can add them here. Where would you want them?

erezsh commented 1 year ago

Hmm yeah there's no obvious spot. I'd say -

I'm open to ideas.

Anyway, once that's in, I wanted to add a strict flag, so Lark(..., strict=True) will throw an exception when there are collisions. (regex collision and also shift-reduce). So then users can also learn about it through the API docs.

MegaIng commented 1 year ago

how_to_use#debug is certainly a place it should belong, I wasn't even really aware that that section exists.

I assume strict is a separate PR?

MegaIng commented 1 year ago

Btw, the default config parser="earley", lexer="dynamic" doesn't benefit from this at all, right? How bad are regex collisions in that context?

erezsh commented 1 year ago

I assume strict is a separate PR?

I think so.

the default config parser="earley", lexer="dynamic" doesn't benefit from this at all, right?

Well, colliding regexes are still slower, because Earley has to try all variations. But it won't cause an error. So maybe a milder warning in that case? (assuming it doesn't complicate things too much)

MegaIng commented 1 year ago

So maybe a milder warning in that case? (assuming it doesn't complicate things too much)

I would have to look through the earley code base to find a good place to put this check. They don't construct a Lexer object after all, and especially not a ContextLexer object. So I don't know how to correctly do this.

erezsh commented 1 year ago

Umm actually that's only true for dynamic_complete. The default dynamic can still get errors from colliding regexes. (if it chooses the wrong one first)

(I'm talking about my comment)

erezsh commented 1 year ago

I rephrased the docs a little bit, let me know if you're okay with the changes: https://github.com/lark-parser/lark/pull/1260

Also, I just realized we're only showing shift/reduce warnings when debug=True. Maybe we should use the same convention for interegular?

MegaIng commented 1 year ago

Yeah, those changes are good. I am not that skilled with words :-)

Maybe we should use the same convention for interegular?

I was thinking about this as well. I decided against it, since regex collisions are from my understanding more severe than shift/reduce collisions. I.e. for shift/reduce the automatic resolving probably will do the correct thing, for regex collision, the automatic resolving will maybe do the correct thing.

erezsh commented 1 year ago

I think I agree. But I was also thinking about the possibility of interegular throwing exceptions, which might interfere with normal use. For example certain regex features it doesn't know about, or perhaps some edge case it wasn't expecting. If that's a possibility, then it would be nice to have a way to disable/enable it, that isn't installing/uninstalling it.

MegaIng commented 1 year ago

interegular should catch every exception it can throw and only produce a warning on it's internal logger. We probably should make the debug flag make these warnings visible. I don't believe that performance could ever be bogged down enough by these checks to be worth having a disable flag for that. If the performance is that important, create a standalone parser.

erezsh commented 1 year ago

We probably should make the debug flag make these warnings visible

That's not a bad idea. Makes sense that it would set the appropriate logger levels to DEBUG by default.

erezsh commented 1 year ago

Can you please show me where this exception gets caught? https://github.com/MegaIng/interegular/blob/master/interegular/fsm.py#L431

Maybe I'm missing something obvious.

MegaIng commented 1 year ago

https://github.com/MegaIng/interegular/blob/ffde88d87a1eb4209fc3bdad2153643b4944c064/interegular/fsm.py#L889

It's used internally inside of .fsm to correctly crawl the (potentially incomplete) FSM.

erezsh commented 1 year ago

Ah, yes. Missed that it's a callback.

erezsh commented 1 year ago

Thanks for the awesome new feature!

MegaIng commented 1 year ago

I also published a new version of interegular that should double the performance again. :-D