lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License

Limit amount of time/effort interegular will use when creating an example #1266

Closed MegaIng closed 1 year ago

MegaIng commented 1 year ago

Use limit_depth when creating an example for a collision to prevent unbounded memory and time usage.
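For context, the effect of a depth limit on example search can be sketched with a toy breadth-first search over candidate strings. This is an illustrative stand-in only; interegular actually searches the intersection of the two patterns' FSMs rather than enumerating strings:

```python
import re
from collections import deque

def find_common_example(pat_a, pat_b, alphabet, limit_depth=10):
    """Breadth-first search for a string fully matched by both patterns,
    giving up once candidates exceed `limit_depth` characters.
    Toy stand-in for a depth-limited FSM-intersection search."""
    a, b = re.compile(pat_a), re.compile(pat_b)
    queue = deque([""])
    while queue:
        s = queue.popleft()
        if a.fullmatch(s) and b.fullmatch(s):
            return s
        if len(s) < limit_depth:
            queue.extend(s + c for c in alphabet)
    return None  # no collision found within the depth bound

print(find_common_example("ab*", "a+b", "ab", limit_depth=4))  # → ab
```

Without the `limit_depth` bound, a pair of patterns with no short common string would keep the search growing without limit, which is exactly the unbounded memory and time usage the issue describes.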

An example of two regexes that do collide, but for which finding an example can take a long time:

```
A: /(?:#( |\t)*(?!if|ifdef|else|elif|endif|define|set|unset|error|exit)[^\n]+|(;|\\/\\/)[^\n]*)/
B: /\\#(?:(?:(?:\\ |\t))+)?(?i:error)/i
```

Coming from this grammar: https://github.com/adbancroft/TunerStudioIniParser/blob/master/ts_ini_parser/grammars/pre_processor.lark

However, even with this particular change, loading that grammar still takes a few seconds, since almost all of the regexes collide with each other.

The reason for many of these collisions is that the grammar is unclear about whether the keywords are case-insensitive. All of these warnings and collisions are therefore fully accurate: using #ErRor instead of #error will cause problems for the grammar.
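The case-sensitivity mismatch can be reproduced directly with Python's `re` (the backslashes below are de-escaped relative to the grammar file):

```python
import re

# A is case-sensitive: its lookahead only excludes the lowercase keywords.
A = re.compile(r'(?:#( |\t)*(?!if|ifdef|else|elif|endif|define|set|unset|error|exit)[^\n]+|(;|//)[^\n]*)')
# B matches the keyword case-insensitively.
B = re.compile(r'\#(?:(?:(?:\ |\t))+)?(?i:error)')

print(bool(A.fullmatch('#error')))  # False: the lookahead rejects the lowercase keyword
print(bool(A.fullmatch('#ERROR')))  # True: uppercase slips past the case-sensitive lookahead
print(bool(B.fullmatch('#ERROR')))  # True: so '#ERROR' is matched by both terminals
```

So the collision the tool reports is real: `#ERROR` is accepted by both terminals, and only the all-lowercase `#error` is unambiguous.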

erezsh commented 1 year ago

So perhaps we can limit ourselves to checking a fixed number of collisions, say 8, after which we'll just write something like "8 regex collisions reached; disabling detection. More collisions may exist."

Also, I think we can limit the example search time to 200ms or so, which is already quite a bit of time. But maybe for strict mode we could put it even higher, at 2000ms, because then we only have to call it once.
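Both ideas together could be sketched roughly like this. The callbacks `collide` and `find_example` are hypothetical placeholders standing in for interegular's pairwise checks, not Lark's actual API:

```python
import itertools

MAX_REPORTED_COLLISIONS = 8
EXAMPLE_TIME_LIMIT = 0.2  # seconds per example; could be 2.0 in strict mode

def report_collisions(terminals, collide, find_example):
    """Report up to MAX_REPORTED_COLLISIONS colliding terminal pairs,
    spending at most EXAMPLE_TIME_LIMIT on each example search.
    `collide` and `find_example` are hypothetical callbacks."""
    warnings = []
    for a, b in itertools.combinations(terminals, 2):
        if not collide(a, b):
            continue
        example = find_example(a, b, max_time=EXAMPLE_TIME_LIMIT)
        warnings.append((a, b, example))
        if len(warnings) >= MAX_REPORTED_COLLISIONS:
            warnings.append("8 regex collisions reached; disabling detection. "
                            "More collisions may exist.")
            break
    return warnings
```

The cap bounds total loading time even for grammars where almost every pair of terminals collides, since the pairwise check is quadratic in the number of terminals.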

What do you think?

MegaIng commented 1 year ago

Yeah, putting a limit on the total number of collisions is something I also considered.

For search time, I sadly don't have a good mapping from the parameter I can control to the time it takes. But I could experiment and see if I can figure something out.

MegaIng commented 1 year ago

Ok, I made a few improvements. It should now be relatively reliable, and I think there are a few changes I can make in interegular itself so that it manages to provide examples for all collisions in the above-mentioned grammar very quickly, where previously it only managed to report one or two. Via testing, I calculated an approximation for the max_time -> max_iterations conversion. This should be decently reliable unless the hardware or Python implementation is very slow.
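The max_time -> max_iterations calibration described above could look something like the following sketch. The constant is invented for illustration; the real value would come from benchmarking the search loop:

```python
# Assumed calibration constant: search iterations completed per second on
# typical CPython. Purely illustrative; the real figure would be measured.
ITERATIONS_PER_SECOND = 50_000

def max_iterations(max_time: float) -> int:
    """Translate a wall-clock budget (seconds) into an iteration cap,
    so the search stays bounded without checking the clock on every step."""
    return max(1, int(max_time * ITERATIONS_PER_SECOND))
```

Capping by iteration count rather than polling the clock keeps per-iteration overhead near zero, at the cost of the bound drifting on unusually slow hardware or interpreters, which matches the caveat above.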