ua-parser / uap-go

Go implementation of ua-parser
Other
357 stars 106 forks source link

Consider transforming regexes for better finite automaton compatibility #91

Open masklinn opened 1 month ago

masklinn commented 1 month ago

I understand that Go's regexp packages uses finite automaton. Working on uap-rust, I learned that uap-core's extensive use of large bounded repetition (.{x,y} with a large y) causes a state explosion in FA engines (which makes sense), you can see some of it at https://github.com/rust-lang/regex/discussions/1206#discussioncomment-9976284 as I asked the rust-regex author for assistance (I was investigating the large memory use of my rust-regex-based implementation compared to a C++ re2 implementation).

Assuming Go's regexp is inspired by or derived from re2, it should be much less sensible than rust-regexp, and should not have the performance traps / cliffs of jitted DFA, but I would still expect that the bounded repetitions significantly increase memory use for no value.

Experimentally, in uap-rust I found that converting repetitions with an upper bound of 100 or more (so 3+ digits) had no functional implication, with all of those having been added to mitigate catastrophic backtracking. Below 100 things are more risky, I found at least one regex which uses a functional repetition with an upper bound of 50.