The first commit contains two changes. The first ensures that . characters in the URL and IP address patterns match only a literal . instead of being interpreted as "any character". This makes the patterns both more precise and more efficient.
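To illustrate the escaping fix, here is a minimal sketch in Python. The two IPv4-like patterns are hypothetical stand-ins, not the actual patterns touched by this PR; they only show how an unescaped dot accepts input that an escaped dot rejects.

```python
import re

# Hypothetical patterns for illustration only; not the PR's real regexes.
# An unescaped "." matches any character (except a newline by default).
LOOSE_IP = re.compile(r"^\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}$")
# An escaped "\." matches only a literal dot.
STRICT_IP = re.compile(r"^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$")


def matches(pattern, text):
    """Return True if the pattern matches the text from its start."""
    return pattern.match(text) is not None


print(matches(LOOSE_IP, "10.0.0.1"))   # True
print(matches(LOOSE_IP, "10x0y0z1"))   # True: "." accepted "x", "y" and "z"
print(matches(STRICT_IP, "10.0.0.1"))  # True
print(matches(STRICT_IP, "10x0y0z1"))  # False: only literal dots are accepted
```

With the escaped version the engine also has fewer alternatives to try at each separator position, which is presumably the efficiency gain mentioned above.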
The first commit also fixes the TLD part of the WWW_HOST pattern. Specifically, ^s was switched to a-z. ^s matches any character except s; it is assumed that ^\s was intended (to match any non-whitespace character). However, the a-z check is simpler and more precise.
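The difference between the three variants can be sketched as follows. This assumes the expressions appear inside a character class (for example `[^s]`), and the TLD-style fragments below are hypothetical; the real WWW_HOST pattern is not reproduced here.

```python
import re

# Hypothetical TLD-style fragments for illustration only.
NOT_S = re.compile(r"^[^s]{2,}$")       # any character except the letter "s"
NOT_SPACE = re.compile(r"^[^\s]{2,}$")  # any non-whitespace character
LOWER = re.compile(r"^[a-z]{2,}$")      # lowercase ASCII letters only


def accepts(pattern, text):
    """Return True if the whole text is accepted by the pattern."""
    return pattern.match(text) is not None


print(accepts(NOT_S, "studio"))    # False: rejected only because it contains "s"
print(accepts(NOT_S, "c m"))       # True: a space is "not s", so it slips through
print(accepts(NOT_SPACE, "c0m!"))  # True: digits and punctuation are non-whitespace
print(accepts(LOWER, "studio"))    # True
print(accepts(LOWER, "c m"))       # False
print(accepts(LOWER, "c0m!"))      # False
```

The `a-z` class is the narrowest of the three, which matches the reasoning above that it is both simpler and more precise.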
The second commit addresses potential backtracking inefficiency by replacing [^>]* with [^>]*?. The additional ? makes the match non-greedy, so the engine has to backtrack as little as possible to find a match.
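As a rough sketch of the greedy versus non-greedy forms, the snippet below uses a hypothetical tag-like pattern (not the PR's real one). Because `[^>]` can never consume a `>`, both forms return the same match here; the change affects how the engine searches, not what it finds.

```python
import re

# Hypothetical tag-like patterns for illustration only.
GREEDY = re.compile(r"<a\b[^>]*>")   # consume as much as possible, then give back
LAZY = re.compile(r"<a\b[^>]*?>")    # consume as little as possible, then extend

html = '<a href="https://example.com" class="link">text</a>'

greedy_match = GREEDY.match(html).group(0)
lazy_match = LAZY.match(html).group(0)

print(greedy_match)                # <a href="https://example.com" class="link">
print(greedy_match == lazy_match)  # True: same result, different search order
```

How much backtracking is actually saved depends on the regex engine and on what follows the quantifier in the full pattern, so measuring against representative input is a reasonable check.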
After discussing with @dmihalcik-virtru, we decided to further simplify the WWW_HOST pattern. We were able to remove the www and negative www check, remove the group capturing, and simplify the subdomain pattern. Overall, this notably improves the performance and readability of this pattern.
This PR contains changes to address regex issues discovered by CodeQL:
Changes (see also commit messages):
- The first commit contains two changes. The first ensures that `.` characters in the URL and IP address patterns match only a literal `.` instead of being interpreted as "any character". This makes the patterns both more precise and more efficient.
- The first commit also fixes the TLD part of the `WWW_HOST` pattern. Specifically, `^s` was switched to `a-z`. `^s` matches any character except `s`; it is assumed that `^\s` was intended (to match any non-whitespace character). However, the `a-z` check is simpler and more precise.
- The second commit addresses potential backtracking inefficiency by replacing `[^>]*` with `[^>]*?`. The additional `?` makes the match non-greedy, so the engine has to backtrack as little as possible to find a match.

After discussing with @dmihalcik-virtru, we decided to further simplify the `WWW_HOST` pattern. We were able to remove the `www` and negative `www` check, remove the group capturing, and simplify the subdomain pattern. Overall, this notably improves the performance and readability of this pattern.