Failure to parse URL that doesn't start with scheme but contains a query that contains a URL with a scheme

splunk / utbox

URL Toolbox (UTBox) is a set of building blocks for Splunk specially created for URL manipulation. UTBox has been created to be modular, easy to use and easy to deploy in any Splunk environment.

https://preview.splunkbase.splunk.com/app/2734/

Apache License 2.0

Failure to parse URL that doesn't start with scheme but contains a query that contains a URL with a scheme #13

Open mlhdeveloper opened 2 weeks ago

mlhdeveloper commented 2 weeks ago

Here's a simple example URL that fails to parse:

somedomain.com/?g=http://somedomain.com/

If you add a scheme to the front, it then parses properly:

http://somedomain.com/?g=http://somedomain.com/
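
For comparison, here is a small sketch of how Python's standard urllib.parse.urlsplit splits the same two URLs. This is only the standard library, not UTBox's own parsing code, and the variable names are mine:

```python
from urllib.parse import urlsplit

# The two example URLs from this report.
without_scheme = "somedomain.com/?g=http://somedomain.com/"
with_scheme = "http://somedomain.com/?g=http://somedomain.com/"

for url in (without_scheme, with_scheme):
    parts = urlsplit(url)
    # Print each component so the difference between the two inputs is visible.
    print(repr(parts.scheme), repr(parts.netloc), repr(parts.path), repr(parts.query))
```

With the scheme-less input, urlsplit reports an empty scheme and an empty netloc and leaves the host in the path, while the query is the same in both cases. So the embedded http:// in the query is not mistaken for the URL's own scheme there.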

mlhdeveloper commented 2 weeks ago

I think the solution lies in changing these parts so that they only check for :// near the beginning of the URL rather than anywhere in the URL:

https://github.com/splunk/utbox/blob/f8db838d28117f15fc406b6fc980d2963776ab37/utbox/bin/ut_parse_lib.py#L15
https://github.com/splunk/utbox/blob/f8db838d28117f15fc406b6fc980d2963776ab37/utbox/bin/ut_parse_lib.py#L257-L258

mlhdeveloper commented 2 weeks ago

I think this fixes it: the pattern only matches :// at the very beginning of the URL or directly after a scheme, i.e. after any number of lowercase letters or + characters (based on the schemes handled by urllib.parse):

preg_rfc1808 = re.compile("^[a-z+]*://")
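
A quick sketch to sanity-check the pattern against the example URLs. The `anywhere` pattern is only a hypothetical stand-in for a check that finds :// at any position (it is not copied from ut_parse_lib.py), and `has_scheme_prefix` is a helper name made up for this test:

```python
import re

# Hypothetical stand-in for a check that finds "://" anywhere in the URL.
anywhere = re.compile("://")

# The anchored pattern proposed above.
preg_rfc1808 = re.compile("^[a-z+]*://")


def has_scheme_prefix(pattern, url):
    """Return True if the pattern finds a '://' marker in the URL."""
    return pattern.search(url) is not None


urls = [
    "somedomain.com/?g=http://somedomain.com/",
    "http://somedomain.com/?g=http://somedomain.com/",
    "svn+ssh://somedomain.com/repo",
]

for url in urls:
    # Compare the unanchored check with the anchored one for each URL.
    print(url, has_scheme_prefix(anywhere, url), has_scheme_prefix(preg_rfc1808, url))
```

The anchored pattern rejects the scheme-less URL and still accepts the other two. Two caveats: it is case-sensitive, so HTTP:// would not match unless re.IGNORECASE is added, and RFC 3986 also allows digits, - and . in scheme names, which [a-z+] does not cover.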