sopel-irc / sopel

:robot::speech_balloon: An easy-to-use and highly extensible IRC Bot framework. Formerly Willie.
https://sopel.chat
Other
948 stars 402 forks source link

safety: `safeify_url()` function can't handle (invalid) URLs with `[hostname]` placeholder #2644

Open dgw opened 1 week ago

dgw commented 1 week ago

If someone sends a link while safety is active in the channel, and the URL contains a placeholder hostname in square brackets, Sopel will spit out an "Unexpected ValueError" message. Note: Seems to happen only on Python 3.11 or higher.

This is from the safeify_url() function added in #2279, which uses urllib.parse.urlparse() to make sanitizing URL parts easier, which in turn uses ipaddress.ip_address() to raise an error for bracketed IPv4 addresses—and trips on this weird edge case:

https://github.com/sopel-irc/sopel/blob/e7d86481d186f2faafa01cc46131485d2b4966d4/sopel/builtins/safety.py#L127-L133

Simple examples using the Python console:

>>> # Python 3.10
>>> safety.safeify_url('http://[Target-IP]/cgi-bin/account_mgr.cgi')
'hxxp://[Target-IP]/cgi-bin/account_mgr.cgi'
>>> # Python 3.11
>>> safety.safeify_url('http://[Target-IP]/cgi-bin/account_mgr.cgi')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/dgw/github/sopel-irc/sopel/sopel/builtins/safety.py", line 129, in safeify_url
    parts = urlparse(url)
            ^^^^^^^^^^^^^
  File "/home/dgw/.pyenv/versions/3.11.10/lib/python3.11/urllib/parse.py", line 395, in urlparse
    splitresult = urlsplit(url, scheme, allow_fragments)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dgw/.pyenv/versions/3.11.10/lib/python3.11/urllib/parse.py", line 500, in urlsplit
    _check_bracketed_host(bracketed_host)
  File "/home/dgw/.pyenv/versions/3.11.10/lib/python3.11/urllib/parse.py", line 446, in _check_bracketed_host
    ip = ipaddress.ip_address(hostname) # Throws Value Error if not IPv6 or IPv4
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dgw/.pyenv/versions/3.11.10/lib/python3.11/ipaddress.py", line 54, in ip_address
    raise ValueError(f'{address!r} does not appear to be an IPv4 or IPv6 address')
ValueError: 'Target-IP' does not appear to be an IPv4 or IPv6 address

The version inconsistency is going to be the worst part of designing a "correct" fix for this. A simple fallback approach (such as return url.replace('http', 'hxxp', 1) if url.startswith('http') else url) will miss more complicated cases that are still handled fine in older Python versions (output using 3.10 shown here):

>>> safety.safeify_url('http://[Target.IP]/cgi-bin/account_mgr.cgi')  # dot gets bracketed
'hxxp://[Target[.]IP]/cgi-bin/account_mgr.cgi'
>>> safety.safeify_url('http://[Target:IP]/cgi-bin/account_mgr.cgi')  # colon gets bracketed
'hxxp://[Target[:]IP]/cgi-bin/account_mgr.cgi'

_Do note though that all this is an edge case of an edge case. People must intentionally construct these invalid URLs, and can be trained to simply use another type of bracket for placeholders instead, such as http://<Target-IP>/cgi-bin/account_mgr.cgi._

half-duplex commented 6 days ago

PR opened to be more graceful about it, but considering http://<foo>/ and http://999.999.999.999/ are parsed successfully, I think this is a urllib bug?

>>> urlparse("http://<test>/")
ParseResult(scheme='http', netloc='<test>', path='/', params='', query='', fragment='')
dgw commented 6 days ago

but considering http://<foo>/ and http://999.999.999.999/ are parsed successfully, I think this is a urllib bug?

Square brackets are special. The reported "error" URL is actually invalid per RFC 3986 § 3.2.2:

A host identified by an Internet Protocol literal address, version 6 [RFC3513] or later, is distinguished by enclosing the IP literal within square brackets ("[" and "]"). This is the only place where square bracket characters are allowed in the URI syntax.

The only thing that's supposed to go in square brackets in a URI is an IPv6 (or later) literal, full stop. You could probably even get away with not "safeifying" these invalid links at all, since they can't be followed anyway.

And yes, I know that still leaves us with inconsistent behavior between different Python versions. But Python apparently decided urllib should be stricter about following the URI spec. 🤷‍♂️