python / cpython

The Python programming language
https://www.python.org
Other
63.31k stars 30.3k forks source link

A re.error occurs when a proxy is used to access an IPv6 address in the Windows operating system. #103239

Open llchry opened 1 year ago

llchry commented 1 year ago

Bug report

Prerequisites

  1. Windows operating system
  2. A proxy server is used.
  3. The trustlist of the proxy service contains '[*'.

Use the following code to request an IPv6 address such as https://[fa::01]:80/,the following error occurs: re.error: unterminated character set at position 0

req = request.Request(url=url, method=method, headers=headers, data=data)
resp = request.urlopen(req, timeout=timeout, context=ctx)

I noticed that the problem comes from the proxy_bypass_registry method in urllib/request.py, which reads the system's proxy configuration and checks whether the host needs a proxy, but the regular expression [* works fine on windows, but fails to verify in python. I thought about configuring *] on windows, but this won't save, so I'm guessing there's a difference in the regular expression implementation between the two systems. So can we add an escape processing in the code and replace [ with \[?

Your environment

sobolevn commented 1 year ago

Can you please update your example to be actually runnable? A link to the exact place (where you think the problem is) is also highly appreciated! 👍

ronaldoussoren commented 1 year ago

From code inspection there might be a problem here: https://github.com/python/cpython/blob/810d365b5eb2cf3043957ca2971f6e7a7cd87d0d/Lib/urllib/request.py#L2768

The code in the loop converts entries in the system proxy override list into python regular expressions, and does not special case '[' which is a special character in regular expressions but not in the system settings.

See also: https://learn.microsoft.com/en-us/windows-hardware/customize/desktop/unattend/microsoft-windows-ie-clientnetworkprotocolimplementation-hklmproxyoverride

llchry commented 1 year ago

From code inspection there might be a problem here:

https://github.com/python/cpython/blob/810d365b5eb2cf3043957ca2971f6e7a7cd87d0d/Lib/urllib/request.py#L2768

The code in the loop converts entries in the system proxy override list into python regular expressions, and does not special case '[' which is a special character in regular expressions but not in the system settings.

See also: https://learn.microsoft.com/en-us/windows-hardware/customize/desktop/unattend/microsoft-windows-ie-clientnetworkprotocolimplementation-hklmproxyoverride

Yes, I'm sure that's the problem, I'm not going to refer to the code yet. I'd like to see if there's a special treatment for '[' here, but I'm not sure if there's any other issue, and I see there's a special treatment for .*?.

llchry commented 1 year ago

Can you please update your example to be actually runnable? A link to the exact place (where you think the problem is) is also highly appreciated! 👍

You can use the following code to reproduce the problem, provided that the address needs to be accessed using a proxy and your computer's proxy configuration trustlist contains '[*'.

image

from urllib import request
import ssl

if __name__ == "__main__":
    CTX = ssl.SSLContext(ssl.PROTOCOL_TLSv1_2)
    CTX.load_cert_chain("D:\\Dev\\tools\\key\\kubecfg.crt", "D:\\Dev\\tools\\key\\kubecfg.key")
    req = request.Request(url="https://[fa::175:21:1:187]:9543/api/v1/namespaces/test/configmaps/test", method="GET", headers={}, data=None)
    resp = request.urlopen(req, context=CTX)
    print(resp)

The error stack information is as follows:

D:\temp\venv\Scripts\python.exe C:/Users/llchry/PycharmProjects/untitled/test.py
Traceback (most recent call last):
  File "C:\Users\llchry\PycharmProjects\untitled\test.py", line 8, in <module>
    resp = request.urlopen(req, context=CTX)
  File "D:\programs\python3\lib\urllib\request.py", line 214, in urlopen
    return opener.open(url, data, timeout)
  File "D:\programs\python3\lib\urllib\request.py", line 517, in open
    response = self._open(req, data)
  File "D:\programs\python3\lib\urllib\request.py", line 534, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "D:\programs\python3\lib\urllib\request.py", line 494, in _call_chain
    result = func(*args)
  File "D:\programs\python3\lib\urllib\request.py", line 802, in <lambda>
    meth(r, proxy, type))
  File "D:\programs\python3\lib\urllib\request.py", line 810, in proxy_open
    if req.host and proxy_bypass(req.host):
  File "D:\programs\python3\lib\urllib\request.py", line 2773, in proxy_bypass
    return proxy_bypass_registry(host)
  File "D:\programs\python3\lib\urllib\request.py", line 2758, in proxy_bypass_registry
    if re.match(test, val, re.I):
  File "D:\programs\python3\lib\re.py", line 191, in match
    return _compile(pattern, flags).match(string)
  File "D:\programs\python3\lib\re.py", line 304, in _compile
    p = sre_compile.compile(pattern, flags)
  File "D:\programs\python3\lib\sre_compile.py", line 788, in compile
    p = sre_parse.parse(p, flags)
  File "D:\programs\python3\lib\sre_parse.py", line 955, in parse
    p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
  File "D:\programs\python3\lib\sre_parse.py", line 444, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
  File "D:\programs\python3\lib\sre_parse.py", line 550, in _parse
    raise source.error("unterminated character set",
re.error: unterminated character set at position 0
llchry commented 1 year ago

In my case, adding a line below this would solve my problem. Any other better suggestions? test = test.replace("[", r"\[")

https://github.com/python/cpython/blob/c00dcf0e381a090f7e905f887a98bbf63c88af5a/Lib/urllib/request.py#L2774

ronaldoussoren commented 1 year ago

That would fix this particular issue, but I'm a bit worried about other special characters in regular expressions, even if those shouldn't be in a proxy exclude list in practice. That said, I'm not an urllib or windows expert (I don't even have access to Windows to test any changes on).

The code block below defines a function to_re that could replace this problematic loop, but is not tested beyond the two print statements at the end. I'm not convinced yet that using this would be an improvement to the code though, especially for a bug fix that will be back ported to stable releases.

Btw. I initially looked at fnmatch to replace the loop, but that supports character classes as well which aren't wanted here.


import re

_SPECIAL = {
    '*': '.*',
    '?': '.',
}
def to_re(hostname_pattern):
    return "".join(
        _SPECIAL[x] if x in _SPECIAL else re.escape(x)
        for x in re.split(r"([.*])", hostname_pattern))

print(f'{to_re("*.python.org")=}')
print(f'{to_re("[*")=}')