Open mjpieters opened 1 year ago
@orsenthil (as an urllib module expert)
I think this should return a valid object only if the hostname starts with a [
and ends with ]
. With the current logic any string with [
and ]
inside the hostname is being considered as a valid IPv6 or IPvFutire hostnames. I believe we must verify the start and end characters of a hostname as mentioned here for IPv6 and optionally IPvFuture.
Just to add to this thread,
I'm using django, django-environ and postgres. This issue is quite significant since I, unfortunately, happen to have brackets inside the postgres passwords in several prod/staging servers. This password is given inside a URL to django-environ that parses it (using urllib) then sends it back to django.
As far as Python 3.11.3, everything was going well, but since 3.11.4 it's all broken now.
It is related to what is stated above. urlsplit now spots '[' and ']' inside the netloc and what's inside theses brackets (a fragment of the password) is wrongfully considered a hostname. It then throws an exception since it tries to convert it to an ip address.
The lines that seem to be of importance : in urllib/parse.py -- urlsplit()
if '[' in netloc and ']' in netloc:
bracketed_host = netloc.partition('[')[2].partition(']')[0]
_check_bracketed_host(bracketed_host)
It feels like an important breaking change/regression that doesn't seem documented
Just to add to this thread,
I'm using django, django-environ and postgres. This issue is quite significant since I, unfortunately, happen to have brackets inside the postgres passwords in several prod/staging servers. This password is given inside a URL to django-environ that parses it (using urllib) then sends it back to django.
As far as Python 3.11.3, everything was going well, but since 3.11.4 it's all broken now.
It is related to what is stated above. urlsplit now spots '[' and ']' inside the netloc and what's inside theses brackets (a fragment of the password) is wrongfully considered a hostname. It then throws an exception since it tries to convert it to an ip address.
The lines that seem to be of importance : in urllib/parse.py -- urlsplit()
if '[' in netloc and ']' in netloc: bracketed_host = netloc.partition('[')[2].partition(']')[0] _check_bracketed_host(bracketed_host)
It feels like an important breaking change/regression that doesn't seem documented
I experienced the same problem. Just to be sure I also checked the URL spec referenced in the code and it states that username and password can be an ASCII string, so this is clearly a bug.
@orsenthil Is it OK if I open a PR for this issue for you to look at?
@bcail , yes please.
@orsenthil I opened a PR.
Hi @orsenthil. Are you going to be able to take a look at my PR? If not, in a week or so I can post on Discourse and see if someone else can take a look. Thanks.
103848 updated the URL parsing algorithm to handle IPv6 and IPvFuture addresses when parsing URLs.
However, the algorithm is incomplete.
[
and]
are only permitted in the hostname portion if they are the first and last characters and only if they then contain an IPv6 or IPvFuture address. The current implementation ignores everything before the first[
and everything after the first]
found in thenetloc
portion.The WhatWG URL standard states that
[
and]
are forbidden characters in a hostname, and the host parser only looks for IPv6 or IPvFuture if the[
and]
characters are the first and last characters of the section, respectively.The current implementation thus accepts such bizarre hostnames as:
http://prefix.[v1.example]/
http://[v1.example].postfix/
but then only reports the portion between the brackets as the hostname:
The
.netloc
attribute, in both cases, contains the whole string.Both URLs should have been rejected instead.
Your environment
Linked PRs