Closed shadchin closed 1 year ago
Merging #213 (8abe8ee) into master (5393dbe) will not change coverage. The diff coverage is
n/a
.:exclamation: Current head 8abe8ee differs from pull request most recent head 6a5235a. Consider uploading reports for the commit 6a5235a to get more accurate results
@@ Coverage Diff @@
## master #213 +/- ##
=======================================
Coverage 96.18% 96.18%
=======================================
Files 9 9
Lines 498 498
Branches 94 94
=======================================
Hits 479 479
Misses 9 9
Partials 10 10
Can you please run black
?
Done
I have been delaying reviewing this change until I could put some time aside to properly look into it, as it felt to me like something was not right.
After a proper review, I think this change is actually perfect as it is, so thank you very much @shadchin.
But I feel the need to clarify, for future reference, what’s happening here since Python 3.11.4, which is 2 things:
[]
). This is bad news for us.To elaborate on 2:
On Python 3.11.3 and earlier, safe_url_string
was able to fix https://user[]:password[]@example.com
as a browser would:
>>> safe_url_string("https://user[]:password[]@example.com")
'https://user%5B%5D:password%5B%5D@example.com'
Since 3.11.4, that’s not possible, the URL is unparsable altogether by the Python standard library:
>>> safe_url_string("https://user[]:password[]@example.com")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/adrian/temporal/w3lib-213/venv/lib/python3.11/site-packages/w3lib/url.py", line 142, in safe_url_string
parts = urlsplit(_strip(decoded))
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/adrian/.pyenv/versions/3.11.4/lib/python3.11/urllib/parse.py", line 500, in urlsplit
_check_bracketed_host(bracketed_host)
File "/home/adrian/.pyenv/versions/3.11.4/lib/python3.11/urllib/parse.py", line 446, in _check_bracketed_host
ip = ipaddress.ip_address(hostname) # Throws Value Error if not IPv6 or IPv4
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/adrian/.pyenv/versions/3.11.4/lib/python3.11/ipaddress.py", line 54, in ip_address
raise ValueError(f'{address!r} does not appear to be an IPv4 or IPv6 address')
ValueError: '' does not appear to be an IPv4 or IPv6 address
This is OK for Python, but not for us, that strive to fix bad URLs where possible, and here it should be (and in fact used to be) possible.
I dream to some day have our own URL parsing implementation, with a good performance but a more browser-like behavior, so that we have control over things like this :sweat: .
Fix #212