scrapy / w3lib

Python library of web-related functions
BSD 3-Clause "New" or "Revised" License

Fix tests with Python 3.11.4+ #213

Closed. shadchin closed this 1 year ago

shadchin commented 1 year ago

Fix #212

codecov[bot] commented 1 year ago

Codecov Report

Merging #213 (8abe8ee) into master (5393dbe) will not change coverage. The diff coverage is n/a.

:exclamation: Current head 8abe8ee differs from pull request most recent head 6a5235a. Consider uploading reports for the commit 6a5235a to get more accurate results

@@           Coverage Diff           @@
##           master     #213   +/-   ##
=======================================
  Coverage   96.18%   96.18%           
=======================================
  Files           9        9           
  Lines         498      498           
  Branches       94       94           
=======================================
  Hits          479      479           
  Misses          9        9           
  Partials       10       10           

wRAR commented 1 year ago

Can you please run black?

shadchin commented 1 year ago

Done

Gallaecio commented 1 year ago

I had been delaying reviewing this change until I could set aside time to look into it properly, as it felt like something was not right.

After a proper review, I think this change is actually perfect as it is, so thank you very much @shadchin.

But I feel the need to clarify, for future reference, what has been happening here since Python 3.11.4, which comes down to two things:

  1. Python now fails as expected with invalid IPv6 domains. This is good news for us.
  2. Python now fails on user and password fields containing brackets ([]). This is bad news for us.
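Change 1 can be seen with the standard library alone; a standalone sketch (the bracketed host below is made up for illustration):

```python
from urllib.parse import urlsplit

# A bracketed host that is not a valid IPv4/IPv6 address.
url = "https://[example]/"

try:
    # Python >= 3.11.4: urlsplit() validates bracketed hosts and
    # raises ValueError for anything that is not an IP address.
    host = urlsplit(url).hostname
except ValueError:
    # Python <= 3.11.3: this branch is never reached; the bogus
    # host used to parse without complaint.
    host = None

print(host)
```

On 3.11.3 and earlier this prints `example`; on 3.11.4+ it prints `None` because the parse is rejected up front.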

To elaborate on 2:

On Python 3.11.3 and earlier, safe_url_string was able to fix https://user[]:password[]@example.com as a browser would:

>>> safe_url_string("https://user[]:password[]@example.com")
'https://user%5B%5D:password%5B%5D@example.com'

Since 3.11.4 that is no longer possible; the Python standard library cannot parse the URL at all:

>>> safe_url_string("https://user[]:password[]@example.com")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/adrian/temporal/w3lib-213/venv/lib/python3.11/site-packages/w3lib/url.py", line 142, in safe_url_string
    parts = urlsplit(_strip(decoded))
            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/adrian/.pyenv/versions/3.11.4/lib/python3.11/urllib/parse.py", line 500, in urlsplit
    _check_bracketed_host(bracketed_host)
  File "/home/adrian/.pyenv/versions/3.11.4/lib/python3.11/urllib/parse.py", line 446, in _check_bracketed_host
    ip = ipaddress.ip_address(hostname) # Throws Value Error if not IPv6 or IPv4
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/adrian/.pyenv/versions/3.11.4/lib/python3.11/ipaddress.py", line 54, in ip_address
    raise ValueError(f'{address!r} does not appear to be an IPv4 or IPv6 address')
ValueError: '' does not appear to be an IPv4 or IPv6 address
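If we still want to salvage such URLs on 3.11.4+, one option is to percent-encode the brackets in the userinfo portion before handing the URL to `urlsplit`. A minimal sketch; the helper name is my own hypothetical, not anything in w3lib:

```python
from urllib.parse import urlsplit

def escape_userinfo_brackets(url: str) -> str:
    """Percent-encode '[' and ']' in the userinfo part of an absolute URL.

    Hypothetical pre-processing step (not part of w3lib): since Python
    3.11.4, urlsplit() validates bracketed hosts, so stray brackets in
    the user/password fields make the whole URL unparsable.
    """
    sep = url.find("://")
    if sep == -1:
        return url
    rest = url[sep + 3:]
    # The userinfo ends at the last '@' before the path, if any.
    path_start = rest.find("/") if "/" in rest else len(rest)
    at = rest.rfind("@", 0, path_start)
    if at == -1:
        return url
    userinfo = rest[:at].replace("[", "%5B").replace("]", "%5D")
    return url[:sep + 3] + userinfo + rest[at:]

fixed = escape_userinfo_brackets("https://user[]:password[]@example.com")
print(fixed)                     # https://user%5B%5D:password%5B%5D@example.com
print(urlsplit(fixed).hostname)  # example.com
```

This reproduces the pre-3.11.4 output shown above and parses cleanly on every Python version, since the escaping happens before the standard library ever sees the brackets.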

This is fine for Python, but not for us: we strive to fix bad URLs where possible, and here it should be (and in fact used to be) possible.

I dream of some day having our own URL parsing implementation, with good performance but more browser-like behavior, so that we have control over things like this :sweat: .