normalisation of urls containing non-ascii domains is broken and loses data

python-hyper / rfc3986

A Python Implementation of RFC3986 including validations

https://rfc3986.readthedocs.io/en/latest/


normalisation of urls containing non-ascii domains is broken and loses data #23

Open wbolster opened 8 years ago

wbolster commented 8 years ago

Initial parsing works:

>>> rfc3986.uri_reference('http://æåëý.com/path?query#fragment')
URIReference(scheme='http', authority='æåëý.com', path='/path', query='query', fragment='fragment')

Subsequent normalisation silently loses data:

>>> rfc3986.uri_reference('http://æåëý.com/path?query#fragment').normalize()
URIReference(scheme='http', authority=None, path='/path', query='query', fragment='fragment')

sigmavirus24 commented 8 years ago

Correct. We do not yet handle IRIs (RFC 3987).

wbolster commented 8 years ago

FWIW, a work-around that sort of "works": preprocess the URL by replacing the host name with its IDNA-encoded (xn--…) equivalent, using the URL parsing routines from the urllib3 package, before passing it to uri_reference().
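A minimal sketch of that kind of preprocessing, for illustration only. It is not the exact code referred to above: it swaps in the standard library's `urllib.parse` and built-in `idna` codec (IDNA 2003) in place of urllib3, and the helper name `idna_encode_host` is hypothetical. It assumes the URL has a regular host name; IPv6 literals and URLs without a host are returned unchanged.

```python
from urllib.parse import urlsplit, urlunsplit


def idna_encode_host(url):
    """Return url with its host name replaced by the IDNA (xn--) form.

    Hypothetical helper, not part of rfc3986 or urllib3.
    """
    parts = urlsplit(url)
    host = parts.hostname  # lower-cased host without userinfo or port
    if host is None or ":" in host:
        # No authority, or an IPv6 literal: nothing to encode.
        return url

    # The stdlib "idna" codec encodes each label separately (IDNA 2003).
    ascii_host = host.encode("idna").decode("ascii")

    # Rebuild the authority, keeping userinfo and port if present.
    netloc = ascii_host
    if parts.port is not None:
        netloc = f"{netloc}:{parts.port}"
    if parts.username is not None:
        userinfo = parts.username
        if parts.password is not None:
            userinfo = f"{userinfo}:{parts.password}"
        netloc = f"{userinfo}@{netloc}"

    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )


encoded = idna_encode_host("http://æåëý.com/path?query#fragment")
print(encoded)
```

The encoded string contains only ASCII characters, so it can then be handed to `rfc3986.uri_reference()` and normalised. Note that the stdlib codec follows the older IDNA 2003 rules; the third-party `idna` package would be needed for IDNA 2008 behaviour.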