python-validators / validators

Python Data Validation for Humans™.
MIT License

URL validation should use urlparse #189

Closed: waynew closed 1 year ago

waynew commented 3 years ago

Python already parses URLs, and does it correctly:

>>> from urllib.parse import urlparse
>>> urlparse('https://google.com')
ParseResult(scheme='https', netloc='google.com', path='', params='', query='', fragment='')
>>> urlparse('gopher://gopher.waynewerner.com')
ParseResult(scheme='gopher', netloc='gopher.waynewerner.com', path='', params='', query='', fragment='')
>>> urlparse('tel://555-555-5555')
ParseResult(scheme='tel', netloc='555-555-5555', path='', params='', query='', fragment='')
>>> urlparse('file:///path-to-some-file')
ParseResult(scheme='file', netloc='', path='/path-to-some-file', params='', query='', fragment='')
>>> urlparse('missing-scheme.com')
ParseResult(scheme='', netloc='', path='missing-scheme.com', params='', query='', fragment='')

I had to chase down this library because click-params uses validators to validate URLs, and totally valid URLs fail validation because this library doesn't expect their scheme 😞

rcirca commented 3 years ago

It doesn't validate, though. It parses, sure, but using it for validation is not good: http:////.google.com would be considered valid based on urllib.

waynew commented 3 years ago

You're right, it would, and it is a valid URL:

>>> urlparse('http:////.google.com')
ParseResult(scheme='http', netloc='', path='//.google.com', params='', query='', fragment='')

path might not be a valid domain name, but that's an entirely different problem. Interestingly enough, http:////.google.com works fine if you type it into your address bar in Chrome, though it fails if you click that link. http:////google.com works, though.

rcirca commented 3 years ago

Well, technically it's wrong to put that in the 'path'; shouldn't google.com be the 'netloc'? Yeah, without the period it works fine in Chrome, but not in Safari 😅

waynew commented 3 years ago

Well, it is the path, strictly speaking, and should be rejected because there is no netloc.
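To make the "no netloc, so reject it" idea concrete, here is a rough sketch of a structural check built on urlparse. The helper name has_scheme_and_netloc is made up for illustration and is not part of validators or click-params; it only checks that a scheme and a netloc are present and says nothing about whether the host is a real domain.

```python
from urllib.parse import urlparse


def has_scheme_and_netloc(url: str) -> bool:
    """Hypothetical helper: structural check only, not full URL validation."""
    try:
        parts = urlparse(url)
    except ValueError:
        # urlparse raises ValueError for some malformed inputs,
        # for example an unclosed IPv6 bracket in the netloc.
        return False
    # Accept any scheme, but require both a scheme and a netloc.
    return bool(parts.scheme) and bool(parts.netloc)


if __name__ == "__main__":
    for candidate in (
        "https://google.com",
        "gopher://gopher.waynewerner.com",
        "http:////.google.com",
        "missing-scheme.com",
    ):
        print(candidate, has_scheme_and_netloc(candidate))
```

One caveat with this sketch: it also rejects file:///path-to-some-file, because that URL has an empty netloc, so a real validator would probably need per-scheme rules on top of this.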

Just because it's a real URL doesn't mean you can get there from here 🤣

waynew commented 1 year ago

Awesome! Thanks for your efforts 🎉🚀👍