john-kurkowski / tldextract

Accurately separates a URL’s subdomain, domain, and public suffix, using the Public Suffix List (PSL).
BSD 3-Clause "New" or "Revised" License
1.81k stars 211 forks source link

IPv6 addresses are not handled #263

Closed ohad-ivix closed 12 months ago

ohad-ivix commented 2 years ago

For URLs using IPv4 addresses, the host address gets extracted correctly using .domain but .fqdn gives the empty string:

>>> tldextract.extract("https://127.0.0.1:1234/foobar").domain
'127.0.0.1'
>>> tldextract.extract("https://127.0.0.1:1234/foobar").fqdn
''

For URLs using IPv6 addresses, neither method extracts the host address correctly:

>>> tldextract.extract("https://[::1]:1234/foobar").domain
'['
>>> tldextract.extract("https://[::1]:1234/foobar").fqdn
'' 
>>> tldextract.extract("https://[FEC0:0000:0000:0000:0000:0000:0000:0001]:1234/foobar").domain
'[FEC0'
>>> tldextract.extract("https://[FEC0:0000:0000:0000:0000:0000:0000:0001]:1234/foobar").fqdn
''

This was tested using tldextract version 3.2.1 on Python 3.9.12

john-kurkowski commented 2 years ago

Related: #74, #156

john-kurkowski commented 2 years ago

Agreed that those IPv6 outputs are garbage, like returning '[' or '[FEC0' for the domain. IPv6 has come up a couple times before. This library supports IPv4. It seems not too much extra work to support IPv6. I'm for it. 👍

john-kurkowski commented 2 years ago

Re: IPv4, your quoted output is intended.

Think of domain as the body of the IP address's ExtractResult; an IP has no subdomain or suffix.

fqdn is a computed property, "if there is a proper domain/suffix," or else it's the falsy, empty string you observed. If the extraction is an IP, do you think it makes sense for the computed property to contain some other (fully qualified domain) "name" for a number?

ohad-ivix commented 2 years ago

Ah, I was just naively assuming that "domain" < "fully-qualified domain". My bad for taking a random code snippet from SE and not really reading the docstring 😅. I did read the README, so maybe adding the examples there would have helped in my case…

I guess I wouldn't change how fqdn behaves in the case of IPs either. It's probably best to think of localhost as interchangeable with IPv4 127.0.0.1 or IPv6 [::1], so the empty string would be right here too.

ohad-ivix commented 2 years ago

What I needed in my use case was a way to get just the hostname (without user credentials or port number) that would work for all URLs, and fqdn kinda looked like what I wanted. Currently I'm doing post-processing on the result of urllib.parse.urlsplit instead.

Reading the code, netloc seems like what I was looking for, but I'm not sure if exposing it as a property is really in the mission statement for this library.

john-kurkowski commented 2 years ago

I hear you on the README examples. 👍

What I needed in my use case was a way to get just the hostname (without user credentials or port number) that would work for all URLs … Currently I'm doing post-processing on the result of urllib.parse.urlsplit instead.

Yup, urllib.parse.urlsplit(some_url).netloc is what you're looking for in the standard library. In this library, domain is equivalent for IPv4 (or, more generally, rejoining the entire tuple). Just need a fix for IPv6 to be treated the same.