john-kurkowski / tldextract

Accurately separates a URL’s subdomain, domain, and public suffix, using the Public Suffix List (PSL).
BSD 3-Clause "New" or "Revised" License

1,2,3-octet/hexadecimal hostnames detected as IPv4 addresses #290

Closed by elliotwutingfeng 1 year ago

elliotwutingfeng commented 1 year ago

The following inputs are recognized as IPv4 addresses due to the use of socket.inet_aton().

1.1.1 -> domain parsed as 1.1.1
1.1 -> domain parsed as 1.1
1 -> domain parsed as 1 (output is still correct nonetheless)

The above is legacy behavior from UNIX's inet_aton for classful networks, a network addressing architecture made obsolete in 1993.

01.01.01.01 -> domain parsed as 01.01.01.01
01.01.01 -> domain parsed as 01.01.01
01.01 -> domain parsed as 01.01
01 -> domain parsed as 01 (output is still correct nonetheless)

0x1.0x1.0x1.0x1 -> domain parsed as 0x1.0x1.0x1.0x1
0x1.0x1.0x1 -> domain parsed as 0x1.0x1.0x1
0x1.0x1 -> domain parsed as 0x1.0x1
0x1 -> domain parsed as 0x1 (output is still correct nonetheless)
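
The inputs listed above can be checked directly against socket.inet_aton(). This is a minimal sketch; the helper name accepted_by_inet_aton is made up for illustration and is not part of tldextract, and the exact behavior depends on the platform's C library.

```python
import socket


def accepted_by_inet_aton(host: str) -> bool:
    """Return True if socket.inet_aton() parses host as an IPv4 address.

    Hypothetical helper for illustration only; not tldextract code.
    """
    try:
        socket.inet_aton(host)
        return True
    except OSError:
        return False


if __name__ == "__main__":
    for host in ("1.1.1", "1.1", "1", "01.01.01.01", "0x1.0x1", "0x1"):
        if accepted_by_inet_aton(host):
            # Show the dotted-quad address the short form expands to.
            expanded = socket.inet_ntoa(socket.inet_aton(host))
            print(f"{host!r} accepted, expands to {expanded}")
        else:
            print(f"{host!r} rejected")
```

On typical Unix systems the shorthand forms expand under the classful rules, for example 1.1.1 is read as 1.1.0.1, which is why they pass an inet_aton()-based check.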

Given that tldextract's regex-based ipv4() function only recognizes IPv4 addresses with 4 decimal octets without zero padding, this is probably a bug.
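
For comparison, a strict check of the kind described here (exactly 4 decimal octets, no zero padding) might look like the sketch below. The pattern and the names STRICT_IPV4_RE and is_strict_ipv4 are assumptions written for this illustration; they are not copied from tldextract's source.

```python
import re

# One decimal octet in 0..255 with no leading zeros (assumed pattern).
_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])"
STRICT_IPV4_RE = re.compile(rf"(?:{_OCTET}\.){{3}}{_OCTET}", re.ASCII)


def is_strict_ipv4(host: str) -> bool:
    """Return True only for 4 dot-separated decimal octets without padding."""
    # fullmatch() avoids accepting a trailing newline, which "$" would allow.
    return STRICT_IPV4_RE.fullmatch(host) is not None


if __name__ == "__main__":
    for host in ("1.1.1.1", "1.1.1", "01.01.01.01", "0x1.0x1.0x1.0x1"):
        print(host, is_strict_ipv4(host))
```

Under this sketch only 1.1.1.1 is accepted; the short, zero-padded, and hexadecimal forms are all rejected, which is the mismatch with inet_aton() that the report points out.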

It can be fixed by using socket.inet_pton() in looks_like_ip() instead of socket.inet_aton(). However, socket.inet_pton() is only documented as available on Unix, Unix-like, and Windows systems, and some of these systems do not provide it.

A more portable fix would be using ipaddress.IPv4Address, though it is much slower.
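
A rough sketch of the ipaddress-based variant follows. The function name is_ipv4_via_ipaddress is hypothetical, and the speed difference mentioned above is not measured here.

```python
import ipaddress


def is_ipv4_via_ipaddress(host: str) -> bool:
    """Return True if ipaddress.IPv4Address accepts host.

    Hypothetical helper for illustration only; not tldextract code.
    """
    try:
        ipaddress.IPv4Address(host)
        return True
    except ValueError:
        # AddressValueError is a subclass of ValueError.
        return False


if __name__ == "__main__":
    for host in ("1.1.1.1", "1.1.1", "1", "0x1.0x1.0x1.0x1"):
        print(host, is_ipv4_via_ipaddress(host))
```

ipaddress.IPv4Address requires exactly 4 decimal octets, so the 1-, 2-, and 3-octet and hexadecimal forms are rejected. Recent Python releases also reject octets with leading zeros, but older releases may not, so that case is worth testing on the versions a project supports.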

If suffix_index == len(labels) == 4, are there any edge cases not covered by IP_RE?

john-kurkowski commented 1 year ago

Thank you for the thorough report.

> It can be fixed by using socket.inet_pton() in looks_like_ip() instead of socket.inet_aton(). However, socket.inet_pton() is only documented as available on Unix, Unix-like, and Windows systems, and some of these systems do not provide it.
>
> A more portable fix would be using ipaddress.IPv4Address, though it is much slower.

Maybe try socket.inet_pton, and if it's unavailable for the system, fall back to ipaddress.IPv4Address?
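
One way that suggestion could be sketched is shown below. The function name looks_like_ipv4 is made up for this example, and this is not necessarily the fix that was merged into tldextract.

```python
import ipaddress
import socket


def looks_like_ipv4(maybe_ip: str) -> bool:
    """Check for an IPv4 address, preferring socket.inet_pton() when present.

    Hypothetical sketch of the suggested fallback; not tldextract code.
    """
    if hasattr(socket, "inet_pton"):
        try:
            socket.inet_pton(socket.AF_INET, maybe_ip)
            return True
        except (OSError, ValueError):
            # OSError: not a valid address. ValueError: e.g. embedded NUL.
            return False
    # Fallback for platforms where the socket module lacks inet_pton().
    try:
        ipaddress.IPv4Address(maybe_ip)
        return True
    except ValueError:
        return False


if __name__ == "__main__":
    for host in ("1.1.1.1", "1.1.1", "0x1.0x1.0x1.0x1", "example.com"):
        print(host, looks_like_ipv4(host))
```

The two branches may not agree on every input (zero-padded octets, for instance, could be treated differently by a platform's inet_pton() and by ipaddress), so a real implementation would probably want tests covering both paths.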