robinst / linkify

Rust library to find links such as URLs and email addresses in plain text, handling surrounding punctuation correctly
https://robinst.github.io/linkify/
Apache License 2.0
201 stars 12 forks source link

decimal matches link without scheme #46

Closed sqwishy closed 1 year ago

sqwishy commented 2 years ago

Hi I noticed a number with a decimal point is picked up as a link as long as it's preceded by whitespace.

not a link

1.0

is a link

 1.0

image

This seems like a false positive. Particularly because of the whitespace thing. Also I don't know why "1.0" would be a link but it also matches "1.2.3" regardless of whitespace and I'm not so sure about that one either.

edit okay so my understanding is something like 1.1 isn't really a valid address. it works sometimes ... like you can ping 1.1 and it will try pinging 1.0.0.1 because of a quirk in networking stacks ... but it's like an accident or something. ipcalc 1.1 will tell you "bad IPv4 address"; cool. But then url::Url::parse("http://1.1") parses that as a URL with 1.0.0.1 as the host. So now nothing makes sense.

robinst commented 1 year ago

Hey, thanks for reporting. That the difference depends on the whitespace is a bug for sure.

Apart from that, yeah maybe we should require 4 numbers for IP address like hostnames. Or maybe including IP addresses was a mistake for scheme-less URLs. We could exclude them and make it an option that can be enabled if it's really wanted.

What do you think?

mre commented 1 year ago

Not the issue creator, but I think that a separate feature for IP addresses would make the most sense.

like you can ping 1.1 and it will try pinging 1.0.0.1 because of a quirk in networking stacks ... but it's like an accident or something.

For a fun journey down the rabbit hole, read this article. The gist is that if an octet gets omitted, it is set to 0. Sometimes.

As per the article, these are all things that will ping 127.0.0.1:

So, yeah, we should stay on the more conservative end of the spectrum I guess.

Demanding 4 octets is what Python requires.

>>> import ipaddress
>>> ipaddress.IPv4Address('127.1')

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.7/ipaddress.py", line 1301, in __init__
    self._ip = self._ip_int_from_string(addr_str)
  File "/usr/lib/python3.7/ipaddress.py", line 1135, in _ip_int_from_string
    raise AddressValueError("Expected 4 octets in %r" % ip_str)
ipaddress.AddressValueError: Expected 4 octets in '127.1'

(By now I quoted almost the entire article, but I hope it helped.)

In summary, I'd go with exactly 4 octets and a separate feature for IP addresses.

sqwishy commented 1 year ago

We could exclude them and make it an option that can be enabled if it's really wanted.

IPv4 addresses shorter than four octets are surprising/unexpected for most people so they shouldn't be parsed by default.

But the original point was that 1.0 was not matched on its own but it did if it had a space before it. I don't have strong feelings about how addresses are parsed. I don't use that feature of linkify.

robinst commented 1 year ago

Changed in 0.10.0, IP addresses now require exactly 4 parts: https://github.com/robinst/linkify/blob/main/CHANGELOG.md#0100---2023-06-24