robinst / linkify

Rust library to find links such as URLs and email addresses in plain text, handling surrounding punctuation correctly
https://robinst.github.io/linkify/
Apache License 2.0
206 stars 12 forks source link

Matches URLs with consecutive periods #41

Closed federicofusco closed 2 years ago

federicofusco commented 2 years ago

Love this library, although I found that it will match urls with consecutive periods in the domain (e.i https://www.example..com). I know that the RFCs are a mess with urls and (especially) emails so I was wondering if this is something to fix or done on purpose

mre commented 2 years ago

According to https://stackoverflow.com/a/27142527 a "url with many dots is valid. However a domain name with multiple consecutive dots is not valid since the length of each label has to be more than 0." So I'd say we should exclude them. I'm not a maintainer, though.

robinst commented 2 years ago

Yeah, we currently don't check much about the domain. There's a few issues that I think could all be solved by properly checking domains. Hopefully I can take a look at that soon.

robinst commented 2 years ago

So my initial thinking was: For http and https and similar schemes, require the host part to be valid according to DNS, i.e. reject things such as www.example..com. For other schemes, we'd leave it open. But having read the relevant RFC, https://datatracker.ietf.org/doc/html/rfc3986#page-21 is interesting:

This specification does not mandate a particular registered name lookup technology and therefore does not restrict the syntax of reg- name beyond what is necessary for interoperability. Instead, it delegates the issue of registered name syntax conformance to the operating system of each application performing URI resolution, and that operating system decides what it will allow for the purpose of host identification. A URI resolution implementation might use DNS, host tables, yellow pages, NetInfo, WINS, or any other system for lookup of registered names. However, a globally scoped naming system, such as DNS fully qualified domain names, is necessary for URIs intended to have global scope. URI producers should use names that conform to the DNS syntax, even when use of DNS is not immediately apparent, and should limit these names to no more than 255 characters in length.

Especially given the last sentence, I was thinking we could mandate DNS syntax for the authority part regardless of scheme.

But then I had a look through https://en.wikipedia.org/wiki/List_of_URI_schemes and found this example: facetime://+19995551234 which would then be rejected.

So maybe we do need to distinguish schemes, and only apply strict checking for some schemes (http, https, file, ftp, sftp ...?), and for plain domain names and email addresses.

mre commented 2 years ago

We could start with a conservative set of schemes (like http and https) and add more if we encounter missing edge cases.

However false positives are probably worse than false negatives when it comes to link detection. At the moment https://example..com gets detected and I'd argue that facetime URLs are less common. So maybe rejecting consecutive periods for all schemes and adding facetime as an exception is the way to go?

robinst commented 2 years ago

@mre and @federicofusco I pushed a PR overhauling domain parsing, see here: https://github.com/robinst/linkify/pull/43

It would be awesome if you could give it a try and check for regressions!

federicofusco commented 2 years ago

Thanks for the PR! I'll check out the changes soon.

robinst commented 2 years ago

The fix for this has been released as 0.9.0, see here: https://github.com/robinst/linkify/blob/main/CHANGELOG.md#090---2022-07-11