Closed · federicofusco closed this 2 years ago
According to https://stackoverflow.com/a/27142527 a "url with many dots is valid. However a domain name with multiple consecutive dots is not valid since the length of each label has to be more than 0." So I'd say we should exclude them. I'm not a maintainer, though.
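To make the "length of each label has to be more than 0" rule concrete, a minimal validity check could look like the sketch below. This is illustrative only, not linkify's actual code; the function name `has_valid_labels` is made up, and the length limits (63 octets per label, 255 per name) come from DNS conventions:

```rust
/// Sketch: a hostname is acceptable only if every dot-separated label
/// is non-empty and within DNS length limits. Note this also rejects a
/// trailing dot ("example.com."), which is a deliberate simplification.
fn has_valid_labels(host: &str) -> bool {
    !host.is_empty()
        && host.len() <= 255
        && host
            .split('.')
            .all(|label| !label.is_empty() && label.len() <= 63)
}

fn main() {
    assert!(has_valid_labels("www.example.com"));
    // Consecutive dots produce an empty label, so this is rejected.
    assert!(!has_valid_labels("www.example..com"));
}
```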
Yeah, we currently don't check much about the domain. There's a few issues that I think could all be solved by properly checking domains. Hopefully I can take a look at that soon.
So my initial thinking was: For `http` and `https` and similar schemes, require the host part to be valid according to DNS, i.e. reject things such as `www.example..com`. For other schemes, we'd leave it open. But having read the relevant RFC, this part of https://datatracker.ietf.org/doc/html/rfc3986#page-21 is interesting:
> This specification does not mandate a particular registered name lookup technology and therefore does not restrict the syntax of reg-name beyond what is necessary for interoperability. Instead, it delegates the issue of registered name syntax conformance to the operating system of each application performing URI resolution, and that operating system decides what it will allow for the purpose of host identification. A URI resolution implementation might use DNS, host tables, yellow pages, NetInfo, WINS, or any other system for lookup of registered names. However, a globally scoped naming system, such as DNS fully qualified domain names, is necessary for URIs intended to have global scope. URI producers should use names that conform to the DNS syntax, even when use of DNS is not immediately apparent, and should limit these names to no more than 255 characters in length.
Especially given the last sentence, I was thinking we could mandate DNS syntax for the authority part regardless of scheme.
But then I had a look through https://en.wikipedia.org/wiki/List_of_URI_schemes and found this example: `facetime://+19995551234`, which would then be rejected.
So maybe we do need to distinguish schemes, and only apply strict checking for some schemes (`http`, `https`, `file`, `ftp`, `sftp`...?), and for plain domain names and email addresses.
We could start with a conservative set of schemes (like `http` and `https`) and add more if we encounter missing edge cases.
However, false positives are probably worse than false negatives when it comes to link detection. At the moment `https://example..com` gets detected, and I'd argue that facetime URLs are less common. So maybe rejecting consecutive periods for all schemes and adding facetime as an exception is the way to go?
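That alternative, inverted logic (reject consecutive dots everywhere, with a small exception list) could be sketched like this. Again, names here are hypothetical, not the library's actual implementation:

```rust
/// Sketch: reject any authority containing consecutive dots, unless
/// the scheme is on a known exception list (e.g. facetime, whose
/// "authority" is a phone number rather than a hostname).
fn accept_authority(scheme: &str, authority: &str) -> bool {
    const EXCEPTIONS: &[&str] = &["facetime"];
    EXCEPTIONS.contains(&scheme) || !authority.contains("..")
}

fn main() {
    assert!(accept_authority("https", "example.com"));
    assert!(!accept_authority("https", "example..com"));
    // The facetime example from the Wikipedia list stays accepted.
    assert!(accept_authority("facetime", "+19995551234"));
}
```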
@mre and @federicofusco I pushed a PR overhauling domain parsing, see here: https://github.com/robinst/linkify/pull/43
It would be awesome if you could give it a try and check for regressions!
Thanks for the PR! I'll check out the changes soon.
The fix for this has been released as 0.9.0, see here: https://github.com/robinst/linkify/blob/main/CHANGELOG.md#090---2022-07-11
Love this library! I found, though, that it will match URLs with consecutive periods in the domain (e.g. `https://www.example..com`). I know that the RFCs for URLs and (especially) emails are a mess, so I was wondering whether this is something to fix or done on purpose.