Closed mre closed 2 years ago
I wish the following would help when grepping email addresses:
According to https://datatracker.ietf.org/doc/html/rfc3696#section-3, an email address is of the form local-part@domain
.
local-part
contains ASCII printable characters except @ \ " , [ ]
:
! # $ % & ' * + - / = ? ^ _ ` . { | } ~
.
cannot be the first or last character of local-part
local-part
if it's quoted
/
, e.g. foo\@@example.com
local-part
can be quoted by double quotes, e.g. "foo bar@example.com
https://datatracker.ietf.org/doc/html/rfc3696#section-2 gives the restrictions on domain
part of an email address.
domain
consists of at least one label
separated by .
, e.g. foo.bar.com
.
.
is used to signify the last label
being the root (TLD), e.g. example.com.
.label
may contain only ASCII alphanumerical characters and -
.
label
may contain non-ASCII characters for i18n.labels
accept ASCII alphanumerical character, -
, and any other unicode characters except special characters, control characters, or punctuation characters.Length wise, section 2 and 3 of RFC 3696 says:
local-part
has a maximum length of 64 bytes.domain
has a maximum length of 255 bytes.
domain
has a maximum length of 63 bytes.For example, the following "URI" is not a valid email address:
github.com/golangci/golangci-lint/cmd/golangci-lint@v1.43.0
When parsing it, we found it's not quoted by "
. And we only see ASCII alphabetical characters, -
, and /
until encounter @
.
It also happens to be that the substring before @
is 51 bytes long and so it's okay to assume it's the local-part
of an email address.
Then it sees v1.43.0
which seems like the domain
part with three labels. Everything seems okay except that the last label, or the root, should be a TLD which cannot be all-numeric!.
Thus v1.43.0
cannot be the domain
part of an email address and the whole string is not a valid email address.
Thus v1.43.0 cannot be the domain part of an email address and the whole string is not a valid email address.
Hmm yeah. It might be worth looking at adding some code trying to detect this. Note that in the README, we explicitly state that we don't try to support IP addresses and quoting, so it's ok for us to exclude something like @127.0.0.1
.
So if we only support domain names, we can check if the last part of the domain (TLD) is numeric only, and if it is, reject it. Here's an extensive comment with references supporting that: https://stackoverflow.com/a/53875771/305973
Does anyone want to have a go at this?
More bugs:
A tried with the following cases one by one:
hi@example.org-
-> mailto:hi@example.org
and -
. (Incorrect, since domain label cannot ends with -
so this entire string is invalid)
hi@-example.org
-> hi@-example.org
(Correct)
hi@-example-.org
-> hi@-example-.org
(Correct)
hi.s@-example.org-
-> https://robinst.github.io/linkify/hi.s@-example.org-
(Incorrect, domain label cannot start or end with -
so the entire string is invalid)
Then I tried with putting them together:
hi@example.org-
hi@-example.org-
hi@-example.org
hi.s@-example.org-
->
mailto:hi@example.org
hi@-example.org-
hi@-example.org
hi.s@-example.org-
Note that this time the last line gets different results as it's no longer parsed as URL.
hi@example.org-
is linkified as hi@example.org-, which you could argue is similar to e.g. how hi@example.org)
is treated: The email is extracted and ends before the illegal character, and the extra character is treated as delimiter text. So I don't see a problem there. You could argue that we should reject it in the case of -
but I'm not sure about that.
hi.s@-example.org-
being linked is actually because the Allow URLs without scheme (e.g. example.com) option is turned on and it's recognized as a URL. If you turn that off, it's no longer recognized. That's definitely a bug. We should probably do stricter domain checking for both emails and scheme-less URLs.
A fix for this has been released with 0.9.0, see here: https://github.com/robinst/linkify/blob/main/CHANGELOG.md#090---2022-07-11
Thanks
In the
README.md
for lychee-action, there is a YAML file with the following content:lycheeverse/lychee-action@v1.1.1
. This gets detected as an email address. It's technically correct, because there is no upper limit on the number of dots in an address and a TLD is also not strictly required. However I'd argue that it's at least a surprising edge case.Python considers it valid
while PHP doesn't:
Would you say it makes sense to add a heuristic for this case? Otherwise I'd add a check over on lychee itself.