robinst / linkify

Rust library to find links such as URLs and email addresses in plain text, handling surrounding punctuation correctly
https://robinst.github.io/linkify/
Apache License 2.0
206 stars 12 forks source link

Github action package versions categorized as mail address #29

Closed mre closed 2 years ago

mre commented 2 years ago

In the README.md for lychee-action, there is a YAML file with the following content: lycheeverse/lychee-action@v1.1.1. This gets detected as an email address. It's technically correct, because there is no upper limit on the number of dots in an address and a TLD is also not strictly required. However I'd argue that it's at least a surprising edge case.

Python considers it valid

Python 3.9.9 (main, Nov 21 2021, 03:16:13)
[Clang 13.0.0 (clang-1300.0.29.3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from email.utils import parseaddr
>>> parseaddr('lycheeverse/lychee-action@v1.1.1')
('', 'lycheeverse/lychee-action@v1.1.1')

while PHP doesn't:

php -r "var_dump(filter_var('lycheeverse/lychee-action@v1.1.1', FILTER_VALIDATE_EMAIL));"
bool(false)

Would you say it makes sense to add a heuristic for this case? Otherwise I'd add a check over on lychee itself.

lebensterben commented 2 years ago

I wish the following would help when grepping email addresses:


According to https://datatracker.ietf.org/doc/html/rfc3696#section-3, an email address is of the form local-part@domain.


https://datatracker.ietf.org/doc/html/rfc3696#section-2 gives the restrictions on domain part of an email address.


Length wise, section 2 and 3 of RFC 3696 says:

lebensterben commented 2 years ago

For example, the following "URI" is not a valid email address:

github.com/golangci/golangci-lint/cmd/golangci-lint@v1.43.0

When parsing it, we found it's not quoted by ". And we only see ASCII alphabetical characters, -, and / until encounter @.

It also happens to be that the substring before @ is 51 bytes long and so it's okay to assume it's the local-part of an email address.

Then it sees v1.43.0 which seems like the domain part with three labels. Everything seems okay except that the last label, or the root, should be a TLD which cannot be all-numeric!.

Thus v1.43.0 cannot be the domain part of an email address and the whole string is not a valid email address.

robinst commented 2 years ago

Thus v1.43.0 cannot be the domain part of an email address and the whole string is not a valid email address.

Hmm yeah. It might be worth looking at adding some code trying to detect this. Note that in the README, we explicitly state that we don't try to support IP addresses and quoting, so it's ok for us to exclude something like @127.0.0.1.

So if we only support domain names, we can check if the last part of the domain (TLD) is numeric only, and if it is, reject it. Here's an extensive comment with references supporting that: https://stackoverflow.com/a/53875771/305973

Does anyone want to have a go at this?

lebensterben commented 2 years ago

More bugs:


A tried with the following cases one by one:

hi@example.org- -> mailto:hi@example.org and -. (Incorrect, since domain label cannot ends with - so this entire string is invalid)

hi@-example.org -> hi@-example.org (Correct)

hi@-example-.org -> hi@-example-.org (Correct)

hi.s@-example.org- -> https://robinst.github.io/linkify/hi.s@-example.org- (Incorrect, domain label cannot start or end with - so the entire string is invalid)


Then I tried with putting them together:

hi@example.org-
hi@-example.org-
hi@-example.org
hi.s@-example.org-

->

mailto:hi@example.org
hi@-example.org-
hi@-example.org
hi.s@-example.org-

Note that this time the last line gets different results as it's no longer parsed as URL.

robinst commented 2 years ago

hi@example.org- is linkified as hi@example.org-, which you could argue is similar to e.g. how hi@example.org) is treated: The email is extracted and ends before the illegal character, and the extra character is treated as delimiter text. So I don't see a problem there. You could argue that we should reject it in the case of - but I'm not sure about that.

hi.s@-example.org- being linked is actually because the Allow URLs without scheme (e.g. example.com) option is turned on and it's recognized as a URL. If you turn that off, it's no longer recognized. That's definitely a bug. We should probably do stricter domain checking for both emails and scheme-less URLs.

robinst commented 2 years ago

A fix for this has been released with 0.9.0, see here: https://github.com/robinst/linkify/blob/main/CHANGELOG.md#090---2022-07-11

lebensterben commented 2 years ago

Thanks