lycheeverse / lychee

⚡ Fast, async, stream-based link checker written in Rust. Finds broken URLs and mail addresses inside Markdown, HTML, reStructuredText, websites and more!
https://lychee.cli.rs
Apache License 2.0

Mastodon link is interpreted as email address #1462

Open almereyda opened 5 days ago

almereyda commented 5 days ago

When running Lychee on degrowth.net/organisations/instituto-resiliencia/index.html, it reports that the email address cannot be reached:

Errors in public/organisations/instituto-resiliencia/index.html

[ERR] instituto_resiliencia@15-15-15.social | Failed: Unreachable mail address: instituto_resiliencia@15-15-15.social: Invalid: Email doesn't exist or is syntactically incorrect

https://github.com/degrowth/idn-website/pull/23

This is a special case, since the string that is recognised is only used as the label for a hyperlink to https://15-15-15.social/@instituto_resiliencia. The link clearly does not point at a mailto: URI.

This is because of a design choice in how Mastodon addresses are displayed. The @ is shown directly before the string that is recognised as an email address, but it is not part of the link anchor.


<p>Mastodon: @<a href=https://15-15-15.social/@instituto_resiliencia target=_blank>instituto_resiliencia@15-15-15.social</a><br>Twitter: @<a href=https://twitter.com/ins_resiliencia target=_blank>ins_resiliencia</a></p>

The HTML in action:

Mastodon: @instituto_resiliencia@15-15-15.social
Twitter: @ins_resiliencia

It appears that check-if-email-exists does not recognise that the link points somewhere else. An outdated fork of it exists at github.com/lycheeverse/check-if-email-exists, which has fallen considerably behind github.com/reacherhq/check-if-email-exists. Possibly the whole file is searched with a regular expression matching an email pattern, without considering the semantics of the HTML.
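To illustrate the suspected cause, here is a minimal Python sketch, not Lychee's actual code. The simplified pattern `EMAIL_RE` and the helper `naive_email_scan` are hypothetical names made up for this example. It only shows that a plain-text regular expression search over the HTML above picks up the link label as if it were an email address:

```python
import re

# Simplified email pattern, for illustration only.
# This is an assumption, not the pattern Lychee or its dependencies use.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# The HTML snippet from the affected page.
HTML = (
    "<p>Mastodon: @<a href=https://15-15-15.social/@instituto_resiliencia "
    "target=_blank>instituto_resiliencia@15-15-15.social</a><br>"
    "Twitter: @<a href=https://twitter.com/ins_resiliencia "
    "target=_blank>ins_resiliencia</a></p>"
)


def naive_email_scan(text):
    """Return every substring that looks like an email address,
    ignoring whether it is link text, an href, or anything else."""
    return EMAIL_RE.findall(text)


matches = naive_email_scan(HTML)
print(matches)
```

In this sketch the only match is the anchor's label, even though neither `href` uses the `mailto:` scheme.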

What would be the best way to proceed from here?

almereyda commented 5 days ago

One way to circumvent this is to set include_mail to false.

But can the programmatic regression of not being able to distinguish HTML links to URLs from links to mailto: URIs be remediated?
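As a rough idea of what such a distinction could look like, here is a minimal Python sketch using only the standard library. It is not a proposal for Lychee's Rust code base, and `LinkCollector` and `classify_links` are hypothetical names. It assumes that an address should count as mail only when an anchor's `href` uses the `mailto:` scheme, and that the link text is ignored:

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

# The HTML snippet from the affected page.
HTML = (
    "<p>Mastodon: @<a href=https://15-15-15.social/@instituto_resiliencia "
    "target=_blank>instituto_resiliencia@15-15-15.social</a><br>"
    "Twitter: @<a href=https://twitter.com/ins_resiliencia "
    "target=_blank>ins_resiliencia</a></p>"
)


class LinkCollector(HTMLParser):
    """Collect anchor targets, split by URI scheme (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.mail_addresses = []
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        parts = urlsplit(href)
        if parts.scheme == "mailto":
            # For mailto: URIs the address is carried in the path component.
            self.mail_addresses.append(parts.path)
        else:
            self.urls.append(href)


def classify_links(html):
    """Return (mail_addresses, urls) based on href schemes only."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.mail_addresses, collector.urls


mails, urls = classify_links(HTML)
print(mails, urls)
```

With this approach the Mastodon label is never treated as a mail address, because only the `href` values are inspected. Plain-text email addresses outside of links would need separate handling, which this sketch leaves out.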