robinst / linkify

Rust library to find links such as URLs and email addresses in plain text, handling surrounding punctuation correctly
https://robinst.github.io/linkify/
Apache License 2.0
206 stars, 12 forks

final dot is stripped from link leading to 404 in `lychee` #57

Status: Closed (closed by soredake 1 year ago)

soredake commented 1 year ago

Originally reported here: https://github.com/lycheeverse/lychee/issues/940

Affected links for testing:

https://ru.wikipedia.org/wiki/%D0%9F%D0%BE%D1%81%D0%BB%D0%B5_%D0%B4%D0%BE%D0%B6%D0%B4%D0%B8%D1%87%D0%BA%D0%B0,_%D0%B2_%D1%87%D0%B5%D1%82%D0%B2%D0%B5%D1%80%D0%B3...

https://archive.org/details/23the.amazing.spiderman.the.deadly.dust.

robinst commented 1 year ago

linkify is for extracting links out of plain text, where you normally wouldn't want to include a trailing ".". That's the whole point of this library: handling ambiguous cases the way humans would normally expect. If we changed the behavior for this case, we'd get other bug reports saying "linkify shouldn't include trailing dots".

Note that GitHub behaves the same way here: https://archive.org/details/23the.amazing.spiderman.the.deadly.dust.

I think the fix needs to be in lychee itself, like this: if lychee gets a 404 and the extracted link was followed by a lone ".", it should also try the link including the dot. If that variant is a valid link, accept it. In other words, try both variants, and if either one is fine, pass the check.
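The fallback described above could be sketched roughly as follows. This is not lychee's actual code. The names `followed_by_lone_dot`, `link_is_ok`, and `fake_status` are hypothetical, the HTTP request is replaced by a plain function that returns a status code, and "lone" is read here as "exactly one dot, not `..`", which is an assumption about the intended meaning.

```rust
/// Returns true if the text right after the extracted link is a single "."
/// (assumption: "lone" means the dot is not followed by another dot).
fn followed_by_lone_dot(text: &str, link_end: usize) -> bool {
    let rest = &text[link_end..];
    rest.starts_with('.') && !rest.starts_with("..")
}

/// Checks the link at `text[start..end]`. On a 404, and only if the link was
/// followed by a lone dot, retries once with the dot included.
/// `status_of` is a stand-in for a real HTTP request (hypothetical).
fn link_is_ok<F: Fn(&str) -> u16>(text: &str, start: usize, end: usize, status_of: F) -> bool {
    let status = status_of(&text[start..end]);
    if status != 404 {
        // Anything other than 404 is decided by the first request alone.
        return (200..300).contains(&status);
    }
    if followed_by_lone_dot(text, end) {
        // '.' is one byte in UTF-8, so `end + 1` is a valid boundary.
        let with_dot = &text[start..end + 1];
        return (200..300).contains(&status_of(with_dot));
    }
    false
}

/// Fake server for the demo: only the URL ending in "dust." exists.
fn fake_status(url: &str) -> u16 {
    if url.ends_with("dust.") { 200 } else { 404 }
}

fn main() {
    let text = "See https://archive.org/details/23the.amazing.spiderman.the.deadly.dust. for details";
    let start = text.find("https://").unwrap();
    // Offset where an extractor that strips the final dot would end the link.
    let end = text.find(". for").unwrap();
    println!("link passes check: {}", link_is_ok(text, start, end, fake_status));
}
```

Retrying only after a 404 keeps the common case at a single request, and the extractor's output itself stays unchanged, which matches the suggestion that linkify's behavior should not be altered.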