Linkify punycode-encodes em-dash

robinst / linkify

Rust library to find links such as URLs and email addresses in plain text, handling surrounding punctuation correctly

https://robinst.github.io/linkify/

Apache License 2.0

207 stars 12 forks source link

Linkify punycode-encodes em-dash #28

Open alexwennerberg opened 2 years ago

alexwennerberg commented 2 years ago

Hi! Thanks for this library -- I use it in my new mailing list software to detect links in emails. Someone brought what appears to be a bug to my attention: https://lists.flounder.online/crabmail/threads/1beaffd2384b.html

Here's my code: https://git.alexwennerberg.com/crabmail/file/src/utils.rs.html#l22

I think that this could unambiguously be parsed, but I'm not 100% sure. What do you think?

robinst commented 2 years ago

Someone brought what appears to be a bug to my attention: https://lists.flounder.online/crabmail/threads/1beaffd2384b.html

Hmm that link doesn't load for me, could you provide a copy here?

alexwennerberg commented 2 years ago

Ah, sorry -- I shuffled things around a bit on my site. Here's the fixed link:

https://lists.flounder.online/crabmail/threads/BD094616-4ACC-47F9-BE79-6C61A66A76D7@paritybit.ca.html

robinst commented 2 years ago

I see. That's an interesting case, because — can currently be part of an URL, e.g. like this:

https://www.example.com/—

In that case, the whole text including em-dash would get linked.

Also note that GitHub behaves the same way here:

https://www.example.com— https://www.example.com/—

We could fix the case where it's part of the domain, see also #29 which has some discussion around that. But what would you expect with the case where it's part of the path?

alexwennerberg commented 2 years ago

I think that if it's part of the path it should be treated as such. I guess this is a broader question, whether this library should reject invalid TLDs? like:

https://lists.flounder.online/test/threads/YgYAU45J+dZURu1F@localhost.lan.html

I think that the tradeoffs that you've made with the library as written are reasonable though

robinst commented 2 years ago

I've reworked domain parsing in 0.9.0 (see https://github.com/robinst/linkify/blob/main/CHANGELOG.md#090---2022-07-11), but I haven't addressed this yet.

I think we could now do this by rejecting TLDs that contain non-alphanumeric Unicode characters. Note that there are TLDs that contain non-ASCII characters, see examples here (but they would be alphanumeric): https://en.wikipedia.org/wiki/Internationalized_country_code_top-level_domain