jech / galene

The Galène videoconference server
https://galene.org
MIT License

URL regex is not considering punctuation #76

Open erdnaxe opened 3 years ago

erdnaxe commented 3 years ago
const urlRegexp = /https?:\/\/[-a-zA-Z0-9@:%/._\\+~#&()=?]+[-a-zA-Z0-9@:%/_\\+~#&()=]/g;

This regex does not always work. For example, this link is recognized correctly by the GitHub Markdown parser, but not by Galène:

We need a fairly complex regex, since we don't want to include trailing dots, <> characters, and so on. If I find a better URL regex, I will post it here.

erdnaxe commented 3 years ago

It turns out that the problem might not come from the regex itself but from the fact that the regex is applied to the non-encoded URL.

This is correctly parsed by Galène: https://example.com/Lettre%C3%80%C3%89lise

This is not correctly parsed: https://example.com/LettreÀÉlise
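To illustrate the difference, here is a minimal sketch that assumes Unicode property escapes (`\p{L}`, `\p{N}`) with the `u` flag are available in the target browsers. `unicodeUrlRegexp` and `findUrls` are hypothetical names, not part of Galène, and the pattern is simply the regex from the first comment with the ASCII ranges swapped out:

```javascript
// Regex from the first comment: the character classes only cover ASCII.
const urlRegexp = /https?:\/\/[-a-zA-Z0-9@:%/._\\+~#&()=?]+[-a-zA-Z0-9@:%/_\\+~#&()=]/g;

// Hypothetical variant: \p{L} matches any Unicode letter and \p{N} any
// Unicode number. The `u` flag is required for \p{...} to be recognized.
const unicodeUrlRegexp = /https?:\/\/[-\p{L}\p{N}@:%/._\\+~#&()=?]+[-\p{L}\p{N}@:%/_\\+~#&()=]/gu;

// Hypothetical helper: returns every URL-like substring found in `text`.
function findUrls(text) {
    return text.match(unicodeUrlRegexp) ?? [];
}

// "ÀÉ" is written with escapes so the example does not depend on how
// the file itself is encoded.
const sample = "https://example.com/Lettre\u00C0\u00C9lise";

console.log(sample.match(urlRegexp)[0]); // "https://example.com/Lettre"
console.log(findUrls(sample)[0] === sample); // true
```

The ASCII-only regex stops at the first accented letter, which matches the behaviour described above. This sketch only covers precomposed letters; combining marks fall under `\p{M}` and would need to be added to the classes if they should be accepted.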

jech commented 3 years ago

There's the encoding issue, which is due to the fact that I don't know how to do Unicode regexps in JavaScript. There's also the issue of punctuation: trailing punctuation must be left out of the link in cases like these:

I'd like you to check https://galene.org.

As mentioned on https://galene.org, Pion is great.

Pion (see https://pion.ly) is great.

But here the closing parenthesis is part of the URL:

Please see https://en.wikipedia.org/wiki/Silver_Streak_(film)

I need help with this.
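One possible approach, sketched under the assumption that a closing parenthesis belongs to the URL only when it has a matching opening parenthesis inside the URL: match candidates permissively, then trim trailing punctuation in a second step. `countChar`, `trimTrailingPunctuation`, and `extractUrls` are hypothetical helpers, not existing Galène code:

```javascript
// Counts how many times the single character `ch` occurs in `s`.
function countChar(s, ch) {
    return s.split(ch).length - 1;
}

// Strips sentence punctuation from the end of a candidate URL. A trailing
// ")" is stripped only if the candidate has more ")" than "(".
function trimTrailingPunctuation(candidate) {
    let url = candidate;
    while (url.length > 0) {
        const last = url[url.length - 1];
        if (".,;:!?".includes(last)) {
            url = url.slice(0, -1);
        } else if (last === ")" && countChar(url, ")") > countChar(url, "(")) {
            url = url.slice(0, -1);
        } else {
            break;
        }
    }
    return url;
}

// Finds candidates with a deliberately permissive pattern (anything up to
// whitespace, "<", ">" or a double quote), then trims each one.
function extractUrls(text) {
    const candidates = text.match(/https?:\/\/[^\s<>"]+/g) ?? [];
    return candidates.map(trimTrailingPunctuation);
}

console.log(extractUrls("Pion (see https://pion.ly) is great."));
// [ "https://pion.ly" ]
console.log(extractUrls("Please see https://en.wikipedia.org/wiki/Silver_Streak_(film)"));
// [ "https://en.wikipedia.org/wiki/Silver_Streak_(film)" ]
```

This is a heuristic rather than a full URL parser: a URL that really ends in a dot or an unbalanced parenthesis would be truncated. Because the candidate pattern is not restricted to ASCII, it also accepts non-encoded characters such as those in https://example.com/LettreÀÉlise.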

erdnaxe commented 3 years ago

Found this StackOverflow post with links to some interesting libraries: https://stackoverflow.com/questions/37684/how-to-replace-plain-urls-with-links/21925491#21925491

We could use a library such as anchorme.js, which seems to be rather accurate but adds a lot of code. Maybe we would rather have something smaller with lower accuracy? For example, do we need to check URLs against the IANA list? Do we need the list of all existing TLDs (https://github.com/alexcorvi/anchorme.js/blob/gh-pages/src/tlds.ts)?

For Unicode support, this library seems to handle it here: https://github.com/alexcorvi/anchorme.js/blob/gh-pages/src/dictionary.ts#L29

If we don't need all this extra verification, I might try to make a stripped-down, simpler fork of anchorme.js for Galène, as the code seems rather clean.

erdnaxe commented 3 years ago

I just noticed that my terminal emulator (Alacritty) matches URLs quite well. Looking at the code, it uses https://github.com/chrisduerr/rfind_url/, which consists of a single Rust file for matching URLs. It does not look that complex, but it's definitely more than just a simple regex.