gregjacobs / Autolinker.js

Utility to Automatically Link URLs, Email Addresses, Phone Numbers, Twitter handles, and Hashtags in a given block of text/HTML

It doesn't work when the URL contains non-English chars #89

feridmovsumov closed this issue 8 years ago

feridmovsumov commented 9 years ago

For example this url: http://en.wikipedia.org/wiki/Mustafa_Kemal_Atatürk

nerdegutt commented 9 years ago

Having a Nordic website, I have the same issue with our characters, i.e. æøåÆØÅ. I've managed to fix it by hacking your regexes, but that isn't a proper solution (it would mean adding hundreds of chars to support every language). Could a solution be to let us add a set of legal URL characters in the config for our particular site?

thatkookooguy commented 9 years ago

@nerdegutt That sounds like a great solution to me. I guess accepting an array of extra letters to enable on top of the default ones (since special characters usually appear in addition to English characters in links) could help support other languages.

For example, some Israeli sites use Hebrew letters, like this one: http://www.המדריך.co.il/

newpen commented 8 years ago

@Thatkookooguy That's not a solution for Chinese links, or for links in languages that don't have letters (most Asian languages).

simison commented 8 years ago

Here are some thoughts — especially this comment.

Also https://github.com/joelarson4/CharFunk

...or simply could the link just go on until there's whitespace?

eteeselink commented 8 years ago

Hi guys!

I've spent some hours giving this a go and it got me seriously nerd-sniped. I'd love it if some of you could spare some time to share your opinion.

So, here's the deal: Autolinker actually does it right. URLs aren't allowed to contain ü or 河, only a-zA-Z0-9 and some punctuation. However, browsers got smart, and automatically URL-encode any non-RFC characters people enter. Try opening https://zh.wikipedia.org/wiki/淡水河 and check in your browser's devtools what address actually got requested.
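You don't even need devtools: JavaScript's encodeURI() applies the same percent-encoding the browser does, so you can see the transformation directly:

    // The browser percent-encodes the UTF-8 bytes of each non-ASCII
    // character before sending the request:
    var typed = 'https://zh.wikipedia.org/wiki/淡水河';
    console.log(encodeURI(typed));
    // -> https://zh.wikipedia.org/wiki/%E6%B7%A1%E6%B0%B4%E6%B2%B3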

attempt 1

Currently, the regex that matters here looks like this:

urlSuffixRegex = /[\-A-Za-z0-9+&@#\/%=~_()|'$*\[\]?!:,.;]*[\-A-Za-z0-9+&@#\/%=~_()|'$*\[\]]/;

(here)

I figured we could just replace A-Za-z0-9 with some Unicode character classes that contain all characters, in all languages, that are considered "alphanumeric". Turns out this question is not so easy to answer! The SO comment @simison links to contains such a list, but how do you tell whether it makes any sense at all? I've tried to dig into this but haven't found a conclusive answer yet.
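For the record, here's roughly what that would look like with the XRegExp library, which supports Unicode property escapes (\p{L} matches any letter, \p{Nd} any decimal digit, in any script). This is a sketch, not what I actually committed:

    var XRegExp = require('xregexp');

    // Same shape as the current urlSuffixRegex, but with A-Za-z0-9
    // swapped for the Unicode letter and decimal-digit classes:
    var urlSuffixRegex = XRegExp(
        "[\\-\\p{L}\\p{Nd}+&@#/%=~_()|'$*\\[\\]?!:,.;]*" +
        "[\\-\\p{L}\\p{Nd}+&@#/%=~_()|'$*\\[\\]]"
    );

    XRegExp.test('wiki/Mustafa_Kemal_Atatürk', urlSuffixRegex); // -> true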

attempt 2

But then I realized that @simison's second suggestion might actually make sense: just allow anything until whitespace, minus trailing stuff like "?" or ")", so that people can naturally embed the links in text. Thus, the regex became a lot simpler:

urlSuffixRegex = /\S*[^\s?!:,.;]/;

This works great, and all tests pass (including the new ones I added with some international URLs).
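To make the trimming behavior concrete:

    // Punctuation from the excluded set is only dropped when it's the
    // very last character before whitespace (or end of string):
    var urlSuffixRegex = /\S*[^\s?!:,.;]/;

    'wiki/淡水河?'.match(urlSuffixRegex)[0];    // -> 'wiki/淡水河'
    'wiki/淡水河?x=1'.match(urlSuffixRegex)[0]; // -> 'wiki/淡水河?x=1'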

However, not everybody uses ? for a question mark. For example, Greek uses a question mark that looks exactly like a semicolon. We should chop that off too. But not just any punctuation, because in practice all kinds of other punctuation make perfect sense inside a URL.

So, what to do? I'm a bit stuck here.

side note

I think adding an option with a list of "allowed extra characters" does not cut it. Autolinker should be usable on sites that target the entire world.

simison commented 8 years ago

It's perfectly valid to have ?! in a URL, e.g.:

eteeselink commented 8 years ago

Indeed, and both regexes in my post allow that, as does the variation where we replace a-zA-Z0-9 with a whole bunch of Unicode character classes.

The issue is heuristics. http://www.google.com?! is a perfectly valid URL, but ?! appears much more commonly in human-written text like "Did you try http://www.google.com?!", and in that case you don't want the ?! included in the link. One of the core features of Autolinker is that it gets this right: it parses www.google.com?! into www.google.com and leaves the ?! outside the link, but www.google.com?!bla=zork is linked entirely.
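That is, via Autolinker.link() (the exact anchor text depends on options like stripPrefix, so treat the output as illustrative):

    var Autolinker = require('autolinker');

    Autolinker.link('Did you try http://www.google.com?!');
    // -> 'Did you try <a href="http://www.google.com">google.com</a>?!'

    Autolinker.link('See http://www.google.com?!bla=zork');
    // -> the whole URL, ?!bla=zork included, ends up inside the <a> tag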

So if we make Autolinker support other languages than English, I guess we need the same heuristics: "assume that most punctuation is a valid part of a URL, but let's exclude punctuation just before whitespace, if it's the kind of punctuation that commonly appears at the end of sentences". I have no idea what kind of punctuation people use to end sentences worldwide, and I strongly doubt there's a list of that somewhere.
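If such a list existed, wiring it in would be mechanical. Here's a sketch with a hypothetical (and surely incomplete) character list, just to show the shape of the fix:

    // \u037E is the Greek question mark (looks like ';'); \u3002 is the
    // ideographic full stop used in Chinese and Japanese. A real list
    // would need many more entries, which is exactly the open question.
    var sentenceEndChars = '?!:,.;\u037E\u3002';
    var urlSuffixRegex = new RegExp('\\S*[^\\s' + sentenceEndChars + ']');

    'www.google.com\u037E'.match(urlSuffixRegex)[0]; // -> 'www.google.com'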

So, ideas? :-)

simison commented 8 years ago

Making it an exception not to include ?! in the URL when they're at the end of a string sounds interesting, but I'd like to point out that it's still perfectly valid there.

You could have URLs like http://foo.bar?question=What+if?

So it boils down to whether we'd expect to see more strings like "are you on twitter.com?" or more valid URLs with question marks at the end of a string. Either way we'd produce broken links one way or the other. :-)

That said, including the question mark in the URL doesn't usually break the link; it would just look funny but work. Then only the ! URLs would actually be broken:

twitter.com? twitter.com!

eteeselink commented 8 years ago

Good point! However, I think Autolinker was designed like this on purpose. I'd guess there are a lot more sentences that end with URLs than there are URLs that end with punctuation. This is why most email programs and apps (including GitHub right here in the issues) do it this way.

So as far as I can tell, it becomes a choice between:

  1. Support international URLs, but drop support for sentences that end with URLs
  2. Support international URLs, but drop support for ASCII URLs in sentences that end with non-English punctuation (e.g. http://www.google.com; where that ; is a Greek question mark, not a semicolon)
  3. Somehow find all possible punctuation that ends sentences, anywhere
  4. Just drop this attempt altogether.

I prefer 3 but don't know how. I'm also quite OK with 2, but in all honesty I don't know the implications. Any opinions?

simison commented 8 years ago

@gregjacobs whaddyathink?

simison commented 8 years ago

Just stumbled upon the url-regex package; it works for non-Latin alphabets, too.

Here: https://github.com/regexps/url-regex/blob/master/index.js

eteeselink commented 8 years ago

They basically do what I called "Attempt 2" above:

https://github.com/regexps/url-regex/blob/master/index.js#L14

So, they match any non-whitespace for the path component. They don't deal with trailing punctuation at all: if you give url-regex a string like "Did you try http://github.com?" it'll match http://github.com?. Autolinker avoids this and matches http://github.com. I believe Autolinker does it right here.
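To illustrate, with url-regex's default (non-exact) mode:

    var urlRegex = require('url-regex');

    'Did you try http://github.com?'.match(urlRegex());
    // -> ['http://github.com?']  (the trailing ? comes along for the ride)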

gregjacobs commented 8 years ago

Hey guys. So I finally got around to attempting to implement this. After much research, I ended up allowing the set of "letter or number" characters from Unicode in any language (given by the XRegExp \p{L} and \p{Nd} escapes), along with the regular set of special characters allowed in a URL. Hopefully this does the trick! Added in 0.24.0.
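So with 0.24.0+, the examples from this thread should just work. A quick sanity check (the exact anchor markup depends on your options, so this is illustrative):

    var Autolinker = require('autolinker');

    Autolinker.link('See http://en.wikipedia.org/wiki/Mustafa_Kemal_Atatürk');
    // -> the full URL, ü included, is wrapped in an <a> tag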

eteeselink commented 8 years ago

@gregjacobs awesome! https://github.com/gregjacobs/Autolinker.js/blob/4f279e15bf8cb7279d6566c03a279af08ee92051/src/RegexLib.js#L29 is the line I couldn't figure out / find anywhere.

Thanks man :-)

feridmovsumov commented 8 years ago

Thank you :+1: