s9e / TextFormatter

Text formatting library that supports BBCode, HTML and other markup via plugins. Handles emoticons, censors words, automatically embeds media and more.
MIT License

Autolink: add support for fullwidth and halfwidth punctuation #42

Closed JoshyPHP closed 8 years ago

JoshyPHP commented 8 years ago

Related to this: https://github.com/flarum/core/issues/1041

Also this: https://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms

Possible test case:

“极·致·轻”,没有如果,你就是英雄!车库源码(http://src.cool)与您共同进步!

JoshyPHP commented 8 years ago

@dcsjapan If you encounter real-world examples of non-ASCII punctuation that interferes with links, feel free to reopen this issue by posting them here.

dcsjapan commented 8 years ago

I'll keep an eye out for examples. It's hard to actively search for these things, because Google and the like tend to either ignore fullwidth symbols or lump them together with their halfwidth counterparts.

I am wondering if simply excluding fullwidth punctuation from links is a complete solution, though. It's possible there may be URLs which include such characters. I noticed Wikipedia has chosen to avoid the problem by insisting on halfwidth parentheses for disambiguation, e.g.

https://ja.wikipedia.org/wiki/サクラ_(曖昧さ回避)

https://zh.wikipedia.org/wiki/道_(消歧义)

But when I wondered whether they could've used fullwidth parens instead ... just limiting myself to Japanese for the sake of simplicity ... all I came up with was the short list of halfwidth symbols that can't be used in URLs:

\  '  |  `  ^  "  <  >  )  (  }  {  ]  [

So if we put Wikipedia's conventions aside for the moment, it seems that URLs such as the following should in theory be possible:

https://ja.wikipedia.org/wiki/サクラ(曖昧さ回避)

https://zh.wikipedia.org/wiki/道(消歧义)

And indeed, Wikipedia treats those as valid (albeit empty) pages.

That being the case, you'd need to test whether the "(" corresponding to each ")" came before or after the "http", or you run the risk of truncating the link. Likewise for other types of brackets. What's worse, you could presumably also have URLs including things like fullwidth commas, full stops, question marks, and exclamation points ... not to mention a wide variety of Eastern emoticons ... and there'd be no way to tell whether a given symbol marks the end of the URL or not.
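To make the bracket test concrete, here is a minimal Python sketch of the idea, not how TextFormatter's Autolink plugin actually works. It assumes a small, hypothetical table of bracket pairs and made-up helper names (`trim_url`, `find_urls`): a closing bracket ends the link only if its opener appeared before the "http"; if the opener is inside the link, the closing bracket is kept.

```python
import re

# Assumed bracket pairs (closer -> opener), fullwidth and halfwidth.
PAIRS = {")": "(", ")": "(", "]": "[", "]": "["}

URL_START = re.compile(r"https?://")


def trim_url(text: str, start: int) -> str:
    """Return the link starting at text[start], stopping at whitespace or at
    a closing bracket whose opener was left unmatched before the link."""
    # Count openers still unmatched in the text that precedes the link.
    open_before = {opener: 0 for opener in PAIRS.values()}
    for ch in text[:start]:
        if ch in open_before:
            open_before[ch] += 1
        elif ch in PAIRS and open_before[PAIRS[ch]] > 0:
            open_before[PAIRS[ch]] -= 1

    open_inside = {opener: 0 for opener in PAIRS.values()}
    end = start
    while end < len(text) and not text[end].isspace():
        ch = text[end]
        if ch in open_inside:
            open_inside[ch] += 1
        elif ch in PAIRS:
            opener = PAIRS[ch]
            if open_inside[opener] > 0:
                open_inside[opener] -= 1  # opener is inside the link: keep
            elif open_before[opener] > 0:
                break  # opener came before "http": the link ends here
        end += 1
    return text[start:end]


def find_urls(text: str) -> list[str]:
    """Apply trim_url at every "http://" or "https://" found in text."""
    return [trim_url(text, m.start()) for m in URL_START.finditer(text)]


wrapped = find_urls("车库源码(http://src.cool)与您共同进步!")
embedded = find_urls("see https://ja.wikipedia.org/wiki/サクラ(曖昧さ回避) for details")
print(wrapped)   # ['http://src.cool']
print(embedded)  # ['https://ja.wikipedia.org/wiki/サクラ(曖昧さ回避)']
```

This only covers paired brackets. A trailing fullwidth comma, full stop, or exclamation point has no opener to check against, so the sketch keeps it in the link, which is exactly the ambiguity described above.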