vmg / rinku

Autolinking. Ruby. Yes, that's pretty much it.
ISC License

Certain UTF-8 characters are picked up as link ending #50

Closed. Holek closed this issue 8 years ago

Holek commented 9 years ago

Say I have a Unicode link: https://pl.wikipedia.org/wiki/Komisja_śledcza_do_zbadania_sprawy_zarzutu_nielegalnego_wywierania_wpływu_na_funkcjonariuszy_policji,_służb_specjalnych,_prokuratorów_i_osoby_pełniące_funkcje_w_organach_wymiaru_sprawiedliwości

The characters I get for this string are:

>> content
=> "https://pl.wikipedia.org/wiki/Komisja_śledcza_do_zbadania_sprawy_zarzutu_nielegalnego_wywierania_wpływu_na_funkcjonariuszy_policji,_służb_specjalnych,_prokuratorów_i_osoby_pełniące_funkcje_w_organach_wymiaru_sprawiedliwości"
>> content.encoding
=> #<Encoding:UTF-8>
>> content.bytes
=> [104, 116, 116, 112, 115, 58, 47, 47, 112, 108, 46, 119, 105, 107, 105, 112, 101, 100, 105, 97, 46, 111, 114, 103, 47, 119, 105, 107, 105, 47, 75, 111, 109, 105, 115, 106, 97, 95, 197, 155, 108, 101, 100, 99, 122, 97, 95, 100, 111, 95, 122, 98, 97, 100, 97, 110, 105, 97, 95, 115, 112, 114, 97, 119, 121, 95, 122, 97, 114, 122, 117, 116, 117, 95, 110, 105, 101, 108, 101, 103, 97, 108, 110, 101, 103, 111, 95, 119, 121, 119, 105, 101, 114, 97, 110, 105, 97, 95, 119, 112, 197, 130, 121, 119, 117, 95, 110, 97, 95, 102, 117, 110, 107, 99, 106, 111, 110, 97, 114, 105, 117, 115, 122, 121, 95, 112, 111, 108, 105, 99, 106, 105, 44, 95, 115, 197, 130, 117, 197, 188, 98, 95, 115, 112, 101, 99, 106, 97, 108, 110, 121, 99, 104, 44, 95, 112, 114, 111, 107, 117, 114, 97, 116, 111, 114, 195, 179, 119, 95, 105, 95, 111, 115, 111, 98, 121, 95, 112, 101, 197, 130, 110, 105, 196, 133, 99, 101, 95, 102, 117, 110, 107, 99, 106, 101, 95, 119, 95, 111, 114, 103, 97, 110, 97, 99, 104, 95, 119, 121, 109, 105, 97, 114, 117, 95, 115, 112, 114, 97, 119, 105, 101, 100, 108, 105, 119, 111, 197, 155, 99, 105]
>> content.bytes.map(&:chr)
=> ["h", "t", "t", "p", "s", ":", "/", "/", "p", "l", ".", "w", "i", "k", "i", "p", "e", "d", "i", "a", ".", "o", "r", "g", "/", "w", "i", "k", "i", "/", "K", "o", "m", "i", "s", "j", "a", "_", "\xC5", "\x9B", "l", "e", "d", "c", "z", "a", "_", "d", "o", "_", "z", "b", "a", "d", "a", "n", "i", "a", "_", "s", "p", "r", "a", "w", "y", "_", "z", "a", "r", "z", "u", "t", "u", "_", "n", "i", "e", "l", "e", "g", "a", "l", "n", "e", "g", "o", "_", "w", "y", "w", "i", "e", "r", "a", "n", "i", "a", "_", "w", "p", "\xC5", "\x82", "y", "w", "u", "_", "n", "a", "_", "f", "u", "n", "k", "c", "j", "o", "n", "a", "r", "i", "u", "s", "z", "y", "_", "p", "o", "l", "i", "c", "j", "i", ",", "_", "s", "\xC5", "\x82", "u", "\xC5", "\xBC", "b", "_", "s", "p", "e", "c", "j", "a", "l", "n", "y", "c", "h", ",", "_", "p", "r", "o", "k", "u", "r", "a", "t", "o", "r", "\xC3", "\xB3", "w", "_", "i", "_", "o", "s", "o", "b", "y", "_", "p", "e", "\xC5", "\x82", "n", "i", "\xC4", "\x85", "c", "e", "_", "f", "u", "n", "k", "c", "j", "e", "_", "w", "_", "o", "r", "g", "a", "n", "a", "c", "h", "_", "w", "y", "m", "i", "a", "r", "u", "_", "s", "p", "r", "a", "w", "i", "e", "d", "l", "i", "w", "o", "\xC5", "\x9B", "c", "i"]

The "ą" character here is split into two bytes: ["\xC4", "\x85"]. Taken on its own, byte \x85 is the NEL (next line) control character in ISO-8859-1 (code point U+0085), so I assume it's being interpreted as a line break rather than as part of a multi-byte sequence.

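A quick way to check the byte-level claim in plain Ruby. This is only a sketch using core String methods, nothing from Rinku itself, and the variable names are made up for illustration. It shows that "ą" (U+0105) encodes to the bytes 0xC4 0x85, that a 0x85 byte by itself is not a valid UTF-8 string, and that the NEL control character U+0085 is encoded as 0xC2 0x85 in UTF-8.

```ruby
# "ą" is U+0105; written as an escape so the source file encoding does not matter.
a_bytes = "\u0105".bytes
p a_bytes.map { |b| format("0x%02X", b) }
# => ["0xC4", "0x85"]

# A lone 0x85 byte is a UTF-8 continuation byte, not a complete character.
lone = [0x85].pack("C*").force_encoding(Encoding::UTF_8)
p lone.valid_encoding?
# => false

# The NEL control character U+0085 has its own two-byte UTF-8 encoding.
nel_bytes = "\u0085".bytes
p nel_bytes.map { |b| format("0x%02X", b) }
# => ["0xC2", "0x85"]
```

So a scanner that compares individual bytes against 0x85 would match the second byte of "ą" even though no NEL character is present in the string.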

Maybe #39 is connected?

vmg commented 9 years ago

Hmpf. This is legit. Let me see if I can find some time this week to work on this -- I think Rinku is due for some UTF8 handling overhaul.

cantino commented 9 years ago

Rinku is generating invalid UTF-8 strings for us as well. We can give it UTF-8 inputs for which valid_encoding? returns true and receive outputs for which it returns false.

cantino commented 9 years ago

Here's a reproduction case:

require 'rinku'

# \u00A0 is a non-breaking space (code point 160), written as an escape so it survives copy and paste
content = "&lt;a href=&#39;http://gi.co&#39;&gt;gi.co&lt;/a&gt;\u00A0r"
p content[-2].codepoints
# => [160]
p content.valid_encoding?
# => true
p Rinku.auto_link(content).valid_encoding?
# => false
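
One possible explanation for the reproduction above, sketched in plain Ruby. This is an assumption about the failure mode, not Rinku's actual code: the non-breaking space U+00A0 is encoded as the bytes 0xC2 0xA0, and if a byte-oriented scanner treated 0xA0 as a delimiter and inserted markup at that position, the two halves of the character would end up separated. The variable names and the `</a>` insertion point are hypothetical.

```ruby
# Hypothetical illustration only: this is NOT Rinku's code.
text = "gi.co\u00A0r"
p "\u00A0".bytes
# => [194, 160]

# Split the raw bytes at the first 0xA0, as a byte-oriented scanner might
# if it treated that byte as a delimiter.
bytes = text.bytes
cut = bytes.index(0xA0)
head = bytes[0...cut].pack("C*").force_encoding(Encoding::UTF_8)
tail = bytes[cut..-1].pack("C*").force_encoding(Encoding::UTF_8)

p head.valid_encoding?
# => false
p tail.valid_encoding?
# => false

# Gluing the halves back together with markup in between keeps every byte,
# but the lead byte 0xC2 is no longer followed by its continuation byte.
rebuilt = head + "</a>" + tail
p rebuilt.valid_encoding?
# => false
```

If something like this is what happens, the fix would have to be in how delimiters are detected (decoding whole code points before classifying them), not in the caller.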