vmg / rinku

Autolinking. Ruby. Yes, that's pretty much it.
ISC License

Rinku slices UTF-8 NO-BREAK SPACE to pieces #39

Closed Peeja closed 8 years ago

Peeja commented 10 years ago

NO-BREAK SPACE is Unicode code point U+00A0. In UTF-8, it's encoded as the two bytes C2 A0. When those bytes come at the end of a URL, Rinku splits them, making C2 part of the URL and leaving A0 in the text after the link. The result is invalid UTF-8.

Rinku.auto_link("http://google.com/\xC2\xA0")
# => "<a href=\"http://google.com/\xC2\">http://google.com/\xC2</a>\xA0"
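
A minimal sketch to confirm that the result above is invalid UTF-8, using only core Ruby so it runs without Rinku installed. The constant names and the `hex_bytes` helper are illustrative and not part of Rinku; `REPORTED_OUTPUT` is the string from the output above, copied byte for byte.

```ruby
# NO-BREAK SPACE (U+00A0) as a proper UTF-8 string.
NBSP = "\u00A0"

# The output reported above, with the two bytes of the character split
# across the link boundary. force_encoding makes the UTF-8 tag explicit.
REPORTED_OUTPUT =
  "<a href=\"http://google.com/\xC2\">http://google.com/\xC2</a>\xA0"
    .dup.force_encoding(Encoding::UTF_8)

# Hypothetical helper: render a string's bytes as upper-case hex.
def hex_bytes(str)
  str.bytes.map { |b| format("%02X", b) }.join(" ")
end

puts hex_bytes(NBSP)                  # C2 A0
puts NBSP.valid_encoding?             # true
puts REPORTED_OUTPUT.valid_encoding?  # false

# Rejoining the two stray bytes gives back the original character.
puts "\xC2\xA0".dup.force_encoding(Encoding::UTF_8) == NBSP  # true
```

`String#valid_encoding?` is a quick way to check any autolinked output for this kind of split before it reaches a template or a database.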

Oddly, this does not happen with INVERTED EXCLAMATION MARK, the very next Unicode code point (U+00A1):

Rinku.auto_link("http://google.com/\xC2\xA1")
# => "<a href=\"http://google.com/¡\">http://google.com/¡</a>"

Here, Rinku has included the INVERTED EXCLAMATION MARK as part of the URL. I think it would be better to parse it as coming after the URL, but regardless, it doesn't split the bytes apart.


This is environment-dependent. I see it on my machine with Ruby 2.0.0-p353, but not on Heroku, which runs the same version:

Rinku.auto_link("http://google.com/\xC2\xA0")
# => "<a href=\"http://google.com/ \">http://google.com/ </a>"
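
Since the Ruby version is the same in both places, it may help to compare other details of the two environments. The sketch below prints a few of them using only core Ruby and `rbconfig`. Which settings actually matter here is an assumption: the report does not identify the cause, and the locale variables are included only as plausible candidates. `environment_snapshot` is a hypothetical helper name.

```ruby
require "rbconfig"

# Collect settings that could plausibly differ between two machines
# running the same Ruby version. The choice of fields is a guess.
def environment_snapshot
  {
    ruby_version: RUBY_VERSION,
    patchlevel: RUBY_PATCHLEVEL,
    platform: RUBY_PLATFORM,
    host_os: RbConfig::CONFIG["host_os"],
    default_external: Encoding.default_external.name,
    lang: ENV["LANG"],
    lc_all: ENV["LC_ALL"],
    lc_ctype: ENV["LC_CTYPE"]
  }
end

environment_snapshot.each do |key, value|
  puts "#{key}: #{value.inspect}"
end
```

Running this in both environments and diffing the output would narrow down what differs besides the Ruby version.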