vmg / rinku

Autolinking. Ruby. Yes, that's pretty much it.
ISC License
599 stars 67 forks source link

Extract whitespace checking into helper method #51

Closed kerrizor closed 8 years ago

kerrizor commented 8 years ago

isspace will consider different sets of characters as whitespace depending on C locale. Usually this is fine, but the main problem character is a non-breaking space (Unicode: U+00A0). This abstracted helper method checks both isspace to be true OR for the data char to be a NBS (typed on an OSX US keyboard with option-space.)

tenderlove commented 8 years ago

I don't see anything technically wrong with this. 👍

kerrizor commented 8 years ago

@tenderlove thanks! Technically, it seemed straight-forward enough, but I appreciate the style feedback.

kerrizor commented 8 years ago

@vmg oh, I didn't bump the gem version, pending getting this settled, but will own either adding that here or doing a follow-on PR.

vmg commented 8 years ago

Nice start here, @kerrizor! You've got this building and you've fixed this real clean!

We have a bit of a small (actually really large :)) issue here though. I don't think @aroben's analysis was actually 100% correct like I told you initially. Which is a good thing, because this means this small bugfix is actually quite large, and it's gonna become a cool project and learning experience for you! Yey!

Here's the deal: the non-breaking space we're trying to fix here is outside the ASCII-7bit range. U+00A0 is the 160th codepoint, which is clearly outside the 7-bit (127+) range.

This means that if what you're passing onto Rinku is an ASCII string, NBsp is going to be encoded as a single byte (ISO/IEC 8859) with value A0. That's the byte against which you're comparing in this PR. And that's why it's working just fine for that case.

However: if the input string into Rinku is encoded as UTF8 (like all the strings from our App will be), you know what happens to characters outside of the 7-bit range... The non-breaking space will be split into a leading byte, and a continuation byte. In the raw byte stream, it'll look like 0xC2, 0xA0 (two bytes instead of one). Such are the wonders of UTF8.

This is dangerous though! The low byte in this case is still 0xA0, which is what your PR is testing against... But because of that, we'll be leaving the 0xC2 hanging around as part of the URL, and generate an autolinked string that is actually not valid UTF8!!! (the 0xC2 needs to be followed by a continuation byte).

Ouch! Encodings are so tricky. What are we going to do here? Well, easier said than done: we need to refactor Rinku so it understands UTF8 and performs autolinking at codepoint boundaries instead of at the byte-boundary like it does right now.

The good news is that this will also fix several long-standing bugs which arise from the fact that Rinku is not UTF8 aware. I'll pull the full list of bugs next week, but they're all essentially "Rinku looks at single bytes instead of UTF8 codepoints, so character X is actually read as Y".

It's a bit of a refactoring, but it shouldn't take more than a week to get right. I hope you'll find this an interesting exercise. I'm gonna be here throughout the week to pair and review your code.

Now, how are we gonna accomplish this? Well, the good news is that I already wrote a set of APIs that do just this for the CommonMark C Markdown parser (which also needed to be UTF-8 aware). You can find my implementation at https://github.com/jgm/cmark/blob/master/src/utf8.c (it's permissively licensed so you can just copy and paste it here).

You'll see we don't need the case folding code (so you can remove that from the file), and it already comes with very handy helper functions which we'll need (cmark_utf8proc_is_space and cmark_utf8proc_is_punctuation). The most important API, of course, will be cmark_utf8proc_iterate, which will read whole codepoints off a char * buffer like the one we already have in Rinku.

Porting Rinku to use these APIs is, of course, left as an exercise to the reader. Let's see how far you get on Monday. I'll be back on Tuesday to review and pair with you -- hopefully to wrap this up. Please do not get frustrated if you cannot finish this in a single day, or if you get stuck. I'll promise you're gonna get a really solid understanding of the problem after two pairing sessions.

kerrizor commented 8 years ago

@vmg thanks for the write up! I was wondering myself if this was the "right" way to handle the character (and really, any character outside the set..) so I'm glad I wasn't just being paranoid. :) I'll take a run at porting that code in and see where we end up.