Percent-encode disallowed characters in URLs.

spladug commented 12 years ago

Characters outside the allowed range in a URL should be percent encoded according to RFC 3986. This patch adds a new function sdhtml_urlencode that percent encodes all characters outside the allowed set and entity-escapes ampersands.

The three other entity-escaped characters (<, >, and ") are covered by percent encoding and escaping ampersands in sdhtml_urlencode means that sdhtml_escape can be skipped for URL fields.

I considered not using snprintf but benchmarking against 500K of reddit's comments showed no noticeable performance hit for doing it this way. (It did, however, continue to show that this library is damned fast! :)

spladug commented 12 years ago

Just saw Houdini. Would that have been preferable? :)

vmg commented 12 years ago

Heh, I was just going to comment on that -- we're crafting Houdini with a lot of care to be fully RFC-compliant, and so far it's 100% compatible with the Ruby/Rails URL escaping methods.

I'm gonna roll with that (if anything to be consistent with all our escaping internally) and see how it goes. I'll keep you updated.

spladug commented 12 years ago

Awesome, look forward to it. Thanks for everything :)

vmg commented 12 years ago

Right on, I've just rolled out escaping using Houdini's escape_uri. Can you try it out and tell me how it's working for you?

The thing is that we tried to follow the RFC standard as closely as possible, and we're 100% compatible with Ruby's URI.escape, but for some strange reason, the hash # is considered an unsafe character in the standard, and as such gets escaped -- I don't know if this behavior is right. What do you think?

vmg commented 12 years ago

/cc @brianmario on this thing.

spladug commented 12 years ago

@tanoku, you're blazing fast :D

It appears this only URI escapes link hrefs; this means naked ampersands can sneak through since they're not entity-encoded.

What does escaping / do? It's kinda noisy.

Escaping # seems to work ok in limited testing.

vmg commented 12 years ago

OK, I'm not following you. Can you explain this in more detail? ^^

this means naked ampersands can sneak through since they're not entity-encoded.

Yes. I guess this means we need both HTML-level escaping and URL-level escaping. I.e. we first escape all the entities that cannot be represented in an URI (using the %xx URI escaping), and then we escape on top of that the characters that are valid in an URI, but could lead to JS injection in the HTML, using the HTML entity escaping &ent;. Is this right?

What does escaping / do? It's kinda noisy.

What?

Escaping # seems to work ok in limited testing.

No, not really... It gets URI-escaped to %23, which the browsers consider part of the URI and not an URI delimiter like it should be.

spladug commented 12 years ago

and then we escape on top of that the characters that are valid in an URI, but could lead to JS injection in the HTML, using the HTML entity escaping &ent;. Is this right?

It's not just JS injection, but valid markup. Particularly the naked &s which make dangling entity references. In the pre-houdini setup (e.g. the patch this pull request contains) the only character to be entity-encoded that wasn't already taken care of by percent encoding was the & so I just special cased it in the percent encoding function. But yes, that sounds correct.

What?

What does turning / into / achieve? I've never seen that done before.

No, not really...

Interesting. I threw it into Firefox with a percent-encoded # and it seemed to work just fine. Maybe just an artifact of how I tested.

vmg commented 12 years ago

What does turning / into / achieve? I've never seen that done before.

It's a recommended practice by the OWASP: the slash can be used to close HTML tags. Go figure. Either way, that's toggleable in Houdini (see the secure flag in houdini_escape_html0). Would you prefer we keep it disabled?

Interesting. I threw it into Firefox with a percent-encoded # and it seemed to work just fine. Maybe just an artifact of how I tested.

I've just double checked: https://github.com/tanoku/sundown/pull/71%23issuecomment-2047885

That URL should point to your comment, but obviously doesn't work because the hash is considered as part of the URI, not as an anchor separator. This is weird as fuck. I hate standards!

I think I'm just going to port your code to Houdini and put it back here again. Question: do you know of any other library calls that do this "half URI/half HTML entity" escaping for hrefs?

spladug commented 12 years ago

It's a recommended practice by the OWASP: the slash can be used to close HTML tags. Go figure. Either way, that's toggleable in Houdini (see the secure flag in houdini_escape_html0). Would you prefer we keep it disabled?

Ah, cool. Just curious. It makes sense to default to the most secure option, so I don't personally mind it being on by default.

This is weird as fuck. I hate standards!

;)

Question: do you know of any other library calls that do this "half URI/half HTML entity" escaping for hrefs?

None that I've seen. It's a bit of an ugly way to do it to be honest, but it just felt like such overkill to do a second pass over the string to replace one character. :)

vmg commented 12 years ago

Alright, check out the new version I've just pushed and tell me how it's looking. I think I'm handling all corner cases now. Cheers!

spladug commented 12 years ago

That looks great. Thanks a bunch :)

vmg / sundown

Percent-encode disallowed characters in URLs. #71