Closed spladug closed 12 years ago
Just saw Houdini. Would that have been preferable? :)
Heh, I was just going to comment on that -- we're crafting Houdini with a lot of care to be fully RFC-compliant, and so far it's 100% compatible with the Ruby/Rails URL escaping methods.
I'm gonna roll with that (if anything to be consistent with all our escaping internally) and see how it goes. I'll keep you updated.
Awesome, look forward to it. Thanks for everything :)
Right on, I've just rolled out escaping using Houdini's escape_uri
. Can you try it out and tell me how it's working for you?
The thing is that we tried to follow the RFC standard as closely as possible, and we're 100% compatible with Ruby's URI.escape
, but for some strange reason, the hash #
is considered an unsafe character in the standard, and as such gets escaped -- I don't know if this behavior is right. What do you think?
/cc @brianmario on this thing.
@tanoku, you're blazing fast :D
It appears this only URI escapes link hrefs; this means naked ampersands can sneak through since they're not entity-encoded.
What does escaping /
do? It's kinda noisy.
Escaping #
seems to work ok in limited testing.
OK, I'm not following you. Can you explain this in more detail? ^^
this means naked ampersands can sneak through since they're not entity-encoded.
Yes. I guess this means we need both HTML-level escaping and URL-level escaping. I.e. we first escape all the entities that cannot be represented in an URI (using the %xx
URI escaping), and then we escape on top of that the characters that are valid in an URI, but could lead to JS injection in the HTML, using the HTML entity escaping &ent;
. Is this right?
What does escaping / do? It's kinda noisy.
What?
Escaping # seems to work ok in limited testing.
No, not really... It gets URI-escaped to %23
, which the browsers consider part of the URI and not an URI delimiter like it should be.
and then we escape on top of that the characters that are valid in an URI, but could lead to JS injection in the HTML, using the HTML entity escaping &ent;. Is this right?
It's not just JS injection, but valid markup. Particularly the naked &s which make dangling entity references. In the pre-houdini setup (e.g. the patch this pull request contains) the only character to be entity-encoded that wasn't already taken care of by percent encoding was the & so I just special cased it in the percent encoding function. But yes, that sounds correct.
What?
What does turning /
into /
achieve? I've never seen that done before.
No, not really...
Interesting. I threw it into Firefox with a percent-encoded #
and it seemed to work just fine. Maybe just an artifact of how I tested.
What does turning / into / achieve? I've never seen that done before.
It's a recommended practice by the OWASP: the slash can be used to close HTML tags. Go figure. Either way, that's toggleable in Houdini (see the secure
flag in houdini_escape_html0
). Would you prefer we keep it disabled?
Interesting. I threw it into Firefox with a percent-encoded # and it seemed to work just fine. Maybe just an artifact of how I tested.
I've just double checked: https://github.com/tanoku/sundown/pull/71%23issuecomment-2047885
That URL should point to your comment, but obviously doesn't work because the hash is considered as part of the URI, not as an anchor separator. This is weird as fuck. I hate standards!
I think I'm just going to port your code to Houdini and put it back here again. Question: do you know of any other library calls that do this "half URI/half HTML entity" escaping for hrefs?
It's a recommended practice by the OWASP: the slash can be used to close HTML tags. Go figure. Either way, that's toggleable in Houdini (see the secure flag in houdini_escape_html0). Would you prefer we keep it disabled?
Ah, cool. Just curious. It makes sense to default to the most secure option, so I don't personally mind it being on by default.
This is weird as fuck. I hate standards!
;)
Question: do you know of any other library calls that do this "half URI/half HTML entity" escaping for hrefs?
None that I've seen. It's a bit of an ugly way to do it to be honest, but it just felt like such overkill to do a second pass over the string to replace one character. :)
Alright, check out the new version I've just pushed and tell me how it's looking. I think I'm handling all corner cases now. Cheers!
That looks great. Thanks a bunch :)
Characters outside the allowed range in a URL should be percent encoded according to RFC 3986. This patch adds a new function
sdhtml_urlencode
that percent encodes all characters outside the allowed set and entity-escapes ampersands.The three other entity-escaped characters (<, >, and ") are covered by percent encoding and escaping ampersands in
sdhtml_urlencode
means thatsdhtml_escape
can be skipped for URL fields.I considered not using
snprintf
but benchmarking against 500K of reddit's comments showed no noticeable performance hit for doing it this way. (It did, however, continue to show that this library is damned fast! :)