openresty / lua-nginx-module

Embed the Power of Lua into NGINX HTTP servers
https://openresty.org/
11.31k stars 2.04k forks source link

ngx.escape_uri should escape RFC 3986 section 2.2 Reserved Characters #1124

Open hunterli opened 7 years ago

hunterli commented 7 years ago

$ python -c "import urllib; print urllib.quote(\"'\")" %27

$ resty -e "ngx.say(ngx.escape_uri(\"'\"))" ' ngx.escape_uri should escape "'" to %27 but not.

$resty -v resty 0.17 nginx version: openresty/1.11.2.3 built by gcc 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4) built with OpenSSL 1.0.2k 26 Jan 2017 TLS SNI support enabled configure arguments: --prefix=/usr/local/openresty/nginx --with-cc-opt=-O2 --add-module=../ngx_devel_kit-0.3.0 --add-module=../echo-nginx-module-0.60 --add-module=../xss-nginx-module-0.05 --add-module=../ngx_coolkit-0.2rc3 --add-module=../set-misc-nginx-module-0.31 --add-module=../form-input-nginx-module-0.12 --add-module=../encrypted-session-nginx-module-0.06 --add-module=../srcache-nginx-module-0.31 --add-module=../ngx_lua-0.10.8 --add-module=../ngx_lua_upstream-0.06 --add-module=../headers-more-nginx-module-0.32 --add-module=../array-var-nginx-module-0.05 --add-module=../memc-nginx-module-0.18 --add-module=../redis2-nginx-module-0.14 --add-module=../redis-nginx-module-0.3.7 --add-module=../rds-json-nginx-module-0.14 --add-module=../rds-csv-nginx-module-0.07 --with-ld-opt=-Wl,-rpath,/usr/local/openresty/luajit/lib --with-openssl=/tmp/openssl-1.0.2k --with-pcre=/tmp/pcre-8.39 --with-file-aio --with-http_addition_module --with-http_auth_request_module --with-http_dav_module --with-http_flv_module --with-http_geoip_module=dynamic --with-http_gunzip_module --with-http_gzip_static_module --with-http_image_filter_module=dynamic --with-http_mp4_module --with-http_random_index_module --with-http_realip_module --with-http_secure_link_module --with-http_slice_module --with-http_ssl_module --with-http_stub_status_module --with-http_sub_module --with-http_v2_module --with-http_xslt_module=dynamic --with-ipv6 --with-mail --with-mail_ssl_module --with-md5-asm --with-pcre-jit --with-sha1-asm --with-stream --with-stream_ssl_module --with-threads

agentzh commented 7 years ago

@hunterli Will you contribute a pull request for it? Thanks!

hunterli commented 7 years ago

@agentzh OK, I'll try it.

agentzh commented 7 years ago

@hunterli Great! Looking forward to it!

spacewander commented 7 years ago

I am afraid I don't think it is a bug.

According to the document, ngx.escape_uri is used to:

Escape str as a URI component.

Although its name is escape_uri.

Escape uri is different from escaping URI component. The second one doesn't need to escape '. Refer to RFC 3986 section 2.2:

A subset of the reserved characters (gen-delims) is used as delimiters of the generic URI components described in Section 3

Note that the RFC only requires the gen-delims part of reserved characters. Not all the reserved characters.

What urllib.quote does is more like escaping uri instead of escaping URI component. According to its document, urllib.quote is used to:

By default, this function is intended for quoting the path section of the URL.

Don't mix up these two differnt things.

hunterli commented 7 years ago

@spacewander FYI: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/encodeURIComponent said: To be more stringent in adhering to RFC 3986 (which reserves !, ', (, ), and *), even though these characters have no formalized URI delimiting uses, the following can be safely used

spacewander commented 7 years ago

@hunterli Could you point out where RFC 3986 says that !, ', (, ), and * are reserved for URI component?

spacewander commented 7 years ago

One place I found including !, ', (, ), and * is the D.2. Modifications, under Appendix D. Changes from RFC 2396:

The mark characters that are typically unsafe to decode, including the exclamation mark ("!"), asterisk ("*"), single-quote ("'"), and open and close parentheses ("(" and ")"), have been moved to the reserved set in order to clarify the distinction between reserved and unreserved and, hopefully, to answer the most common question of scheme designers.

However, it just says ' is reserved compared with RFC 2396, doesn't say it is reserved for URI component.

According to Appendix A. Collected ABNF for URI, ' is one of the sub-delims, which could be a part of query or fragment.

hunterli commented 7 years ago

@spacewander I test url encode function in php and lua, both of them output %27. So may be compatible with other programming language to escape sub-delims is a acceptable answer? (Sorry for poor English)

php -r "echo urlencode(\"test'test\");" test%27test FYI: http://lua-users.org/wiki/StringRecipes

URL-encode a string

function urlencode(str) if (str) then str = string.gsub (str, "\n", "\r\n") str = string.gsub (str, "([^%w %-%%.%~])", function (c) return string.format ("%%%02X", string.byte(c)) end) str = string.gsub (str, " ", "+") end return str
end

spacewander commented 7 years ago

Why not be compatible with Javascript? It doesn't escape ', too. And it doesn't strictly follow the RFC 3986, either. Do you like PHP more than Javascript? 😄

Another option is do what Go developers did: they just simply said:

we've accepted that we can't fully implement the RFC at this point for compatibility reasons.

Luckily, our previous implementation is more compatible with RFC 3986. Why should we break the compatibility with RFC 3986 and break the compatibility with former code at the same time?

If your target environment requires special encode rule, just do it in your code.

hunterli commented 7 years ago

@spacewander I "Try Go" in https://golang.org:

package main

import( "fmt" "net/url" )

func main() { fmt.Println(url.QueryEscape("'")) }

output:%27

hunterli commented 7 years ago

@spacewander In RFC 3986 2.2 Reserved Characters:

  reserved    = gen-delims / sub-delims

  gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

  sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
              / "*" / "+" / "," / ";" / "="

For "&" and "'" both in sub-delims, ngx.escape_uri encode "&", why not "'"?

spacewander commented 7 years ago

@hunterli

  1. Go's implementation doesn't follow the RFC 3986. And for backward compability, they don't want to fix it. See https://github.com/golang/go/issues/16127.
  2. The original ngx.escape_uri doesn't strictly follow the RFC 3986. However, escape other sub-delims will make it less compatible with the RFC 3986 and also break the backward compability.
hunterli commented 7 years ago

被你弄糊涂了,Sorry,中文直接上了,看不懂就放弃了。也可能是我理解错误,RFC说"Reserved Characters"都应该escape,"Reserved Characters"包含gen-delims和sub-delims,这样的话gen-delims和sub-delims都应该escape,否则就是"白马非马"了。

spacewander commented 7 years ago

RFC说"Reserved Characters"都应该escape

Could you point out where RFC 3986 says that all reserved characters should be escaped for URI component? Refer to Appendix A. Collected ABNF for URI, sub-delims could be a valid part of some URI components.

hunterli commented 7 years ago

URI producing applications should percent-encode data octets that correspond to characters in the reserved set unless these characters are specifically allowed by the URI scheme to represent data in that component.

这个是什么意思呀?通常的用户不是这样用么: http://example.com/query?test=ngx.escape("test=&'test") 如果"&","=","'"之类做为分隔符,就程序员自己放在ngx.escape函数外面吧?

Appendix A. 里面说:

query = *( pchar / "/" / "?" )

那么"&","="都不应该escape?能这样理解么? http://example.com/query?test=value_part1=value_part2&value_part3=foo test的值是:value_part1=value_part2&value_part3=foo 这样的URL是正确可解析的?很多WebServer应该都不这样理解吧。

monkeyDluffy6017 commented 1 year ago

Why does the openresty implementation not match nginx? Migrating from nginx can be confusing I won't discuss whose implementation follows rfc, but I think openresty should try to be consistent with nginx

For openresty:

image

For nginx:

image