Closed szmarczak closed 4 months ago
@yusukebe, do you think that it is a bug?
Hi @szmarczak
The router doesn't need to decode the path. Instead, if you want to decode it, you can use the getPath
option for new Hono()
.
import { Hono } from 'hono'
const app = new Hono({
getPath: (req) => {
const url = req.url
const queryIndex = url.indexOf('?', 8)
const path = url.slice(url.indexOf('/', 8), queryIndex === -1 ? undefined : queryIndex)
return decodeURI(path)
},
})
I disagree. The params are decoded out-of-the-box (though in Request), I don't see why the path can't be:
Deno/Hono
import { Hono } from 'npm:hono';
const h = new Hono();
h.get('/:foobar', async (c) => c.text(c.req.param('foobar')));
Deno.serve({
port: 8080,
}, h.fetch);
Node
http.get({ host: 'localhost', port: 8080, path: '/'+encodeURIComponent('/') }, res => res.pipe(process.stdout));'';
Also, the benchmark for find-my-way
uses find
which does decoding.
@szmarczak
Thanks. Indeed, paths may need to be decoded. But I think it should not be done by the router but by getPath()
. We have made the method for that purpose. Regarding the benchmark, we can add a note on the page.
@usualoma What do you think about this?
should not be done by the router but by getPath()
There's no reason to make decoding opt-in. It's bad practice and bad DX.
Furthermore, the URI RFC says:
Percent-encoded octets must be decoded at some point during the dereference process. Applications must split the URI into its components and subcomponents prior to decoding the octets, as otherwise the decoded octets might be mistaken for delimiters.
so not decoding the components is actually against the spec.
I think the paths should be decoded by default, making decoding optional can lead to errors and poor developer experience (DX) because it introduces the possibility of inconsistent handling of URIs, I believe automatically decoding URI components ensures consistency and reduces the likelihood of bugs related to misinterpretation of encoded characters.
I think it should not be done by the router but by
getPath()
I think we can add an option in Hono
instance to disable decoding?
But default is enabled.
I considered it for a while and concluded that we should decode the path component before processing the routing.
As @yusukebe says, the place to do this processing is getPath()
. We need an implementation of getPath()
that is fast and does the decoding correctly.
@fzn0x @usualoma Thanks.
Let's implement getPath()
to decode a URL path correctly and make it the default.
@yusukebe, can I implement it?
the place to do this processing is getPath()
Why? There have been several comments saying that with no rationale behind.
@szmarczak Thanks for your comment.
Are you saying that you think the router should do this process?
We believe that modifying getPath()
would satisfy the specification. Please be specific about what you are objecting to.
@szmarczak
Oh, and to be clear, the "modifying getPath()
" we are talking about now refers to the following changes.
We want to implement this change in a less performance-degrading way.
diff --git a/src/utils/url.ts b/src/utils/url.ts
index b37a5539..7109a307 100644
--- a/src/utils/url.ts
+++ b/src/utils/url.ts
@@ -73,7 +73,7 @@ export const getPath = (request: Request): string => {
// Optimized: indexOf() + slice() is faster than RegExp
const url = request.url
const queryIndex = url.indexOf('?', 8)
- return url.slice(url.indexOf('/', 8), queryIndex === -1 ? undefined : queryIndex)
+ return decodeURI(url.slice(url.indexOf('/', 8), queryIndex === -1 ? undefined : queryIndex))
}
export const getQueryStrings = (url: string): string => {
Are you saying that you think the router should do this process?
Correct. For example, testing this behavior would be much easier. Otherwise, you'd have to test the entire Hono app to make sure the behavior is right.
We believe that modifying getPath() would satisfy the specification.
Indeed, however the router can be used separately outside of Hono. Making it a user responsibility would be a bad DX. For example, if I use a router then I'd expect it to parse paths for me without the need to do anything in that regards from my side. If this is true when reading path params, then it should be true for the rest of the path.
Please be specific about what you are objecting to.
Having said the above, I don't believe that making it a part of getPath
is a good idea.
We want to implement this change in a less performance-degrading way.
path.includes('%') ? decodeURI(path) : path
Also note that decodeURI
may throw if it contains non-UTF-8 data so perhaps it's good to wrap it in a safe function that returns unwrapped data instead.
@szmarczak Thank you! I think I understand your thoughts. It will be helpful.
I consider the following
The match()
method of Hono's router is responsible for taking a string and returning a result.
https://github.com/honojs/hono/blob/main/src/router.ts#L7-L11
If it is "receive a Request object and return a result", I think it should be decoded at the router, but since it is not, I feel comfortable with the API being "decode and pass it outside the router".
the router can be used separately outside of Hono
Yes, there may be such users.
But I would prefer to improve the DX for the users of Hono (or us, the developers) who implement their own router in Hono, rather than improve the DX for the users who "use the router outside of Hono". If I want to write a new router, I would appreciate it if it is decoded outside the router.
path.includes('%') ? decodeURI(path) : path
I think that snippet will meet the specification, but we need code that performs better. For example, the following code. (Whether it is decodeURI or decodeURIComponent is a separate discussion.)
export const getPath = (request: Request): string => {
const url = request.url
const start = url.indexOf('/', 8)
let i = start
for (; i < url.length; i++) {
const charCode = url.charCodeAt(i)
if (charCode === 37) {
// '%'
// If the path contains percent encoding, use `indexOf()` to find '?' and return the result immediately.
// Although this is a performance disadvantage, it is acceptable since we prefer cases that do not include percent encoding.
const queryIndex = url.indexOf('?', i)
return decodeURIComponent(url.slice(start, queryIndex === -1 ? undefined : queryIndex))
} else if (charCode === 63) {
// '?'
break
}
}
return url.slice(start, i)
}
The match() method of Hono's router is responsible for taking a string
The string consists of URI components.
If it is "receive a Request object and return a result", I think it should be decoded at the router, but since it is not
Whether the URL is extracted from Request
or a user supplies its own request.url
directly (guaranteed it's not decoded already) is irrelevant.
If I want to write a new router, I would appreciate it if it is decoded outside the router.
Then you can extract the custom decoding URI function into fastDecodeURI
and do path = fastDecodeURI(path)
in
and other routers too. IMO in that way DX for Hono developers is preserved.
I think that snippet will meet the specification, but we need code that performs better. For example, the following code.
I haven't benchmarked this but it may occur that for + charCodeAt
is slower than calling always indexOf('?')
and includes('%')
. (Well, after all it's JS and not C++ so often the code that "looks fast" is not). Feel free to check that, I'll be back in the morning as I'm off to sleep.
I think that both sides (me and Hono) have expressed their opinions thoroughly. Whether it's getPath
or router
is a just implementation detail of free choice. It doesn't matter for the Hono end user which way it's done. I don't think I have any other points to add. Thanks for consideration!
Regarding the benchmark, we can add a note on the page.
I think the benchmark should contain code that decodes the URI (using Hono's implementation) for routers that don't decode URIs. Otherwise the benchmark is untrue as it doesn't represent how the route is handled in a real life scenario.
@szmarczak Thanks. I understand that there are other opinions, but I still think that getPath()
should be modified.
I would like to refer to #2688 for discussion.
If anyone has an opinion, please comment.
Is this the expected routing result?
There may be disagreement as to whether http://localhost/users/hono%2Fposts
should match /users/:id
or /users/:id/posts
. However, leaving "%2F" here (by decodeURI()
) would complicate the "do not double-decode" process, so I think it is better to decode everything (by decodeURIComponent()
) here. The current v4.3.7, before the change, matches /users/:id
, so this is a spec change.
It's a bit of a quibble whether http://localhost/users/%73tatic
should match /users/static
, but it might slip through and match '/users/:id'
and c.req.param ('id')
returns "static"
, then there may be a security problem depending on the application specification, so I think it should match /users/static
. This is also a spec change.
I have benchmarked with benchmarks/utils/src/get-path.ts and I think this implementation will be faster in any runtime.
% bun run src/get-path.ts; deno run --allow-sys src/get-path.ts; npx tsx src/get-path.ts
cpu: Apple M2 Pro
runtime: bun 1.1.2 (arm64-darwin)
benchmark time (avg) (min … max) p75 p99 p999
--------------------------------------------------------------------------- -----------------------------
noop 45 ps/iter (0 ps … 22.16 ns) 41 ps 81 ps 122 ps !
• getPath
--------------------------------------------------------------------------- -----------------------------
slice + indexOf : w/o decodeURI 30.43 ns/iter (28.4 ns … 307 ns) 29.97 ns 36.03 ns 212 ns
regexp : w/o decodeURI 40.58 ns/iter (34.79 ns … 312 ns) 39.67 ns 53.65 ns 285 ns
slice + indexOf 63.21 ns/iter (56.68 ns … 343 ns) 63.99 ns 72.98 ns 280 ns
slice + for-loop + flag 23.35 ns/iter (21.81 ns … 313 ns) 22.66 ns 29.3 ns 211 ns
slice + for-loop + immediate return 23.12 ns/iter (21.67 ns … 235 ns) 22.66 ns 27.3 ns 188 ns
summary for getPath
slice + for-loop + immediate return
1.01x faster than slice + for-loop + flag
1.32x faster than slice + indexOf : w/o decodeURI
1.75x faster than regexp : w/o decodeURI
2.73x faster than slice + indexOf
cpu: Apple M2 Pro
runtime: deno 1.43.1 (aarch64-apple-darwin)
benchmark time (avg) (min … max) p75 p99 p999
--------------------------------------------------------------------------- -----------------------------
noop 698 ps/iter (0 ps … 977 ns) 0 ps 0 ps 0 ps !
• getPath
--------------------------------------------------------------------------- -----------------------------
slice + indexOf : w/o decodeURI 15.12 ns/iter (0 ps … 977 ns) 0 ps 977 ns 977 ns !
regexp : w/o decodeURI 49.17 ns/iter (0 ps … 977 ns) 0 ps 977 ns 977 ns !
slice + indexOf 45.34 ns/iter (0 ps … 977 ns) 0 ps 977 ns 977 ns !
slice + for-loop + flag 34.2 ns/iter (0 ps … 977 ns) 0 ps 977 ns 977 ns !
slice + for-loop + immediate return 31.73 ns/iter (0 ps … 977 ns) 0 ps 977 ns 977 ns !
summary for getPath
slice + indexOf : w/o decodeURI
2.1x faster than slice + for-loop + immediate return
2.26x faster than slice + for-loop + flag
3x faster than slice + indexOf
3.25x faster than regexp : w/o decodeURI
cpu: Apple M2 Pro
runtime: node v20.0.0 (arm64-darwin)
benchmark time (avg) (min … max) p75 p99 p999
--------------------------------------------------------------------------- -----------------------------
noop 71 ps/iter (20 ps … 234 ns) 82 ps 102 ps 143 ps !
• getPath
--------------------------------------------------------------------------- -----------------------------
slice + indexOf : w/o decodeURI 21.29 ns/iter (19.96 ns … 177 ns) 21.18 ns 30.03 ns 40.85 ns
regexp : w/o decodeURI 42.16 ns/iter (37.82 ns … 259 ns) 40.85 ns 105 ns 143 ns
slice + indexOf 34.05 ns/iter (31.74 ns … 146 ns) 34 ns 42.07 ns 125 ns
slice + for-loop + flag 27.9 ns/iter (26.12 ns … 304 ns) 27.73 ns 37.03 ns 140 ns
slice + for-loop + immediate return 27.39 ns/iter (25.45 ns … 192 ns) 27.02 ns 36.44 ns 153 ns
summary for getPath
slice + indexOf : w/o decodeURI
1.29x faster than slice + for-loop + immediate return
1.31x faster than slice + for-loop + flag
1.6x faster than slice + indexOf
1.98x faster than regexp : w/o decodeURI
URIError: URI malformed
Error handling is under consideration.
@usualoma decodeURIComponent
is incorrect. decodeURI
is correct. It must not normalize %2F
into /
(nor %3F
otherwise it will mistake it for a query).
localhost/foo%2Fbar/baz/ąę
consists of only three segment, not four. If the router accepted an array, it would look like this:
['foo/bar', 'baz', 'ąę']
However, if we're using strings, then to match this we need foo%2Fbar/baz/ąę
(note that ą
and ę
are decoded and not in their percent-encoded form).
Handle URIError: URI malformed
Ideally, we'd like to decode what we can, and malformed percent-encoded characters leave in their percent-encoded form. I'm almost certain that this would be much slower, therefore IMO it's ok to fallback to original not-decoded path.
@szmarczak Thanks for the comment!
(nor %3F otherwise it will mistake it for a query).
If you read the code, you will see that we extract the path component and then apply decodeURIComponent
, so there is no confusion with the query.
localhost/foo%2Fbar/baz/ąę consists of only three segment
Can you provide a reason to believe that the specification is correct? I am referring to RFC 3986, 2.3, which states that they should be decoded.
https://datatracker.ietf.org/doc/html/rfc3986#section-2.3
For consistency, percent-encoded octets in the ranges of ALPHA (%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E), underscore (%5F), or tilde (%7E) should not be created by URI producers and, when found in a URI, should be decoded to their corresponding unreserved characters by URI normalizers.
Also, /a%2Fb.html
is considered the same as /a/b.html
on typical web servers such as nginx.
Can you provide a reason to believe that the specification is correct?
I don't understand the question. The specification is always correct because it is the definition.
I am referring to RFC 3986, 2.3, which states that they should be decoded.
No, it does not state that. 2F
is not in the range of RFC 3986, 2.3
so you're incorrectly adhering to it.
Also, /a%2Fb.html is considered the same as /a/b.html on typical web servers such as nginx.
We're not talking about nginx
or any other servers here.
https://www.rfc-editor.org/rfc/rfc9110.html#name-https-normalization-and-com
Characters other than those in the "reserved" set are equivalent to their percent-encoded octets: the normal form is to not encode them (see Sections 2.1 and 2.2 of [URI]).
https://www.rfc-editor.org/rfc/rfc3986#section-2.2
2.2. Reserved Characters
reserved = gen-delims / sub-delims gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@" sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
I hope this clarifies. This means you must not decode %3F
into ?
while normalizing the path segment (normalization happens when we operate on paths like strings, if segments were arrays then decodeURIComponent
would be correct, like said in the comment above).
@szmarczak Sorry, you were right from the beginning on the following points you mentioned and my understanding was wrong. If we look at it in terms of "segments", I think it should be divided into three.
localhost/foo%2Fbar/baz/ąę consists of only three segments, not four. If the router accepted an array, it would look like this: > localhost/foo%2Fbar/baz/ąę
['foo/bar', 'baz', 'ąę']
However, "apply decodeURI() -> split into segments -> apply decodeURIComponent() for each segment" doesn't work.
decodeURI("%2525a/b/c").split("/").map(decodeURIComponent)
The expected result should be as follows,
['%25a', 'b', 'c']
In fact, it would look like this.
['%a', 'b', 'c']
If we want to get the correct result on a per-segment basis, we would need to return an array from getPath() as follows
return "%2525a/b/c".split("/").map(decodeURIComponent)
However, the Hono router does not understand the unit of “segment”.
No worries! I'm glad I could clarify.
However, "apply decodeURI() -> split into segments -> apply decodeURIComponent() for each segment" doesn't work.
[edit: removed because the example is right]
I believe implement memoization could improve the speed. Also skips normal path to fast parse
@fzn0x Thanks for the suggestion!
First of all, I don't think the cache will speed up the process, since we won't be calling getPath() multiple times for the same request object.
Also, I don't think the code you suggested would perform well if it were not cached.
If you still think it performs better, I would appreciate it if you could suggest it along with a benchmark.
FYI
The results of using symbols and multibyte characters in the routing definitions were investigated in the major frameworks. They vary considerably from framework to framework.
I don't think "we should match the major frameworks," but just that "if symbols and multibyte characters are not well supported in the major frameworks, it indicates that no one is having trouble in the production environment.
Symbols and multibytes do not appear to be available. paths are treated as "/"-separated segments.
const express = require('express')
const app = express()
const port = 3333
// "|" has a special meaning, so use "^" to validate
app.get('/^hello^', (req, res) => {
res.send('^hello^')
})
app.get('/hello🔥', (req, res) => {
res.send('hello🔥')
})
app.get('/users/:id/action', (req, res) => {
res.send('users action')
})
app.listen(port, () => {
console.log(`Example app listening on port ${port}`)
})
$ curl 'http://localhost:3333/^hello^'
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Error</title>
</head>
<body>
<pre>Cannot GET /%5Ehello%5E</pre>
</body>
</html>
$ curl 'http://localhost:3333/%5Ehello%5E'
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Error</title>
</head>
<body>
<pre>Cannot GET /%5Ehello%5E</pre>
</body>
</html>
$ curl 'http://localhost:3333/hello🔥'
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Error</title>
</head>
<body>
<pre>Cannot GET /hello%f0%9f%94%a5</pre>
</body>
</html>
$ curl 'http://localhost:3333/hello%f0%9f%94%a5'
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Error</title>
</head>
<body>
<pre>Cannot GET /hello%f0%9f%94%a5</pre>
</body>
</html>
$ curl 'http://localhost:3333/users/1%2F2/action' # 200 : id => 1/2
$ curl 'http://localhost:3333/users%2F1%2Faction' # 404
Only percent-encoded URLs are supported; unencoded URLs will result in a 404. paths are treated as "/"-separated segments.
get '/^hello^'
get '/|hello|'
get '/hello🔥'
get '/users/:id/action'
$ curl 'http://localhost/^hello^' # 404
$ curl 'http://localhost/%5Ehello%5E' # 200
$ curl 'http://localhost/|hello|' # 404
$ curl 'http://localhost/%7Chello%7C' # 200
$ curl 'http://localhost/hello🔥' # 404
$ curl 'http://localhost/hello%F0%9F%94%A5' # 200
$ curl 'http://localhost/users/1%2F2/action' # 200
1/2
$ curl 'http://localhost/users%2F1%2Faction' # 200
Before routing, all characters are decoded in a decodeURIComponent() equivalent process, showing similar results to https://github.com/honojs/hono/pull/2688.
path('^hello^', views.index, name='test'),
path('|hello|', views.index, name='test'),
path('hello🔥', views.index, name='test'),
path('a/<slug:slug>/c', views.path, name='test'),
$ curl 'http://localhost/^hello^' # 200
$ curl 'http://localhost/%5Ehello%5E' # 200
$ curl 'http://localhost/|hello|' # 200
$ curl 'http://localhost/%7Chello%7C' # 200
$ curl 'http://localhost/hello🔥' # 200
$ curl 'http://localhost/hello%F0%9F%94%A5' # 200
$ curl 'http://localhost:8000/a%2Fb%2Fc' # 200
slug = b
$ curl 'http://localhost:8000/a/a%2Fb%2Fc/c' # 404
If what we need is a result similar to Ruby on Rails, then the following changes will accomplish this. (Actually, it's a bit more complicated, but it's "encode when adding a routing.")
diff --git a/src/hono-base.ts b/src/hono-base.ts
index 293d3874..d4cb8071 100644
--- a/src/hono-base.ts
+++ b/src/hono-base.ts
@@ -290,7 +290,7 @@ class Hono<
private addRoute(method: string, path: string, handler: H) {
method = method.toUpperCase()
- path = mergePath(this._basePath, path)
+ path = encodeURI(mergePath(this._basePath, path))
const r: RouterRoute = { path: path, method: method, handler: handler }
this.router.add(method, path, [handler, r])
this.routes.push(r)
If what we need is a result similar to Ruby on Rails, then the following changes will accomplish this. (Actually, it's a bit more complicated, but it's "encode when adding a routing.")
diff --git a/src/hono-base.ts b/src/hono-base.ts index 293d3874..d4cb8071 100644 --- a/src/hono-base.ts +++ b/src/hono-base.ts @@ -290,7 +290,7 @@ class Hono< private addRoute(method: string, path: string, handler: H) { method = method.toUpperCase() - path = mergePath(this._basePath, path) + path = encodeURI(mergePath(this._basePath, path)) const r: RouterRoute = { path: path, method: method, handler: handler } this.router.add(method, path, [handler, r]) this.routes.push(r)
Yes, that's what I mean, sorry for the example, I think every framework does different things to deal with paths, but Django looks good!
If the expected behavior is equivalent to Ruby on Rails, I think #2711 can achieve this. I feel this is better than https://github.com/honojs/hono/pull/2688 because it is less burdensome at the request time.
However, I wonder if such routing is ever used in a production environment. If not, I think we can consider the option of "not supporting it".
I created #2714 with great support from @szmarczak!
Although #2711 is hard to discard for its performance advantage, at this point I consider this new #2714 to be the best.
The reasons are as follows
Although I am concerned about the following points.
However, I wonder if such routing is ever used in a production environment. If not, I think we can consider the option of "not supporting it".
Also, there is still the issue of how to handle URIError: URI malformed
. (I think it should return an error.)
@yusukebe Looking at the discussion so far, what are your thoughts?
@szmarczak @usualoma @fzn0x
Thank you for the discussion, and I'm sorry for the misunderstanding at first.
I like #2714! It achieves ideal handling paths, and the performance is good.
URIError: URI malformed
I also think it should return an error. When should it return the error? Can it return an error during the registration phase?
@usualoma If someone sends UTF-8 (non-ASCII) path over the wire, it's incorrect as the URI spec permits only ASCII. UTF-8 characters must be percent-encoded. So the Ruby on Rails
behavior seems most correct.
Although https://github.com/honojs/hono/pull/2711 is hard to discard for its performance advantage, at this point I consider this new https://github.com/honojs/hono/pull/2714 to be the best.
The performance should be the same. The least performant way is when someone sends %25
which is 1 of dozen characters that have to be percent-encoded, so the probability of this path is very low.
Also, there is still the issue of how to handle URIError: URI malformed. (I think it should return an error.)
Well, (unfortunately!) the URI spec does not define the encoding percent-encoded characters have to use (they are just bytes). It's just that the browser APIs like decodeURI
and encodeURI
force UTF-8 (using system encoding does not make sense as there are 8 billion people in the world with varying languages, so they had to arbitrarily define the encoding as UTF-8 (like MSL in TCP)).
Whether you return 400 Bad Request
or use the URI without decoding in that case (it failed to decode to UTF-8) - any approach would be ok IMO.
However, I wonder if such routing is ever used in a production environment. If not, I think we can consider the option of "not supporting it".
Having developed a Node.js HTTP client (sorry for bragging!), I can definitely say that people widely send UTF-8 encoded characters (well, some even send non-UTF-8 but that's minority and we were enforcing UTF-8 for years, we had to disable UTF-8 enforcing in order to support legacy servers using other encodings).
I think the difference between the two at request time is indicated by the following part of the benchmark: for bun, #2714 is slightly faster; for deno and node, #2711 is faster. (This is a short URL benchmark, not including the %) But in any case, it's a small difference, so I wouldn't worry too much about it.
However, I wonder if such routing is ever used in a production environment. If not, I think we can consider the option of "not supporting it".
Even hono@v4.3.7 can process URLs containing percent encodings with the following routing
app.get('/users/:id', (c) => c.text(`id is ${c.req.param('id')}`))
$ curl 'http://localhost:3000/users/|ok|'
id is |ok|
$ curl 'http://localhost:3000/users/%7Cok%7C'
id is |ok|
What cannot be handled correctly is the case where "the routing definition contains symbols or multibyte characters".
app.get('/|', async (c) => c.text('ayy'));
I would like to merge #2714 because it is more correct for the framework to be able to handle these cases. However, I don't think there are any realistic users who would define such a routing for a web service endpoint that is widely accessed by various clients.
@szmarczak Sorry for not being clear on the subject. You were primarily targeting "errors that cannot be decoded as UTF-8", while I think I was primarily targeting "incorrect percent encoding errors".
decodeURI("%A4%A2")
: Japanese "あ" encoded in EUC-JPdecodeURI("%a")
: Invalid encodingYou are right, regarding the former, I think it would be best if the framework does not make it an error, but in a way that the user can handle it himself. As for the latter, I think it should return 400 as soon as possible.
Of course, getPath()
could be modified to handle these properly, and hono-base could be rewritten, but I don't want to make getPath()
too large.
I am considering this.
However, I don't think there are any realistic users who would define such a routing for a web service endpoint that is widely accessed by various clients.
It depends what you're developing. If you're developing an HTTP client, then you need to have a spec-compliant server to properly test the client. Or you just need static routes for simple mock server.
Also, being a Polish guy, I can say there are websites exposing static routes like /główna
which means /home
in English. The fix would allow these cases, but more importantly would be spec-compliant.
decodeURI("%a") : Invalid encoding
I wouldn't worry about this. First, it's against the spec. Second, it's impossible to get this via HTML forms (because forms are spec compliant). Proper fetch
usage using WHATWG URLs never results in malformed URIs. The only possible situation for this is when someone intentionally types that in a browser or curl. If anybody wants something against the spec, they're better defining their own protocol instead.
@szmarczak Thank you!
more importantly would be spec-compliant.
I agree with you on that point.
decodeURI("%a") : Invalid encoding
I wouldn't worry about this.
Yes, of course, a user trying to send a correct request would not send the /%a/
path. However, there may be clients who send it for an attack.
Suppose we have a routing definition like this,
app.get('/content/:path', (c) => c.text('content'))
app.get('*', (c) => fallback()))
And suppose we have the following URL
http://localhost/users/%63ontent/path-to-%a
I think it would be best if a request to this URL would result in an error before entering the router, rather than proceeding to fallback.
Prevent "intentionally slip through static paths using percent encoding" attacks.
As I said in a previous comment, I think #2714 has the goodness to prevent this. If this feature is compromised, I think #2711 would be sufficient.
As shown in the following test, I would expect #2711 to return the expected result for processing /|
(or /główna
), but would you prefer to choose #2714 over #2711?
Oh, by the way, there is another issue that is difficult to handle in #2714 in terms of spec-compliant.
Suppose we have a routing definition like this,
app.get('/główna/:path', (c) => c.text('główna'))
And suppose we have the following URL
http://localhost/g%C5%82%C3%B3wna/%A4%A2 # %A4%A2 is not a UTF-8 character
I think this is a spec-compliant URL, but it is difficult to get it routed in #2714 (while maintaining the spec of accepting non-UTF-8 encodings). #2711 would accomplish this.
I've come to the conclusion that with a few adjustments #2714 should be fine. Wait a minute.
Yes, of course, a user trying to send a correct request would not send the /%a/ path. However, there may be clients who send it for an attack.
That's a very good find! In this case it makes more sense to return 400 Bad Request
.
would you prefer to choose https://github.com/honojs/hono/pull/2714 over https://github.com/honojs/hono/pull/2711?
I believe there's no difference in terms of the result. I'd choose what's more performant (or maintainable) and link to the other possible solution if you ever needed to change the algorithm.
I think this is a spec-compliant URL
It is!
it is difficult to get it routed in https://github.com/honojs/hono/pull/2714 (while maintaining the spec of accepting non-UTF-8 encodings)
The main
branch is affected as well. The spec tells you that the percent encoded characters must be decoded, but it doesn't specify the encoding (making it a Uint8Array is unusable, if anybody wants to send binary data they're better using POST). Browsers send UTF-8, so I'd force it to be UTF-8 and 400 Bad Request
if it isn't (I'd leave decodeURIComponent
as is).
I'm not sure about not decoding non-UTF-8 characters in #2714, but I'm leaning towards being okay.
I'm satisfied with #2714 👍🏼
@szmarczak Thanks right away! I was going to comment on this, but I'm late.
Added process for invalid percent-encoding and invalid UTF-8 sequences.
Skip these sequences and apply decodeURI()
to the rest as much as possible.
https://github.com/honojs/hono/pull/2714/commits/d252d13f2a64bcc90e0a72e508607a7d10feba51
This will prevent the following URLs from also slipping through the routing
http://localhost/users/%63ontent/path-to-%a
http://localhost/g%C5%82%C3%B3wna/%A4%A2
(As was the case before this PR.) If an encoding other than UTF-8 is used, a URIError is thrown when trying to retrieve it with c.req.param()
. If you want to get it, you can get it by yourself from c.req.url
, or write your own getPath()
.
Invalid percent encodings, such as %a
, may be allowed to exit with a 400 error before invoking handlers, but that error checking is beyond the scope of this issue and is not addressed here.
@szmarczak Thanks again for all the thought-provoking comments so far, and for your thoughtful answers to all my questions.
I think #2711 is often marginally better in terms of performance.
However, #2714 has more room for getPath()
adjustments for users who want to do more complex things, and (not mentioned so far) regular expressions such as the following can be applied to decoded strings.
app.get('/:path{home|główna|ホーム}')
I think #2714 is a good choice because of these various good points.
@yusukebe How about we make this issue resolved by #2714, including error handling?
@fzn0x @mgrithm If you have any thoughts, feel free to comment.
@yusukebe Added "Specification Changes" item to descriptioin. #2714
@usualoma
It looks good to me! The implementation of tryDecodeURI()
and using app.onError
for URIError
is good.
Shall we go with it?
@yusukebe Thanks for the confirmation. Let's go with it!
Let's go with it. I don't have any suggestions; I trust everyone here. :)
What version of Hono are you using?
4.3.6
What runtime/platform is your app running on?
Deno
What steps can reproduce the bug?
Deno/Hono
Node
What is the expected behavior?
ayy
What do you see instead?
404 Not Found
Additional information
No response