sindresorhus / normalize-url

Normalize a URL
MIT License
837 stars 123 forks source link

Percent-decoding entire URL components is not valid #180

Open karwa opened 1 year ago

karwa commented 1 year ago

These operations are not valid:

https://github.com/sindresorhus/normalize-url/blob/3c7235ebb06a6ec8b753d4d9d24d0b7adf2d425b/index.js#L178-L183

https://github.com/sindresorhus/normalize-url/blob/3c7235ebb06a6ec8b753d4d9d24d0b7adf2d425b/index.js#L243-L245

Since URLs are a textual format, certain characters have semantic meaning. Percent-encoding can be used to escape those characters. For example, if we want a single path component named "AC/DC", we'll have a problem, because "/" can mean a path separator:

/music/bands/AC/DC
             ^^^^^ - 2 components! ❌

So instead, we have to escape the use of "/" within name "AC/DC":

/music/bands/AC%2FDC
             ^^^^^^^ - 1 component ✅

If you percent-decode the entire path string, we irreversibly lose the information that "AC/DC" was supposed to be a single path component.

Instead, the correct way to do this is to split the component (still escaped) up in to its constituent parts, decode each component, escape any characters with semantic meaning, and join them up again. For the path, that means breaking it up in to path components and ensuring "/" and "\" characters are escaped again in each component before you rebuild the path string. For the query, it means doing the same for each key and value (not each key-value pair - they need to be broken down in to their smallest subcomponents).

sindresorhus commented 1 year ago

// @oswaldosalazar @ludofischer

karwa commented 1 year ago

Ah my bad - the path example is okay because decodeURI won't unescape some reserved characters, like "/", so it will keep the path component as "AC%2FDC". But for the query decoding, decodeURIComponent really does just unescape everything:

var url = new URL("http://example/?show=Tom%26Jerry&episode=3");

url.href;
// 'http://example/?show=Tom%26Jerry&episode=3'
//                          ^^^ - ❗️
url.searchParams.get("show");
// 'Tom&Jerry'

url.search = decodeURIComponent(url.search);

url.href;
// 'http://example/?show=Tom&Jerry&episode=3'
//                          ^ - ❗️
url.searchParams.get("show");
// 'Tom'
ludofischer commented 1 year ago

As far as I remember decodeURIcomponent was introduced because some people complained that nomalize-url would break their URLs when the search parameters contained a forward slash (it would encode the forward slash as a side effect of sorting). As I see it, the library should not change the encoding unless the user asks for it. Since sorting is responsible via searchParams.sort() for changing the encoding, could the solution be to sort them by hand? Or is this more complicated than it looks? Interesting, it looks like even server-side frameworks that do file-based routing end up with issues with decodeURIComponent. https://github.com/sveltejs/kit/issues/7577

ludofischer commented 1 year ago

The problem is that URL.searchParams gives you the parameter already decoded, so we can't rely on that to preserve the original encoding. We can't use decodeURI instead of decodeURIComponent because decodeURI does not bring back the forward slash (it keeps encoded as %2F).