unshiftio / url-parse

Small footprint URL parser that works seamlessly across Node.js and browser environments.
http://unshift.io
MIT License
1.03k stars 104 forks source link

Incorrect protocol/hostname parsing #203

Closed Przytua closed 3 years ago

Przytua commented 3 years ago

According to URL and URI definitions, after the scheme (or protocol) followed by colon (:), there is an optional authority (or hostname) component preceded by two slashes (//).

The File URI describes exactly how many slashes can be a part of the authority as:

A valid file URI must therefore begin with either file:/path (no hostname), file:///path (empty hostname), or file://hostname/path.

When parsing an iOS file URL which looks like this:

file:///Users/username/Library/Developer/CoreSimulator/Devices/00000000-0000-0000-0000-000000000000/data/Containers/Data/Application/00000000-0000-0000-0000-000000000000/Documents/Inbox/dummy.pdf

the parser incorrectly takes all 3 slashes as a host predecessor (due to protocol regex looking for ([\\/]{1,})?, so 1 or more slashes), and takes the Users as a host, and the rest (starting with /username) as the path. To properly parse this file URL, only 2 slashes should be used as a host predecessor, while 3rd slash should be the beginning of a path (thus resulting with an empty host).

Exact issue that this is causing is that the host is lowercased, and toString function always uses 2 slashes when building the string, so parsing above file URL, and using toString on it results with:

file://users/username/Library/Developer/CoreSimulator/Devices/00000000-0000-0000-0000-000000000000/data/Containers/Data/Application/00000000-0000-0000-0000-000000000000/Documents/Inbox/dummy.pdf

which is an incorrect file URL on iOS, and cannot be accessed.

Changing the protocolre from /^([a-z][a-z0-9.+-]*:)?([\\/]{1,})?([\S\s]*)/i to /^([a-z][a-z0-9.+-]*:)?([\\/]{2})?([\S\s]*)/i (so that ([\\/]{2})? will take 0 or 2 slashes) fixes this issue.

lpinca commented 3 years ago

It actually depends on the protocol and url-parse is not smart enough to handle the file: protocol correctly.

> new URL('http:///Users/username/Library/dummy.pdf').toString()
'http://users/username/Library/dummy.pdf'
> new URL('file:///Users/username/Library/dummy.pdf').toString()
'file:///Users/username/Library/dummy.pdf'

It worked as expected before https://github.com/unshiftio/url-parse/commit/d1e7e8822f26e8a49794b757123b51386325b2b0

> var parse = require('.');
undefined
> parse('file:///Users/username/Library/dummy.pdf').toString();
'file:///Users/username/Library/dummy.pdf'
krassowski commented 3 years ago

Hi @lpinca this issue seems to be causing some headache for us in https://github.com/krassowski/jupyterlab-lsp/issues/595 (if my understanding of things is correct). We would prefer to avoid reverting back to an earlier version as it was labelled as a security fix (and because we don't have a direct control over version being used as it is a peer dependency in a federated extensions system).

Would the regexp change suggested by @Przytua work for you? If yes, I would be happy to open a PR implementing the fix together with a unit test, and we would greatly appreciate if it could make it into 1.5.2 release soon.

lpinca commented 3 years ago

I think it will invalidate some of the existing tests.

lpinca commented 3 years ago

@Przytua @krassowski the following patch

diff --git a/index.js b/index.js
index 72b27c0..61ba3c1 100644
--- a/index.js
+++ b/index.js
@@ -224,7 +224,9 @@ function Url(address, location, parser) {
   // When the authority component is absent the URL starts with a path
   // component.
   //
-  if (!extracted.slashes) instructions[3] = [/(.*)/, 'pathname'];
+  if (!extracted.slashes || url.protocol === 'file:') {
+    instructions[3] = [/(.*)/, 'pathname'];
+  }

   for (; i < instructions.length; i++) {
     instruction = instructions[i];
@@ -288,7 +290,10 @@ function Url(address, location, parser) {
   // Default to a / for pathname if none exists. This normalizes the URL
   // to always have a /
   //
-  if (url.pathname.charAt(0) !== '/' && url.hostname) {
+  if (
+    url.pathname.charAt(0) !== '/' &&
+    (url.hostname || url.protocol === 'file:')
+  ) {
     url.pathname = '/' + url.pathname;
   }

@@ -430,7 +435,7 @@ function toString(stringify) {

   if (protocol && protocol.charAt(protocol.length - 1) !== ':') protocol += ':';

-  var result = protocol + (url.slashes ? '//' : '');
+  var result = protocol + (url.slashes || url.protocol === 'file:' ? '//' : '');

   if (url.username) {
     result += url.username;

fixes the isse, but I'm not sure if it has side effects.