proposal: WHATWG URL standard implementation

jasnell commented 8 years ago

The WHATWG URL Standard specifies updated syntax, parsing and serialization of URLs as currently implemented by the main Web Browsers. The existing Node.js url module parsing and serialization implementation currently does not support the URL standard and fails to pass about 160 of the standard tests.

This proposal is to implement the WHATWG URL Standard by introducing a new URL class off the url module (e.g. require('url').URL).

The existing url module would remain unchanged and there should be no backwards compatibility concerns.

Example

const url = new URL('http://user:pass@example.org:1234/p/a/t/h?xyz=abc#hash');

console.log(url.protocol;      // http:
console.log(url.username);     // user
console.log(url.password);     // password
console.log(url.host);         // example.org:1234
console.log(url.hostname);     // example.org
console.log(url.port);         // 1234
console.log(url.pathname);     // /p/a/t/h
console.log(url.search);       // ?xyz=abc
console.log(url.searchParams); // SearchParams object
console.log(url.hash);         // hash

// The SearchParams object is defined by the WhatWG spec also
url.searchParams.append('key', 'value');

console.log(url);
 // http://user:pass@example.org:1234/p/a/t/h?xyz=abc&key=value#hash

yosuke-furukawa commented 8 years ago

my concern is clear, sounds good.

rvagg commented 8 years ago

@jasnell I know you're resisting this because of the work, but I'm with the folks in here that would like a more concrete sense of what the difference is between our impl and the spec. I know you have a test case for it and can count the number of tests but having a clearer picture of what changes would be required to shift the current impl to be spec compliant would likely help the decision. Perhaps it's waaaay to much breakage and in that case it rules out a semver-major break, or perhaps it's edge-casey breakage as Domenic suggests and we'd all be comfortable with shifting the impl in a semver-major and communicating it clearly while we do it.

Re increasing API surface area, I'm very skeptical that if we were to implement a second URL parser that we could easily get rid of the original and that we'll be stuck in an awkward limbo of having to support both. I'm not at all comfortable giving a +1 to doing this unless we have clear data ruling out other options.

jasnell commented 8 years ago

@rvagg ... to be honest, I'd prefer to just replace the existing URL parser outright as a semver-major change, but I'm trying to be sensitive to the possibility of breaking things. Recent experiences around realpath, the module loader and stdio, as well as the breakage that occurred the last time significant changes to the existing url parser were attempted, would imply that being overly sensitive to any breaking change is warranted. In other words, even if breaking changes to the existing parser are minimal, I'd like to avoid making such breaking changes at all and instead provide an alternative implementation and API that users can transition to. So if the only option in core is to modify the existing parser in a non-backwards compatible way, I'd rather simply opt to iterate on whatwg-url as a userland module and recommend that developers use that as an alternative to the built in parser.

domenic commented 8 years ago

whatwg url spec depends on encoding spec too. (https://encoding.spec.whatwg.org/) will Encoding be separate modules like require('encodings') ? or internal only ?

There will be no real dependency for a Node implementation. It only depends on non-UTF-8 encodings when the "encoding override" parameter is used, which is only the case for web content.

tr46 is a real dependency for host parsing (and domainToUnicode/domainToASCII) and might be a real concern. https://github.com/Sebmaster/tr46.js/blob/master/index.js but also note that it includes a mapping table generated from IDNA.

Yeah, it will be necessary to expand the list of "special" schemes supported by the parser. This shouldn't be that difficult to do.

I would be against this. Browsers don't parse special schemes this way, and we should not either. git should be parsed the same as data, I believe.

I know you're resisting this because of the work, but I'm with the folks in here that would like a more concrete sense of what the difference is between our impl and the spec. I know you have a test case for it and can count the number of tests but having a clearer picture of what changes would be required to shift the current impl to be spec compliant would likely help the decision.

I believe the differences in parsing are edge-casey, but the differences in API surface are extreme, because of the use of getters and setters. That is part of what sunk the old attempt to rewrite the URL parser, and that was only to use getters for for speed, without even implementing setters for consistency. I think the API surface of the URL Standard is really nice, personally, and think it would be great for core to adopt it.

To head off any suggestions of having core implement the parser but leaving the API surface to userland, this isn't really feasible unfortunately, as the API surface requires a lot of direct hooks into the middle of the parser to implement e.g. the setters that keep things consistent. The full low-level API surface necessary to implement a higher API is given in https://github.com/jsdom/whatwg-url#low-level-url-standard-api; as you can see it's a lot more than a simple string -> URL record parser (with the state overrides alone adding a lot of entry points).

trevnorris commented 8 years ago

I would be against this. Browsers don't parse special schemes this way, and we should not either. git should be parsed the same as data, I believe.

Overall I agree with you, but I believe this point also violates the path of least surprise. If users can't parse the additional protocols then they're unlikely to ever adopt the new API.

jasnell commented 8 years ago

Just ran some tests. The whatwg-url parser handles git URLs with no problems. The chrome parser is the one with the issue on it. On Jun 2, 2016 10:54 AM, "Trevor Norris" notifications@github.com wrote:

I would be against this. Browsers don't parse special schemes this way, and we should not either. git should be parsed the same as data, I believe.

Overall I agree with you, but I believe this point also violates the path of least surprise. If users can't parse the additional protocols then they're unlikely to ever adopt the new API.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/nodejs/node-eps/pull/28#issuecomment-223370085, or mute the thread https://github.com/notifications/unsubscribe/AAa2eSLhgPOu2LTkRjAY6WYi0ShrUBgMks5qHxjTgaJpZM4IrtaC .

ben-page commented 8 years ago

This is great. After writing the little PR above, I wrote my own implementation of the whatwg-url parser. The current parser is in adequate many ways and I think people expect more parity with browser's implementations.

I haven't released my implementation, because it doesn't do URL serializing or equivalence. Also, I wasn't sure how to drum up interest in such a significant change to Node.js. Anyway, I'm be interested in help with this in anyway I can.

ben-page commented 8 years ago

On subject of scheme, a valid scheme can contain ASCII alphanumeric characters, +, -, or .. So parsing git schemes should not be a problem. The standard defines special schemes: ftp, file, gopher, http, https, ws, wss. But this list only exists to define the default IP port for these schemes. The one exception is the file scheme. It's parsed completely differently than other schemes.

jasnell commented 8 years ago

EPS proposal updated. /cc @nodejs/ctc

jasnell commented 8 years ago

Putting this on the ctc-agenda as I believe it's ready for discussion/review.

evanlucas commented 8 years ago

I would be against this. Browsers don't parse special schemes this way, and we should not either. git should be parsed the same as data, I believe.

Does npm rely on the parsing of urls with support for the git protocol? If so, do we have a migration path for that?

trevnorris commented 8 years ago

There's a certain loss of functionality by using new URL() over url.parse(). Specifically that url.parse() isn't just for URLs but also for URIs. For example: url.parse('github.com') works fine but new URL('github.com') throws.

Taking the following excerpt from the WHATWG URL specification:

Standardize on the term URL. URI and IRI are just confusing. In practice a single algorithm is used for both so keeping them distinct is not helping anyone. URL also easily wins the search result popularity contest.

It seems that URL is not meant to mean only URL. Even though there a scheme is always necessary (whether absolute or relative URL). Because of this any scheme is acceptable, but possibly parsed as expected. For example:

new URL('git://github.com/foo/bar').pathname == '//github.com/foo/bar';

Even though git: has been provisionally registered with the IANA it is still not parsed as a standard domain. It should be possible to properly properly parse the pathname from the above.

Taking from RFC3986 we can see a simple generic way of parsing the URL that I think could be used at minimum to parse unfamiliar schemes:

         foo://example.com:8042/over/there?name=ferret#nose
         \_/   \______________/\_________/ \_________/ \__/
          |           |            |            |        |
       scheme     authority       path        query   fragment
          |   _____________________|__
         / \ /                        \
         urn:example:animal:ferret:nose

@domenic

I would be against this. Browsers don't parse special schemes this way, and we should not either. git should be parsed the same as data, I believe.

Not to be rude, but check your own work first. Chrome has added an exception to URL() for its own needs:

new URL('chrome-extension://adlfkjaslkfj/foo/bar.html').pathname === '/foo/bar.html';

Additionally, URL() today doesn't parse unknown schemes as data:

const u = new URL('git://github.com/foo/bar');
u.protocol === 'git:';
u.origin === 'git://';

All this said, we should support other/unknown schemes properly with URL().

domenic commented 8 years ago

It's already been clarified upthread that I was incorrect about special schemes. There are no special schemes besides file:

trevnorris commented 8 years ago

@domenic oop. sorry about that. completely glanced over that comment.

jasnell commented 8 years ago

Adding git as a special scheme would be trivial if necessary. If necessary we should first try to upstream such a change into the URL spec itself.

On Jul 13, 2016 10:13 PM, "Trevor Norris" notifications@github.com wrote:

@domenic https://github.com/domenic oop. sorry about that. completely glanced over that comment.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/nodejs/node-eps/pull/28#issuecomment-232558585, or mute the thread https://github.com/notifications/unsubscribe/AAa2eczDLckC0l3fAaMSqdd27bHxSOJVks5qVcV-gaJpZM4IrtaC .

jasnell commented 8 years ago

After further testing, git URLs are handled by this implementation without any problems whatsoever using the implementation in https://github.com/nodejs/node-eps/pull/28.

jasnell commented 8 years ago

The CTC discussed this today and has decided to advance this to DRAFT status with the exception of the URL constructor not being a global. I will get the PR updated to remove part about it being a global and will get it landed. Thank you all.

yorkie commented 8 years ago

By the way, after the core implemented the WHATWG URL standard, I think also should start deprecating the path in the http module at here where we were using the path in request option.

jasnell commented 8 years ago

@yorkie ... eventually, once the new URL object is no longer experimental, my goal is to replace use of the existing url.parse within the http module entirely. That's not going to be easy tho. It's going to take a long time.

jasnell commented 8 years ago

Landed in f16f770

nodejs / node-eps

proposal: WHATWG URL standard implementation #28

Example