nodejs / node-eps

Node.js Enhancement Proposals for discussion on future API additions/changes to Node core

proposal: WHATWG URL standard implementation #28

Closed jasnell closed 7 years ago

jasnell commented 8 years ago

The WHATWG URL Standard specifies updated syntax, parsing and serialization of URLs as currently implemented by the main Web Browsers. The existing Node.js url module parsing and serialization implementation currently does not support the URL standard and fails to pass about 160 of the standard tests.

This proposal is to implement the WHATWG URL Standard by introducing a new URL class off the url module (e.g. require('url').URL).

The existing url module would remain unchanged and there should be no backwards compatibility concerns.

Example

const url = new URL('http://user:pass@example.org:1234/p/a/t/h?xyz=abc#hash');

console.log(url.protocol);     // http:
console.log(url.username);     // user
console.log(url.password);     // pass
console.log(url.host);         // example.org:1234
console.log(url.hostname);     // example.org
console.log(url.port);         // 1234
console.log(url.pathname);     // /p/a/t/h
console.log(url.search);       // ?xyz=abc
console.log(url.searchParams); // URLSearchParams object
console.log(url.hash);         // #hash

// The URLSearchParams object is also defined by the WHATWG spec
url.searchParams.append('key', 'value');

console.log(url.href);
 // http://user:pass@example.org:1234/p/a/t/h?xyz=abc&key=value#hash
jasnell commented 8 years ago

@nodejs/collaborators

seishun commented 8 years ago

Perhaps we could raise interest in this by providing examples of failing tests.

jasnell commented 8 years ago

@seishun ... https://github.com/nodejs/node/blob/master/test/known_issues/test-url-parse-conformance.js :-)

We currently fail somewhere around 140+ of the test cases in the WhatWG set.
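To give one concrete flavour of the kind of divergence involved, here is a minimal sketch comparing the legacy parser with a WHATWG-style URL class. It assumes a Node.js build where require('url').URL exists (which is what this proposal asks for). The sample input is illustrative and is not taken from the conformance test file.

```javascript
const url = require('url');
const { URL } = url; // assumes the proposed class is exported from the url module

const input = 'HTTP://EXAMPLE.org:80/a/../b';

// Legacy parser: keeps the default port and does not collapse dot segments.
const legacy = url.parse(input);
console.log(legacy.port);     // 80
console.log(legacy.pathname); // /a/../b

// WHATWG parser: drops the scheme's default port and normalizes the path.
const whatwg = new URL(input);
console.log(whatwg.port);     // (empty string)
console.log(whatwg.pathname); // /b
console.log(whatwg.href);     // http://example.org/b
```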

MylesBorins commented 8 years ago

+1

Considering that browsers are exposing the global, I think it makes a whole bunch of sense. I'd be interested in helping out with this. Have you broken ground on implementation @jasnell?

Can we borrow from implementations at all?

jasnell commented 8 years ago

@TheAlphaNerd ... yeah, I've got it mostly implemented already. The next step is to start running it through its paces with tests and benchmarks and to find ways of optimizing the implementation. It's currently quite a bit slower than the existing require('url') module parsing.

jasnell commented 8 years ago

Regarding borrowing from other impls, it's entirely possible that we could borrow from Chrome's implementation. I'm not sure yet if theirs is a pure JS impl or not. I'll look into that.

yorkie commented 8 years ago

@jasnell How about the following 2 static methods, which are defined in the standard's IDL:

  • domainToASCII(domain)
  • domainToUnicode(domain)

I didn't see those in the proposal :-(

jasnell commented 8 years ago

Still considering those. They are easy enough to implement given the punycode module but I'm not sure how extensively they're used.
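For reference, a small sketch of what those two operations do. It uses url.domainToASCII() and url.domainToUnicode() as functions on the url module, which is how current Node.js releases expose them. That is an assumption relative to this thread, where they were still only being considered and were described as static methods in the spec's IDL.

```javascript
const url = require('url');

// Unicode domain to its ASCII (Punycode) form.
const ascii = url.domainToASCII('español.com');
console.log(ascii); // xn--espaol-zwa.com

// And back again.
const unicode = url.domainToUnicode(ascii);
console.log(unicode); // español.com

// The same conversion happens implicitly when a URL is parsed.
const hostname = new url.URL('http://español.com/').hostname;
console.log(hostname); // xn--espaol-zwa.com
```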

jasnell commented 8 years ago

There will also be other differences. For instance, I'm not sure if we need the host parsing component.

targos commented 8 years ago

I'm not sure yet if theirs is a pure JS impl or not.

There is https://github.com/jsdom/whatwg-url

cc @domenic

jasnell commented 8 years ago

Yeah, I'm familiar with (and use) whatwg-url. @domenic, how would you feel about the possibility of pulling the whatwg-url implementation into core? I'm not yet that familiar with how it is implemented internally but having a url standard compliant url parser built into core would be very good.

domenic commented 8 years ago

Yep, that's @Sebmaster's most excellent work. It's not super optimized, but would be a good starting point.

This thread is very exciting and I'm glad there is an appetite for the idea!! The idea of a global, the same as in browsers, is great.

I think there are two separable problems here:

There are tests of the URL Standard at https://github.com/w3c/web-platform-tests/tree/master/url, and whatwg-url has a runner. The coverage is pretty reasonable; see https://github.com/w3c/web-platform-tests/issues/3018

domenic commented 8 years ago

@domenic, how would you feel about the possibility of pulling the whatwg-url implementation into core?

Oh, that'd be very cool! I guess my only concern is that we weren't concerned with speed when writing it, so there is probably lots of low-hanging fruit for performance improvements.

You'd also need to do a bit of work to decouple it from webidl-conversions and webidl2js. If you npm install it you'll find that it follows the same impl/wrapper strategy as browsers do, where there's a "wrapper" that takes care of USVString conversion and so on, delegating to the "impl" where parsing occurs, which in turn delegates to the URL state machine code. At least one of these layers could be disintermediated, although perhaps benchmarking should be done to see if that's actually the area where most improvement could be made.

jasnell commented 8 years ago

Ok, I'll dig in and explore the whatwg-url internals and see what can be done reasonably. Before getting too deep into this I'd definitely like to get more +1's from collaborators tho. I'd really like to see this happen tho.

jasnell commented 8 years ago

I think there are two separable problems here

Definitely agree that it's worth separating these. Not sure about modifying the existing require('url') too much tho -- ideally once the global URL is there for a while we could simply deprecate require('url') (and possibly even require('querystring')) entirely with an eye towards reducing the Node.js-specific API surface area.

Qard commented 8 years ago

Well, you can have a 👍 from me! More browser/server unity would be great. 😺

Qard commented 8 years ago

BTW, I like the idea of the global, since that's what browsers do, but I'd recommend not exposing it in require('url') because then people might get used to it being there rather than available globally, making future removal of the module more difficult.

jasnell commented 8 years ago

The only concern I would have with that, @Qard, is that if there is existing code that does URL = whatever, there would be no way of recovering the original global. That should be a limited edge case, however.

chrisdickinson commented 8 years ago

This is not an argument for or against, but a request for more background. You've described what work you would like to do, and how you would do it — but I don't see much in the way of "why" we'd want to do this. What is not working about the status quo that we will rectify by adding a new global object? What value do Node users gain, concretely, from the URL object?

Qard commented 8 years ago

Maybe store it on process.URL for safe keeping?

MylesBorins commented 8 years ago

@chrisdickinson personally I see quite a bit of benefit in minimizing the delta between node + the browser as far as utility APIs like this are concerned.

jasnell commented 8 years ago

@chrisdickinson ... the why is straightforward: Currently Node.js' URL parsing has a number of issues in terms of not following the standardized behavior implemented by browsers. Examples of those failures can be seen in the test case I referenced here. There are also differences in the Node.js provided API that are largely unnecessary. This work would give us an opportunity to not only provide more robust URL parsing, but to provide a unified, non-Node.js specific API.

@Qard ... that's certainly a possibility

jasnell commented 8 years ago

FWIW, looks like I mis-remembered the number of failures ;-) ... here's the exact count:

bash-3.2$ ./node test/known_issues/test-url-parse-conformance.js 
Unknown globals: URL

assert.js:90
  throw new assert.AssertionError({
  ^
AssertionError: 160 failed tests (out of 352)
    at Object.<anonymous> (/Users/james/Node/main/node/test/known_issues/test-url-parse-conformance.js:57:8)
    at Module._compile (module.js:541:32)
    at Object.Module._extensions..js (module.js:550:10)
    at Module.load (module.js:458:32)
    at tryModuleLoad (module.js:417:12)
    at Function.Module._load (module.js:409:3)
    at Function.Module.runMain (module.js:575:10)
    at startup (node.js:152:18)
    at node.js:449:3
bash-3.2$ 
joepie91 commented 8 years ago

My general dislike of constructors aside, I have to second @chrisdickinson's concern here. It's not clear to me why we'd want this as a global. Sure, that's how browsers implement it, but Node is not a browser, and browser APIs have historically been designed with a fundamentally different model in mind (namely, the lack of CommonJS).

This proposal, in its current form, is starting to look like it'll drag Node in the direction of PHP - endlessly adding APIs and not really trying to enforce or encourage consistency. As it is, this will just further confuse users as to whether they should be require()ing a module or expecting a global to be there - not to mention it being unclear whether they should use the url module or the URL global. Bundling tools already take care of adding in 'fake' modules that point at browser APIs, so what exactly are we gaining here by trying to make it "more like browser APIs"?

As a separate question, how - if at all - does this cover the "insert object of URL components, receive stringified URL" usecase?

EDIT: Further question: Are there any major concerns about just fixing the existing url module in a next major (breaking) release? This would seem preferable to me, but I don't know if there are any major roadblocks that I might not be aware of, or whether this has been discussed somewhere before.

jasnell commented 8 years ago

Let's separate the concerns just a bit. We could implement this but not make it a global, and just have it accessible via the existing url module (e.g. const URL = require('url').URL). We could go this route and still eventually deprecate the Node.js specific API. I would be ok with going that route. This proposes it as a global to be consistent with browsers, but that's not critical if we cannot get consensus on it.

Second, on the point:

endlessly adding APIs and not really trying to enforce or encourage consistency

My goal here is to eventually be able to fully deprecate the existing url and querystring modules with the hope of reducing the Node.js specific API surface area. Obviously that's not something that would happen quickly, tho, so the concern is definitely noted.

does this cover the "insert object of URL components, receive stringified URL" usecase?

It would insofar as the URL object as defined by the WHATWG spec includes toString() and the href property to provide serialization of the object. If what you're referring to is the ability to create a non-URL object and serialize it, that's not something that's currently supported by the WHATWG spec but it's something that can be easily maintained.

joepie91 commented 8 years ago

We could implement this but not make it a global, and just have it accessible via the existing url module (e.g. const URL = require('url').URL). We could go this route and still eventually deprecate the Node.js specific API. I would be ok with going that route. This proposes it as a global to be consistent with browsers, but that's not critical if we cannot get consensus on it.

That seems like a more workable solution to me. There would still likely be short-term user confusion while both APIs exist, however, so this would be something that'd require very clear documentation. I realize that you might've missed the question I added to my previous post later on - could you have a look at that suggestion as well?

My goal here is to eventually be able to fully deprecate the existing url and querystring modules with the hope of reducing the Node.js specific API surface area. Obviously that's not something that would happen quickly, tho, so the concern is definitely noted.

I understand that, but I still don't really see the value in trying to bring the Node.js API in line with browser APIs, as opposed to just keeping it internally consistent (given that Node.js was designed with CommonJS in mind, and switching to globals for everything would be a step back). In the end, Node.js is its own environment, and I can't see how trying to reduce the API to the lowest common denominator would benefit usability.

If what you're referring to is the ability to create a non-URL object and serialize it, that's not something that's currently supported by the WHATWG spec but it's something that can be easily maintained.

That's probably what I mean, yeah - I'm thinking of something along the lines of the url.format method, that can just directly accept a POJO.
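To make the "object of components in, string out" use case concrete, here is a rough sketch of both routes. url.format() is the existing API. fromParts() is a hypothetical helper written for this illustration only and is not part of the WHATWG spec or of Node.js. The sketch assumes require('url').URL is available.

```javascript
const url = require('url');

const parts = {
  protocol: 'https',
  hostname: 'example.org',
  pathname: '/p/a/t/h',
  query: { xyz: 'abc' }
};

// Existing API: url.format() accepts a plain object directly.
const legacy = url.format(parts);
console.log(legacy); // https://example.org/p/a/t/h?xyz=abc

// WHATWG style: start from a minimal URL and assign the remaining components.
// fromParts() is a hypothetical helper, not a spec or core API.
function fromParts({ protocol, hostname, pathname, query }) {
  const u = new url.URL(`${protocol}://${hostname}`);
  u.pathname = pathname;
  for (const [key, value] of Object.entries(query)) {
    u.searchParams.append(key, value);
  }
  return u.href;
}

const standard = fromParts(parts);
console.log(standard); // https://example.org/p/a/t/h?xyz=abc
```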

alfiepates commented 8 years ago

I believe it is valuable to implement the WhatWG URL Standard, but I don't understand why you would implement it as a global instead of simply fixing the url module.

IMO @joepie91's concern is valid; I think we should stick to the existing accepted behavior. There's a reason Node.js is so easy to write, and that's because (in the vast majority of cases I have experienced), it makes sense, and it makes sense because it's consistent.

ChALkeR commented 8 years ago

@jasnell

This is the complete list of current properties of the global object that were not defined in the ECMA specs:

Almost all of those are there for a very good reason, and adding more should be considered very thoughtfully. I don't think that URL justifies that. We could add fetch next. And XMLHttpRequest. And so on — the list would be of indefinite length, once we start.

Upd: for comparison, Chrome has more than 700 of those, not counting the ones defined by ECMA specs.

That said, I'm +1 on the idea of bringing URL to Node.js and perhaps eventually replacing Node.js specific api with it. And I think it's a great idea.

Not sure about exposing URL on require('url'), though, if the long-term plan is to deprecate url. Perhaps a different module name?

jasnell commented 8 years ago

Ok, I think the building trend is to avoid the introduction of a new global, and that is fine. I'll pull that from the proposal. It can be revisited later on if necessary.

instead of simply fixing the url module

My primary concern here is backwards compatibility. I don't simply want to modify the existing url module because doing so would likely break existing code. Also, fixing the existing code to be compliant with the WHATWG spec would be roughly the same amount of work as simply doing a clean new implementation. By implementing the URL object as a parallel API, we can provide a clear transition path off the Node.js specific API and onto the standard API, then fully deprecate and eventually remove the existing API.

Done properly, the impact to existing code should be as minimal as possible, and if we are not introducing this as a global, then we would need to, at the very least, keep the require('url') module but eventually deprecate the Node.js specific APIs exposed by it. e.g.:

// deprecated APIs
const url = require('url');
url.parse();
url.format();
url.resolve();

// new API
const URL = require('url').URL;
new URL(url, base)
URL.format(urlLikeObject)
alfiepates commented 8 years ago

@jasnell Alright, I'm more than comfortable with that.

jasnell commented 8 years ago

In fact, thinking about it further, we may be able to simplify the transition even more by simply having the url module directly export the new URL object and hanging the deprecated parse() and resolve() methods as statics off of it. Existing code should be unaffected.

const URL = require('url');
new URL(url, base);
URL.format(urlLikeObject);

// deprecated API
URL.parse(); // the existing parse method
URL.resolve(); // the existing resolve method

(Anyone currently doing const url = require('url') would see no real difference)

domenic commented 8 years ago

I'd caution against extending the standard URL with nonstandard methods, if the goal is to be able to help write cross-environment code. That includes .format.

jasnell commented 8 years ago

@domenic ... noted ... then perhaps simply

const url = require('url');

// deprecated APIs
url.parse();
url.resolve();

// retained existing API
url.format(urlLikeObject);

// new API
const URL = require('url').URL;
new URL(url, base);
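
As a rough illustration of the transition path, the sketch below resolves the same relative reference with the existing url.resolve() and with the two-argument URL constructor. It assumes require('url').URL is available as proposed. The base and the relative reference are made-up sample values.

```javascript
const url = require('url');
const { URL } = url;

const base = 'http://example.org/a/b/';

// Existing API: resolve a relative reference against a base string.
const legacy = url.resolve(base, '../c?x=1');
console.log(legacy); // http://example.org/a/c?x=1

// New API: the second constructor argument plays the role of the base.
const resolved = new URL('../c?x=1', base);
console.log(resolved.href); // http://example.org/a/c?x=1
```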
jasnell commented 8 years ago

Updated based on the feedback

chrisdickinson commented 8 years ago

@jasnell:

... the why is straightforward: Currently Node.js' URL parsing has a number of issues in terms of not following the standardized behavior implemented by browsers. Examples of those failures can be seen in the test case I referenced here. There are also differences in the Node.js provided API that are largely unnecessary. This work would give us an opportunity to not only provide more robust URL parsing, but to provide a unified, non-Node.js specific API.

Have the differences between WHATWG's URL spec and Node's caused errors for users? Issue links would help give the proposal concrete grounding.

My goal here is to eventually be able to fully deprecate the existing url and querystring modules with the hope of reducing the Node.js specific API surface area. Obviously that's not something that would happen quickly, tho, so the concern is definitely noted.

This is notably something that we've tried before and have not been successful at; both in the larger sense of deprecation and API surface reduction, and in the specific sense that we've tried to replace the URL subsystem with a more spec-compliant version. What is different about this approach that we should expect a different outcome than before?

jasnell commented 8 years ago

@chrisdickinson ... quick survey of URL related issues/prs ...

The difference with this approach is that the existing url parse would not be replaced outright. To support existing users, it would be maintained and soft-deprecated at first, then hard deprecated in the next major beyond that, giving a clear transition path. Also, given that this moves towards using an API that many js developers are already familiar with, there is increased incentive to change.

chrisdickinson commented 8 years ago

Thanks for the links.

Assuming it makes sense for us to address the problems folks are having with the url module, who maintains the new URL parser? Is this something that we will vendor from chromium? If so, how quickly does it change, & how easy is it to pull new versions? In the worst case, are we stuck maintaining two URL parsers? In the best case, do we only have the deprecated url parser, or no url parser?

The difference with this approach is that the existing url parse would not be replaced outright. To support existing users, it would be maintained and soft-deprecated at first, then hard deprecated in the next major beyond that, giving a clear transition path.

If Node v9 came out with a hard deprecation (printed warning) on all url module usage, do you imagine we'd see less url use, or less adoption of Node v9?

chrisdickinson commented 8 years ago

I agree with @TheAlphaNerd that there's definitely value in reducing the differences between browser APIs and Node APIs, but I want us to be clear about what we're trying to solve, why we're trying to solve it, and how much we think it'll cost us.

Spec adherence for spec adherence's sake has some value, but not necessarily enough value to justify moving the ecosystem's cheese. Spec adherence as a side-effect of solving concrete problems our users are facing has much more value. Does it have enough value for us to justify potentially maintaining two URL parsers into the future?

For my part, I lean towards "yes", but I'd prefer that information to be in the proposal so others can make that evaluation & refine our collective predictions about maintenance cost.

jasnell commented 8 years ago

@chrisdickinson ... the implementation will most likely be a mix of green-field and stuff adopted/borrowed from the whatwg-url module. Now that I've had a chance to dig into the details of that module's code, there is quite a bit that would need to be optimized to get the necessary performance. What I would likely do is create a fork of that module and begin working on various optimizations. From there it can be adapted into core. Done properly, the changes made to optimize performance could go back into the whatwg-url module also, allowing us to easily pull any modifications that are made there back in.

trevnorris commented 8 years ago

-1 for global, and I don't like the idea of a constructor. For quite a while I've wanted to simply change node's existing parser to match the WhatWG spec, and release it on a semver-major. Not because I give a crap about complying with the browser, but because this feature has enough edge cases where it could be confused to be the same based on only a sample of URLs. Path of least surprise and all. If that's a no go then I can live with require('url').URL.

jasnell commented 8 years ago

The path of least surprise would be to use the constructor since that's exactly how things work on the browser side and is exactly what the WHATWG spec defines.

Given enough time we could replace the require('url').parse() method to follow the more compliant parsing but part of the whole point of doing things the way I've proposed is to avoid the semver-major breaking change in the near term in favor of a more incremental approach (that is, add URL in v6; soft deprecate parse() in v7; with hard deprecation in v8).

trevnorris commented 8 years ago

Yeah. I retract my point about not wanting a constructor. Since parse() returns an object, fundamentally things wouldn't change. I'm cool with that. Still no global though. And I do think that instead of deprecating .parse() it would be better to just move to the same parser as URL().

ChALkeR commented 8 years ago

@jasnell I am not sure if hard deprecation in v8 would be achievable.

jasnell commented 8 years ago

Possibly (likely) not. As I said, it's a goal :). Realistically it would likely take longer unless we decided to simply flip the switch as @trevnorris suggests.

Jxck commented 8 years ago

questions.

  1. The WHATWG URL spec has URLSearchParams, which duplicates the role of the querystring module. Will querystring be left as is?
  2. The WHATWG URL spec also depends on the Encoding spec (https://encoding.spec.whatwg.org/). Will Encoding be a separate module, like require('encodings'), or internal only?
jasnell commented 8 years ago

The querystring module would remain unchanged. URLSearchParams would be exposed only via the URL class. Its implementation could easily be backed by the querystring module, in fact (I've already done this actually).

No part of the encodings spec would be exported or visibly implemented. Essentially, in terms of new API surface, this would only export the URL and URLSearchParams objects.
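
To show how the two overlap, here is a minimal sketch that reads the same query string through the existing querystring module and through the searchParams property of a URL instance. It assumes require('url').URL is available. The query string and the example.org host are made-up sample values, and nothing here reflects how the implementation is wired internally.

```javascript
const querystring = require('querystring');
const { URL } = require('url');

const raw = 'xyz=abc&key=value&key=other';

// Existing module: repeated keys are collected into an array.
const legacy = querystring.parse(raw);
console.log(legacy.xyz); // abc
console.log(legacy.key); // [ 'value', 'other' ]

// WHATWG style: reach URLSearchParams through a URL instance.
const params = new URL(`http://example.org/?${raw}`).searchParams;
console.log(params.get('xyz'));    // abc
console.log(params.getAll('key')); // [ 'value', 'other' ]

// Serialization is available on both sides.
const legacyString = querystring.stringify(legacy);
const paramsString = params.toString();
console.log(legacyString); // xyz=abc&key=value&key=other
console.log(paramsString); // xyz=abc&key=value&key=other
```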

Qard commented 8 years ago

I'm okay with non-global. It's likely lots of people would use a polyfill module anyway, which does the native-or-polyfill return type thing, for wider support.
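
A sketch of that native-or-fallback pattern, for illustration. URLImpl is a hypothetical name, and the fallback here is require('url').URL rather than any particular polyfill package, since none is named in this thread.

```javascript
// Prefer a global URL when the runtime provides one; otherwise fall back to the
// class exported from the url module. A real polyfill package could be a
// further fallback, but none is assumed here.
const URLImpl = typeof URL === 'function' ? URL : require('url').URL;

const parsed = new URLImpl('http://example.org:1234/p/a/t/h?xyz=abc');
console.log(parsed.hostname);                // example.org
console.log(parsed.searchParams.get('xyz')); // abc
```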

yosuke-furukawa commented 8 years ago

I am a bit concerned about schemes. According to the spec (https://url.spec.whatwg.org/#special-scheme), WHATWG URL handles the following schemes:

  • ftp
  • http
  • https
  • file
  • gopher
  • ws
  • wss

But Node.js needs to handle other schemes, like git...

I tried it and got the following results.

in chrome

> new URL('git://github.com/foo/bar');
host:""
hostname:""
href:"git://github.com/bar/buz"
origin:"git://"
pathname:"//github.com/bar/buz"
port:""
protocol:"git:"

in node

> url.parse('git://github.com/foo/bar');
protocol: 'git:',
host: 'github.com',
port: null,
hostname: 'github.com',
hash: null,
search: null,
query: null,
pathname: '/bar/buz',
path: '/bar/buz',
href: 'git://github.com/bar/buz'
Jxck commented 8 years ago

FYI, I worked on the same topic years ago (in TypeScript, in order to follow the WebIDL static interfaces). The goal was to implement a perfect WHATWG fetch, including a perfect WHATWG URL.

If you implement URL, you will need these.

The most difficult part is domain to ASCII and domain to Unicode: http://www.unicode.org/reports/tr46/#ToASCII. I gave up at this point :(

jasnell commented 8 years ago

Yeah, it will be necessary to expand the list of "special" schemes supported by the parser. This shouldn't be that difficult to do.
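
For comparison, here is a small sketch of how the same git: input parses in Node.js releases that ship a WHATWG URL class. This reflects the behaviour of current Node.js, not the 2016 Chrome output quoted above, and it assumes require('url').URL is available. It is meant only as an illustration of a non-special scheme keeping its host.

```javascript
const { URL } = require('url');

// A non-special scheme followed by "//" still gets a host component.
const git = new URL('git://github.com/foo/bar');
console.log(git.protocol); // git:
console.log(git.hostname); // github.com
console.log(git.pathname); // /foo/bar

// Non-special schemes have an opaque origin, serialized as the string "null".
console.log(git.origin);   // null
```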