url: resolve does not canonicalize .s and ..s when switching protocols

domenic commented 11 years ago

URL.resolve('http://a.com/b/../c/./a', 'http:/q/r/./c/d/.././e/../f');
URL.resolve('http://a.com/b/../c/./a', 'https:/q/r/./c/d/.././e/../f');

Gives

http://a.com/q/r/c/f
https://q/r/./c/d/.././e/../f

The first is correct; the second does not match browsers; see e.g.:

<html>
<head>
<base href="//a.com/b/../c/./a" />
</head>
<body>
<a href="https:/q/r/./c/d/.././e/../f" id="x"></a>
<script>
alert(document.getElementById("x").href);
</script>
</body>
</html>

bnoordhuis commented 11 years ago

You're mixing protocols here, http and https. I'm not sure if there is a canonical, unambiguously good way to deal with that except maybe by taking the path of least resistance and doing what browsers do, assuming they all implement it the same way. There is also the question of what to do when you're dealing with disparate URL schemes, say, file:// and git+ssh://.

domenic commented 11 years ago

Yes, browsers all implement this in the same way, according to the URL spec. http://url.spec.whatwg.org/

bnoordhuis commented 11 years ago

Can you point me to where in that spec the canonical approach to merging base URLs and normal URLs is documented? From my reading, it acknowledges that base URLs exist but doesn't explain how to deal with them.

awwright commented 11 years ago

It appears to me that neither of the described behaviors is correct. It is necessarily true that if a scheme is defined, then the URI reference is an absolute URI. The correct behaviors for resolving the URIs are:

resolve(<http:/q/r/./c/d/.././e/../f>) -> <http:/q/r/c/f>
resolve(<https:/q/r/./c/d/.././e/../f>) -> <https:/q/r/c/f>

But let me investigate... It may be the case there is scheme-specific behavior which the Web browser is allowed to apply, if the authority is undefined (not merely blank).

It likely has to do with strict resolution vs. non-strict resolution.

Node.js should definitely be resolving those paths in any case, though.
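
For reference, the path normalization being asked for corresponds to the remove_dot_segments step of RFC 3986, section 5.2.4. The following is only a minimal sketch of that step, assuming the input path is absolute (begins with "/"). The name `removeDotSegments` is illustrative and is not part of Node's url module.

```javascript
// Sketch of RFC 3986 section 5.2.4 (remove_dot_segments).
// Assumption: `path` is absolute (starts with "/"). This version splits on "/"
// instead of consuming the input buffer prefix by prefix as the RFC does.
function removeDotSegments(path) {
  const segments = path.split('/');
  const output = [];
  for (let i = 0; i < segments.length; i++) {
    const seg = segments[i];
    const isLast = i === segments.length - 1;
    if (seg === '.') {
      // "." is dropped; a trailing "." leaves a trailing slash.
      if (isLast) output.push('');
    } else if (seg === '..') {
      // ".." removes the previous segment, but never the leading root.
      if (output.length > 1) output.pop();
      if (isLast) output.push('');
    } else {
      output.push(seg);
    }
  }
  return output.join('/');
}

console.log(removeDotSegments('/q/r/./c/d/.././e/../f')); // /q/r/c/f
```

Applied to the path of either reference in this issue, the sketch yields /q/r/c/f, regardless of which scheme precedes it.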

domenic commented 11 years ago

@bnoordhuis from my understanding the parsing algorithm (which also is responsible for resolution) doesn't distinguish between "base" and "normal" URLs, but simply gives the algorithm for parsing one URL with respect to another base URL; both can be in any form (since in practice, web browsers allow any value for the base[href] and a[href] attributes). The algorithm is at http://url.spec.whatwg.org/#parsing

@annevk can tell us more
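
As an illustration of the parsing algorithm referred to above, the two inputs from this issue can be run through the WHATWG `URL` class. This is a sketch that assumes a runtime providing that class as a global (browsers and current Node.js releases do; it postdates the url.resolve function discussed in this thread).

```javascript
// Assumption: the WHATWG URL class is available as a global.
const base = 'http://a.com/b/../c/./a';

// Same scheme as the base: the single-slash form is handled as a relative URL,
// so the host is taken from the base.
const same = new URL('http:/q/r/./c/d/.././e/../f', base).href;

// Different scheme: "q" is parsed as the host, and dot segments are still removed.
const switched = new URL('https:/q/r/./c/d/.././e/../f', base).href;

console.log(same);     // http://a.com/q/r/c/f
console.log(switched); // https://q/r/c/f
```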

awwright commented 11 years ago

What do we mean by "normal" URL? A URL resolution function looks like: resolve(URI base, URIref reference) -> URI resolved

So do you mean the URI reference being resolved?

The behavior appears to be due to the fact that Node.js applies non-strict URI resolution behavior: If the schemes are identical, it ignores the scheme in reference.

The URI definition allows schemes to apply some scheme-specific behaviors. Like I said, Web browsers may be applying some scheme-specific behavior, which is likely legal (the URI is defined in RFC 3986, which provides for a number of scheme-specific behaviors). If so, this would be the task of a scheme-specific URL resolver to apply. I don't think that is what is happening here, though: a scheme-specific behavior must necessarily yield the same resource as the one resolved by the generic resolver. If there were scheme-specific handling of the authority, it could imply that one domain name identifies the same resource as another domain, which is obviously not (necessarily) true (so maybe it's not legal).

But I need to take a look at the source file; Node.js is doing something wrong. At the very least, the paths should be fully normalized regardless.
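
The non-strict behavior mentioned in this comment is the optional pre-step in RFC 3986, section 5.2.2: when parsing is not strict and the reference's scheme equals the base's scheme, the reference's scheme is treated as undefined. The sketch below isolates only that step. `schemeOf` and `dropSameScheme` are hypothetical helper names, and this is not how Node's url module is actually written.

```javascript
// Regular expression from RFC 3986 appendix B for splitting a URI reference.
const URI_PARTS = /^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;

// Hypothetical helper: return the lowercased scheme, or undefined if absent.
function schemeOf(uri) {
  const m = URI_PARTS.exec(uri);
  return m[2] === undefined ? undefined : m[2].toLowerCase();
}

// Hypothetical helper: apply the non-strict pre-step of section 5.2.2.
// If the schemes match and `strict` is false, the scheme is removed so the
// rest is resolved as a relative reference against the base.
function dropSameScheme(base, reference, strict) {
  const scheme = schemeOf(reference);
  if (!strict && scheme !== undefined && scheme === schemeOf(base)) {
    return reference.slice(reference.indexOf(':') + 1);
  }
  return reference;
}

const issueBase = 'http://a.com/b/../c/./a';
console.log(dropSameScheme(issueBase, 'http:/q/r/./c/d/.././e/../f', false));
// /q/r/./c/d/.././e/../f  (now relative, so the base authority is reused)
console.log(dropSameScheme(issueBase, 'https:/q/r/./c/d/.././e/../f', false));
// https:/q/r/./c/d/.././e/../f  (schemes differ, so it stays absolute)
```

Under this reading, only the same-scheme reference goes down the relative-resolution path; the https reference is treated as absolute, and its path still needs dot-segment removal.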

annevk commented 11 years ago

@Acubed actually, STD66 does not allow scheme-specific reference resolution. But STD66 is not what browsers implement (and what they implement does in fact violate STD66). The specification @domenic mentioned attempts to match what browsers implement, but unfortunately browsers are still different and have not quite converged yet (as they have for HTML parsing).

awwright commented 11 years ago

@annevk Well I can imagine a consistent scheme-specific resolution function that can return a different URI for the same resource -- for instance, http://example.com:80 is the same resource as http://example.com/ and therefore a scheme-specific resolution function could return the latter (being the normalized form). The generic URI resolution function already performs path-based normalization on the result, so this shouldn't even be unexpected. Or with an application/json document (under a proposed JSON Pointer regime), http://example.com/document.json and http://example.com/document.json# will refer to the same resource, and so could return either. (Actually this is an example of a media-type specific behavior, but the point holds.)

But certainly, the URI resolution function must not do this, and that isn't behavior that should be in a generic URI library.

annevk commented 11 years ago

Port 80 being the default port has no bearing on relative reference resolution (browsers do further normalize such URLs, and that is defined by the URL Standard, but that's not really the interesting part). JSON Pointer does not apply to all JSON resources by default and also has no bearing on relative reference resolution. Furthermore, fragment identifier semantics are local. Resource-wise you'd get the same either way; the fragment identifier is irrelevant for that purpose.
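
The default-port normalization mentioned here can be observed with the WHATWG `URL` class. This is a small sketch, again assuming a runtime where that class is a global; the variable names are illustrative.

```javascript
// Assumption: the WHATWG URL class is available as a global.
const withDefaultPort = new URL('http://example.com:80');
console.log(withDefaultPort.href); // http://example.com/
console.log(withDefaultPort.port); // "" (the default port for http is dropped)

const withOtherPort = new URL('http://example.com:8080');
console.log(withOtherPort.href);   // http://example.com:8080/

// The port normalization is separate from relative reference resolution:
const resolved = new URL('x', 'http://example.com:80/a/b').href;
console.log(resolved);             // http://example.com/a/x
```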

awwright commented 11 years ago

Well yes, but that's beside the point: there can be scheme- and media-type-specific normalizations (such as the http/https normalization defined in RFC 2616, section 3.2.3), and these may be performed in a resolution step. There is also one normalization that is always applied, and it is applied in the URI resolution step.

The issue here is that Node.js fails to do this normalization step in some circumstances.

As an aside, the URI specification states: "The semantics of a fragment identifier are defined by the set of representations that might result from a retrieval action on the primary resource. The fragment's format and resolution is therefore dependent on the media type". As a matter of standard, a fragment on a JSON document URI currently has no meaning except to identify different but otherwise arbitrary resources. The draft currently doesn't even adopt language that would make that so (even though that's its clear purpose), but I've been told by the editor that this is going to be fixed.

jasnell commented 9 years ago

@domenic ... I know there's ongoing effort on the io.js side to improve the url module conformance and functionality. Is there any update on this particular issue? Do we need to keep this open to continue to track?

domenic commented 9 years ago

Yes, we should keep this open to track.

nodejs / node-v0.x-archive

url: resolve does not canonicalize .s and ..s when switching protocols #5453