Underscore should not be used for serialization.

naskooskov commented 8 years ago

The spec calls for using the underscore character ('_') as a delimiter for serializing suborigins and notes that it is an invalid character in DNS names. While that might be true, depending on the reading of the DNS specs, it is actually a valid character that works in practice. I've created an artificial example to demonstrate the problem:

$ dig _test.netsekure.org
...
;; ANSWER SECTION:
_test.netsekure.org. 2806 IN CNAME netsekure.org.
netsekure.org. 2806 IN A 96.47.74.180
...

$ dig us_test.netsekure.org
...
;; ANSWER SECTION:
us_test.netsekure.org. 2938 IN CNAME netsekure.org.
netsekure.org. 2802 IN A 96.47.74.180
...

$ curl http://us_test.netsekure.org
boo!

While curl fails resolution of http://_test.netsekure.org/, it does load successfully in Chrome Dev channel and Edge.

Furthermore, this isn't isolated to artificial DNS names: we have at least one report from a Chromium user about suborigins causing problems - https://crbug.com/617588.

The spec should consider using a different delimiter, which avoids having compatibility problems with existing sites/working code on the web.

devd commented 8 years ago

hmm .. I thought it was invalid for hosts in URIs rather than DNS. I am surprised it works in Chrome.

sleevi commented 8 years ago

As @naskooskov points out, of all things, underscores are perhaps the worst character as a delimiter, as they're in wide use, and application software fails to filter them.

Examples of further bugs/weirdness in Chrome regarding underscores (e.g. they're permitted in all of these different layers, and as such, have become used) https://bugs.chromium.org/p/chromium/issues/detail?id=496472 https://bugs.chromium.org/p/chromium/issues/detail?id=496468 https://bugs.chromium.org/p/chromium/issues/detail?id=463410

You can see this issue coming up in TLS certificates, such as https://cabforum.org/pipermail/public/2016-April/007210.html / https://cabforum.org/pipermail/public/2016-April/007244.html

The assumption that underscore works only works if you assume hostnames will always follow the A/AAAA record / hostname preferred name form, but that's not really true in the real world. Most operating systems provide extension hooks for name resolution - whether systems like Bonjour, NetBIOS, nsswitch - so you simply cannot rely on DNS rules nor URL rules to give you a safe delimiter that doesn't cause compatibility issues.

sleevi commented 8 years ago

For a Firefox discussion, https://bugzilla.mozilla.org/show_bug.cgi?id=1136616 - Note that Firefox itself doesn't prohibit underscores in DNS names, nor does IE, nor does Safari - which is why these names have existed.

The LDH rule doesn't forbid them either, they're just not the "preferred name syntax".

This gets even messier when you talk CNAMEs vs A/AAAA (see https://cabforum.org/pipermail/public/2013-August/002062.html for more historical discussion)
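To make the "preferred name syntax" point concrete, here is a minimal Python sketch of a letters-digits-hyphen (LDH) check. The function name `follows_preferred_syntax` is hypothetical, and the pattern is a simplified reading of RFC 1035 as relaxed by RFC 1123, not something taken from any browser. It shows only that the names from the first comment fall outside the preferred syntax, even though they resolve in practice.

```python
import re

# Simplified LDH label: letters, digits and hyphens only, no leading or
# trailing hyphen, at most 63 characters. This is an illustrative reading of
# the "preferred name syntax", not an authoritative validator.
_LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def follows_preferred_syntax(hostname: str) -> bool:
    """Return True if every label of hostname is an LDH label."""
    labels = hostname.rstrip(".").split(".")
    return all(_LDH_LABEL.fullmatch(label) is not None for label in labels)


for name in ("netsekure.org", "us_test.netsekure.org", "_test.netsekure.org"):
    print(name, follows_preferred_syntax(name))
```

A resolver or browser that does not run a check like this will pass the underscore names straight through, which matches the behavior reported above.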

devd commented 8 years ago

We do need some delimiter though. Is there a delimiter that works?

I am also curious why this is breaking so spectacularly. I am happy to say "you don't get suborigin protections if you use underscores in your subdomains"

sleevi commented 8 years ago

> Is there a delimiter that works?

Don't smuggle it in the hostname? Fundamentally, it's a question of whether the view of Web Browsers is that the only naming system is DNS, in which case, it simplifies @annevk's work with the URL Spec, or whether browsers plan to continue to do what they have always historically done, which is to be agnostic about the underlying naming system in play (whether it be WINS/NETBIOS, /etc/resolv.conf, the hosts file, mDNS/Bonjour, nsswitch, etc).

Even if you assume DNS, you have to decide whether to use the preferred name syntax (modified, of course, by the real world) - which I should note, I'm not aware of any implementation that enforces this syntax in the browser space - or whether you want to use the full DNS space. In DNS, it's an 8-bit protocol - there's no unused place to smuggle in records.
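The "8-bit protocol" remark can be illustrated with a small Python sketch of DNS wire-format name encoding: each label is a length octet followed by raw octets, and the name ends with a zero octet. The function name `encode_dns_name` is made up for this example, and the sketch ignores name compression.

```python
def encode_dns_name(labels: list[bytes]) -> bytes:
    """Encode labels in DNS wire format: length octet, then raw octets."""
    out = bytearray()
    for label in labels:
        if not 1 <= len(label) <= 63:
            raise ValueError("a label must be 1 to 63 octets long")
        out.append(len(label))
        out += label
    out.append(0)  # the zero-length root label terminates the name
    if len(out) > 255:
        raise ValueError("an encoded name must not exceed 255 octets")
    return bytes(out)


# Underscores, dollar signs and arbitrary bytes all encode without complaint.
print(encode_dns_name([b"us_test", b"netsekure", b"org"]))
print(encode_dns_name([b"$foo\xff", b"example", b"com"]))
```

Nothing in this encoding sets aside a byte value that a serialization could claim for itself, which is the point being made about the full DNS space.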

> I am also curious why this is breaking so spectacularly.

Because you picked a delimiter that happened to be in wide use?

devd commented 8 years ago

But even if it is https://suborigin_example.com, why is it breaking? Is the implementation currently assuming that anything with _ is always in a suborigin? We can track that separately on the document; wdyt @metromoxie ?

sleevi commented 8 years ago

@devd I'm not sure I understand your question "why is it breaking". Are you talking about the Chrome bug? Talking about Curl? Talking about the server interactions it causes? You'll need to be more specific.

devd commented 8 years ago

I think this is more for @metromoxie to answer, but my point is: what should happen is the suborigin foo on www.example.com should be treated as "same-origin" with the actual domain foo_www.example.com and thus break the security of suborigins if www.example.com was relying on it. But why is a page executing on foo_www.example.com breaking?
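The same-origin collision described here can be sketched in a few lines of Python. `serialize_origin` is a hypothetical helper that follows the `suborigin_host` form used in this thread; ports are left out for brevity, which is an assumption of the sketch and not a statement about the spec.

```python
from typing import Optional


def serialize_origin(scheme: str, host: str, suborigin: Optional[str] = None) -> str:
    """Serialize an origin using the underscore form: scheme://suborigin_host."""
    if suborigin is None:
        return f"{scheme}://{host}"
    return f"{scheme}://{suborigin}_{host}"


# Suborigin "foo" on www.example.com versus the real host foo_www.example.com.
in_suborigin = serialize_origin("https", "www.example.com", suborigin="foo")
plain_host = serialize_origin("https", "foo_www.example.com")
print(in_suborigin)
print(plain_host)
print(in_suborigin == plain_host)
```

Two distinct security contexts end up with the same string, so any comparison based on the serialized form treats them as same-origin.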

sleevi commented 8 years ago

That's already answered on the bug @naskooskov mentioned

joelweinberger commented 8 years ago

It's in the bug, but for those who don't care to wade through Chrome's bug tracker, it's because when Blink sees the serialized Suborigin, instead of deserializing foo_bar to the host foo_bar, it deserializes it to host of foo and suborigin of bar. So it's not that Suborigins are broken, it's that regular host URLs are broken.

@sleevi, it shouldn't leave the Web Browser and Server (or at least the Web Browser, Server, and related plugins), so I guess I see less of a problem here of "smuggling" it in the host name. For example, Chrome should not be making network requests "to" a host that has a serialized suborigin in it, so I don't think it's relevant what the underlying naming system allows. If it does hit the network, that's a bug.

At the very least, it seems clear that a simple "_" delimiter is no good, and a leading "_" is also probably not good. There's always the original suggestion of "$", which I always kind of liked, but someone thought was ugly. I, frankly, don't care :-)

sleevi commented 8 years ago

$ of course has issues with WINS ;)

The concern is not with hitting the network, it's with any code that parses URLs and does exactly what you're doing - seeing foo_bar and converting it to host foo with suborigin bar, when the author intended the host foo_bar.
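A rough Python sketch of the kind of parsing code this concern is about. `naive_deserialize_host` is hypothetical and is not Blink's actual logic; it assumes the `suborigin_host` ordering used elsewhere in this thread and simply splits on the first underscore.

```python
from typing import Optional, Tuple


def naive_deserialize_host(host: str) -> Tuple[Optional[str], str]:
    """Split on the first underscore and treat the prefix as a suborigin."""
    suborigin, sep, rest = host.partition("_")
    if not sep:
        return None, host  # no underscore: an ordinary host
    return suborigin, rest


print(naive_deserialize_host("www.example.com"))
print(naive_deserialize_host("foo_bar.example.com"))
```

The second call reports a suborigin of `foo` on host `bar.example.com`, although the author intended the single host `foo_bar.example.com`. The parser has no way to tell the two cases apart from the string alone.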

joelweinberger commented 8 years ago

Specifically, when that happens with an HTTP[S] protocol. Any other time it will be just passed along.

So far you suggested not serializing it in the hostname. Do you have any suggestions for what you think should be done instead?

sleevi commented 8 years ago

@metromoxie But you can use mDNS/WINS/etc with HTTP[S] (ok, it's even more gross for HTTPS, because SNI uses DNS's preferred name syntax, but implementations inconsistently enforce it).

I don't know enough about your design constraints and tradeoffs to make a constructive alternative suggestion; I'm merely trying to educate on what appears to have been an unexpected/unknown consequence of the current design.

bsittler commented 8 years ago

I suspect this all depends a lot on which parts of the possibly-spec-noncompliant-but-actually-slightly-used fringe parts of the web you are hoping to avoid breaking. For example, I believe IE and Edge already forbid cookie operations on hostnames containing underscores. So while such sites certainly exist, they may well be significantly broken already. A double-hyphen pattern analogous to the IDNA xn-- might work, as all such patterns other than xn-- are (I believe) reserved and currently disallowed in DNS registrations. This also brings to mind the question of how suborigins interact with IDNA, but perhaps this is a non-issue since they aren't used as locations (still, it could add complexity to any UI needing to render the suborigins human-readable if the suborigins are in the same dotted segments as the IDNA prefix.)

How about xs--foo.example.com ? I think it's very unlikely to be widely used.

sleevi commented 8 years ago

> as all such patterns other than xn-- are (I believe) reserved and currently disallowed in DNS registrations

It wasn't prohibited by the original IDNA work - https://tools.ietf.org/html/rfc3490#section-5 - merely certain strings were 'disallowed', and the 'disallowed' is "SHOULD NOT", and applies to registration.

This was updated by https://tools.ietf.org/html/rfc5890#section-2.3.1 which defines "reserved LDH labels"

Meaning, of course, that there's nothing prohibiting someone from doing xs--whatever.example.com. AFAICT, it also means complexity if you're trying to set a sub-origin on xn--punycode.example.com; presumably you'd need to encode as xs--suborigin[some delimiter prohibited in suborigin names]_xn--punycode.example.com, and then you also get into issues such as systems that have restrictions on label length (63 characters) and how they enforce this.
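The label-length concern is easy to see with numbers. This is a minimal Python sketch under stated assumptions: `prefixed_first_label` is a hypothetical helper, the `xs--` prefix and the `_` delimiter are just the strawman values from this thread, and 63 is the per-label limit mentioned above.

```python
MAX_LABEL_OCTETS = 63  # per-label limit mentioned in the comment above


def prefixed_first_label(suborigin: str, host: str, delimiter: str = "_") -> str:
    """Fold a suborigin into the first label: xs--<suborigin><delimiter><label>."""
    first_label = host.split(".", 1)[0]
    return f"xs--{suborigin}{delimiter}{first_label}"


def fits_in_label(label: str) -> bool:
    """Check the combined label against the per-label length limit."""
    return len(label.encode("ascii")) <= MAX_LABEL_OCTETS


short = prefixed_first_label("foo", "xn--punycode.example.com")
long = prefixed_first_label("a" * 40, "xn--" + "b" * 30 + ".example.com")
print(short, len(short), fits_in_label(short))
print(len(long), fits_in_label(long))
```

A 40-character suborigin combined with a 34-character punycode label already exceeds the limit, so any system that enforces it would reject the combined label.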

To the extent that's probable, I'm not sure, but it speaks to the same issue as using $ or ~ or ! as a delimiter, rather than _ - which is squatting on parts handled by/defined by other RFCs. I'm not fundamentally objecting, even to $ but I think it needs to be made as a deliberate, considered choice, and knowingly documented as such.

bsittler commented 8 years ago

Correct, and thanks for linking to the authoritative sources there. I meant to propose xs--suborigin., which avoids the punycode-recoding and segment length problems by putting the suborigin in its own segment. It does still break if you need punycode for the suborigin itself, but I hope you don't because that gets really ugly 😉

joelweinberger commented 8 years ago

Hi all. I'd like to revive this discussion since, well, I'm back from paternity leave and am ready to actually start addressing this :-)

I think I'm not understanding some of the complexity in the last few comments, but let's start with xn--suborigin${name}$ (where {name} is the actual suborigin namespace) as a strawman prefix. What are our concerns at this point? I'll list what I think some of them are, and why I think they may not be terrible. I'd appreciate any tear-up of it you all care to make.

It could collide with existing uses of xn--suborigin${name}$ in the wild, which (like the original bug) we'd deserialize incorrectly, and be sad about. We can count how many occurrences of this prefix we see in the wild today to get a sense of how destructive it would be, but it would indeed be breaking, and it would have to be a strongly considered decision.
Some underlying systems have length limits, which this might cause us to go past. I don't think this is an issue, though, as long as suborigins don't exit Blink and hit the underlying system?
Adds complexity re: already-existing punycode hostnames, as well as if we need punycode in the suborigin itself. None of these seem like things we can't deal with, though.

sleevi commented 8 years ago

Well, xn-- is reserved. Did you mean xs--? :)

The point was that any label with -- as its third and fourth characters is reserved by the IETF. It would seem unacceptable for the W3C to conflict with that.
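For reference, the reserved-label rule from RFC 5890 section 2.3.1 comes down to a position check. The helper name `is_reserved_ldh_label` is hypothetical, and the sketch only tests the hyphen position; it does not verify that the rest of the label is LDH.

```python
def is_reserved_ldh_label(label: str) -> bool:
    """True if the label has "--" as its third and fourth characters."""
    return len(label) >= 4 and label[2:4] == "--"


for label in ("xn--punycode", "xs--foo", "foo--bar", "example"):
    print(label, is_reserved_ldh_label(label))
```

Both `xn--punycode` and `xs--foo` match, while `foo--bar` does not, because its hyphens sit in the fourth and fifth positions.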

annevk commented 8 years ago

Given https://url.spec.whatwg.org/#host-parsing a space or some such is probably fine. Any reason we don't change the scheme? Presumably this only works over HTTP anyway.

arturjanc commented 8 years ago

While https://crbug.com/617588 is a problem and needs fixing, I have to say I don't fully understand the complexity of deciding on the delimiter and the difficulty in making progress here.

Suborigins should almost never be visible to developers or users because you can't make a request to a suborigin -- the fact that a document is in a suborigin is controlled exclusively by the response header. Whenever a request is being sent by a browser, it should be agnostic of suborigins, i.e. loading foo_example.com should make a DNS request for the foo_example label just like previously. (The risk of a SOP clash of example.com with Suborigin: foo and a hostname of foo_example.com is minor and can be easily pushed off to developers who want to use Suborigins to change such hostnames.)
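The separation argued for here, where the request side never sees a suborigin and only the response header adds one, might look like the following Python sketch. The `Origin` tuple and both function names are hypothetical, ports are omitted, and the header lookup is a plain case-sensitive dictionary access for brevity.

```python
from typing import Mapping, NamedTuple, Optional
from urllib.parse import urlsplit


class Origin(NamedTuple):
    scheme: str
    host: str
    suborigin: Optional[str]


def request_host(url: str) -> str:
    """The name to resolve comes from the URL alone, underscores included."""
    return urlsplit(url).hostname or ""


def document_origin(url: str, response_headers: Mapping[str, str]) -> Origin:
    """Attach the suborigin from the response header, not from the host."""
    parts = urlsplit(url)
    return Origin(parts.scheme, parts.hostname or "", response_headers.get("Suborigin"))


print(request_host("http://foo_example.com/"))
print(document_origin("https://example.com/", {"Suborigin": "foo"}))
print(document_origin("http://foo_example.com/", {}))
```

Kept as structured data, the two origins never collide. The collision only appears once both are flattened into one string.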

It seems reasonable to have a serialization of suborigins and it's useful in a few cases (e.g. for postMessage: https://w3c.github.io/webappsec-suborigins/#unsafe-postmessage-receive). Assuming that Chrome can fix their logic when making requests to not be confused by the suborigins serialization, is there still a problem with keeping _ as a separator? If so, would one of the reserved characters in https://url.spec.whatwg.org/#host-parsing be better?

sleevi commented 8 years ago

@arturjanc The problem lies exactly with your statement that suborigins should have a serialization. This naturally follows that they need a deserialization, and the whole discussion so far has been how to ensure that the deserialization is unambiguous.

The discussion of network requests is incidental, and a symptom, rather than the bug itself, of having an ambiguous serialization/deserialization.

arturjanc commented 8 years ago

This makes sense. My underlying concern (poorly expressed) is that to suborigins the value of having a serialization in the first place seems fairly low; it could almost be user-agent-dependent if not for the fact that the serialization is specified to be visible to developers in the origins of messages sent/received with postMessage, and probably to avoid confusion in stuff like document.domain.

So while I understand the reasons for rethinking suborigin serialization, the outcome doesn't seem likely to significantly improve the spec. That is, in the absence of crbug/617588 we could likely go on our merry way with "_" as the delimiter and the sky wouldn't fall (the problems you're talking about would of course still exist, but in practice they wouldn't affect suborigin adoption and behavior).

As usual, it's up to @metromoxie and @devd to decide on the best course of action, and perhaps the answer is to do as you suggest; I'm just worried that it will extend the already protracted process of speccing out suborigins without too much of a tangible benefit. With that said, I'm happy to let y'all duke it out here...

joelweinberger commented 8 years ago

@sleevi Yes, I meant xs--, and yes, I also misunderstood your point that these are IETF reserved namespaces.

@annevk Ah, so you're suggesting that basically any character from the list in step 5 of https://url.spec.whatwg.org/#host-parsing is probably acceptable because today, spec compliant user agents should be giving syntax errors anyway?

To your second question, I looked into encoding in the scheme, but the usability and implementation complications seemed overall tougher. Certainly in Chrome there is a lot more code that assumes all protocols are one of a short whitelist of protocols, and certainly that there isn't a dynamic/unbounded list of them. In general, it seemed more logical to encode it in the already-dynamic host than the usually-static protocol.

No matter what we choose, it's going to basically have to be something which, in practice, we don't see in the wild. I'll brainstorm over the options tomorrow, pick one, and then we'll have to measure what, in practice, is OK.

annevk commented 8 years ago

@metromoxie yeah, they should result in a fatal error. I'm still not quite sure why the scheme cannot be used if this is only for serialization. Sure, there's safelists, but they won't be hit by this. At least thus far whenever we have origin comparisons they're either serialized (i.e., string) comparisons or object comparisons. Presumably in origin/suborigin objects we wouldn't put it in the scheme, but in some field.

joelweinberger commented 8 years ago

@annevk I'm happy to talk about moving it to the port again, but we really did try a while back, and it didn't seem like the better answer for a number of reasons, so I'd rather talk about that in a separate issue.

I'm going to update the draft to use a serialization of $suborigin$hostname. Running some quick experiments, all of the major UAs already do not allow a start char of $, and this plays nicely with the URL spec.

In practice, since this doesn't involve hitting the networking stack, I think this is similar to the cookie prefix approach where, as long as in practice we don't see hostnames like this, it seems like the UAs should be safe to adopt this. And given that I believe all of the major UAs reject URLs of this form currently, that appears to be safe. Feel free to lay down your disagreements, though ;-)

annevk commented 8 years ago

You mean that new URL("https://$test/") throws? Seems fine then, although I’m still skeptical about issues with schemes.

annevk commented 8 years ago

Btw, I suggested scheme, not port. Port has all kinds of parser limitations.

annevk commented 8 years ago

$ is not restricted so I'm not sure why we are going with that? I tend to agree with @sleevi that extending host is fragile. I don't see any such issue with scheme.
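A small Python sketch of what "not restricted" means in terms of the URL Standard's host parsing. The set below is deliberately partial and only for illustration (the standard has the authoritative list), and `is_forbidden_host_code_point` is a made-up name.

```python
# Partial, illustrative subset of code points that the URL Standard's host
# parser rejects. Assumption: consult the standard for the complete list.
FORBIDDEN_HOST_CODE_POINTS = frozenset(" #/:?@[\\]")


def is_forbidden_host_code_point(ch: str) -> bool:
    """True if ch is in the illustrative forbidden subset above."""
    return ch in FORBIDDEN_HOST_CODE_POINTS


for ch in ("$", "_", " ", "@"):
    print(repr(ch), is_forbidden_host_code_point(ch))
```

Neither `$` nor `_` is in the subset, so a host containing them is not rejected on these grounds. A space or `@` would be, but those characters carry their own parsing baggage.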

joelweinberger commented 8 years ago

I meant scheme not port in my post earlier, sorry :-)

I'm not sure how I messed up with $ there; I guess I was seeing what I wanted to see when I looked at the URL spec. As I said before, we tried scheme earlier, and I found it difficult to implement, and I think we also ran into some trouble with applications that were confused by non-HTTP/HTTPS schemes.

That having been said, as you point out, there aren't a lot (any?) usable characters that are reserved, so I'm not seeing a lot of options with host serialization. I'm going on vacation for the week, so I'll ponder over the scheme syntax, something like https$suborigin://hostname.

sleevi commented 8 years ago

@metromoxie Knowing Chrome's URL implementation, it may be useful to consider https-so://suborigin.hostname - Use the scheme as the signal to treat the first label as a sub origin.

This should resolve the ambiguity issues with embedding it in https:// schemed resources, while avoiding the complexities of needing to make the scheme parser accepting of any input. It also avoids any special character delimiters (beyond the existing host label separator), and makes it simple to test if the serialization 'leaks' past any abstraction boundaries. It should also make it considerably simpler for you to implement within //src/url in Chrome.
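A minimal Python sketch of the scheme-as-signal idea, to show why it round-trips cleanly. The `serialize` and `deserialize` helpers are hypothetical. The sketch assumes suborigin names contain no dots and ignores ports, and it is not meant to mirror how Chrome's URL code would do this.

```python
from typing import Optional, Tuple

SUBORIGIN_SCHEME_SUFFIX = "-so"


def serialize(scheme: str, host: str, suborigin: Optional[str] = None) -> str:
    """Use the scheme suffix to signal that the first label is a suborigin."""
    if suborigin is None:
        return f"{scheme}://{host}"
    return f"{scheme}{SUBORIGIN_SCHEME_SUFFIX}://{suborigin}.{host}"


def deserialize(origin: str) -> Tuple[str, str, Optional[str]]:
    """Return (scheme, host, suborigin); suborigin is None for plain origins."""
    scheme, _, host = origin.partition("://")
    if not scheme.endswith(SUBORIGIN_SCHEME_SUFFIX):
        return scheme, host, None
    suborigin, _, real_host = host.partition(".")
    return scheme[: -len(SUBORIGIN_SCHEME_SUFFIX)], real_host, suborigin


print(serialize("https", "example.com", "foo"))
print(deserialize("https-so://foo.example.com"))
print(deserialize("https://foo_bar.example.com"))
```

A plain `https://` origin is never reinterpreted, whatever characters its host contains, and a leaked serialization is easy to spot because no ordinary URL uses the `-so` scheme.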

annevk commented 8 years ago

That sounds reasonable. Thanks @sleevi.

mikewest commented 8 years ago

@sleevi's suggestion seems reasonable.
If postMessage is the only place where we're actually worried about the serialization (because developers are presumably doing string checks against message origins), I wonder if we could simplify this (for some value of "simplify") by making the serialization something absurd (like the empty string), and defining a new Origin primitive alongside URL that we could hang properties off of.

joelweinberger commented 8 years ago

It's also needed for CORS (see https://metromoxie.github.io/webappsec-suborigins/#cors-ac), but I suppose you can make the same argument? I guess my main objection is that having a serialization for postMessage/CORS allows APIs that are just checking "is the origin the same" to "just work" without retrofitting. For example, there's no reason a postMessage API today that simply checks that two requests are coming from the same origin should need to be retrofitted to suborigins per se.
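The "just work" argument can be sketched as a plain string comparison. This Python snippet only models the check an existing message handler might perform; `same_origin` is a hypothetical name, and the `https-so://` strings assume the serialization proposed earlier in the thread.

```python
def same_origin(origin_a: str, origin_b: str) -> bool:
    """Existing-style check: compare serialized origins as strings."""
    return origin_a == origin_b


# Same suborigin on the same host: the unchanged check still passes.
print(same_origin("https-so://foo.example.com", "https-so://foo.example.com"))
# A suborigin versus the plain host: the unchanged check rejects it.
print(same_origin("https-so://foo.example.com", "https://example.com"))
```

Because the suborigin is part of the string, code that was written before suborigins existed keeps them apart without modification.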

annevk commented 8 years ago

I don't really see the reason to have an object-representation of origins. If the only use case is suborigins a new serialization seems much more straightforward.

mikewest commented 8 years ago

Let's take the origin object discussion elsewhere; for this bug, I think we have a good path forward in @sleevi's https-so://suborigin.hostname suggestion.

joelweinberger commented 8 years ago

Indeed. I'm working on implementing it just to sanity check that I'm not missing any complexity.

naskooskov commented 8 years ago

I'm just wary of devs needing to parse URLs, as there have been multiple examples of them getting them wrong. Exposing an actual property of what the browser believes the suborigin is sounds much more robust IMHO.

joelweinberger commented 8 years ago

Currently, we have both, so it's not strictly either/or (and, in fact, that's why we also have the "opt-outs" for unsafe serialization, so you can stay backwards compatible and effectively promise to explicitly check the suborigin property when a security check is needed).

joelweinberger commented 7 years ago

This has been addressed by the newest serialization, so closing.

w3c / webappsec-suborigins

Underscore should not be used for serialization. #38