Consider removing nextHopProtocol as it may expose whether visitor is using VPN / proxy

kdzwinel commented 4 years ago

Spec states that the value of the nextHopProtocol can be different depending on use of a proxy:

When a proxy is configured, if a tunnel connection is established then this attribute MUST return the ALPN Protocol ID of the tunneled protocol, otherwise it MUST return the ALPN Protocol ID of the first hop to the proxy.

This suggests that a website, having knowledge about the resources being loaded and expected nextHopProtocol values, can detect visitors using a proxy. This could be abused to enforce geo-restrictions and prosecute (in certain parts of the world) users using proxy software.

Since the user agent may be unable to determine a safe value for nextHopProtocol when the connection is tunneled, we suggest that this property is dropped.
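To make the concern concrete, here is a minimal sketch of the comparison a site could run. It is only an illustration: the `ResourceEntryLike` shape, the `findProtocolMismatches` helper and the expected-protocol map are hypothetical names, not part of the spec. Entries with an empty `nextHopProtocol` are skipped, on the assumption that an empty string means no protocol information was exposed.

```typescript
// Hypothetical shape: only the two PerformanceResourceTiming fields used here.
interface ResourceEntryLike {
  name: string;            // resource URL
  nextHopProtocol: string; // ALPN protocol ID such as "h2" or "http/1.1"; may be ""
}

interface ProtocolMismatch {
  url: string;
  expected: string;
  observed: string;
}

// Returns the entries whose reported protocol differs from what the site expects.
function findProtocolMismatches(
  entries: ResourceEntryLike[],
  expectedByUrl: Map<string, string>,
): ProtocolMismatch[] {
  const mismatches: ProtocolMismatch[] = [];
  for (const entry of entries) {
    const expected = expectedByUrl.get(entry.name);
    // Skip resources with no expectation or no reported protocol.
    if (expected === undefined || entry.nextHopProtocol === "") continue;
    if (entry.nextHopProtocol !== expected) {
      mismatches.push({ url: entry.name, expected, observed: entry.nextHopProtocol });
    }
  }
  return mismatches;
}
```

In a browser, the entries would come from `performance.getEntriesByType("resource")`. A non-empty result is at best a hint, not proof, that something sits between the client and the origin.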

yoavweiss commented 4 years ago

This suggests that a website, having knowledge about the resources being loaded and expected nextHopProtocol values, can detect visitors using a proxy.

The different protocols that are being reported can vastly differ in their performance characteristics, so I'd imagine that a proxy that wants to be undetected would mirror the protocols that it establishes with the server to the client.

This could be abused to enforce geo-restrictions and prosecute (in certain parts of the world) users using proxy software.

Do we have cases where we see this happening? Can those geo-restrictions distinguish "legitimate" proxies from "illegitimate" ones? ("legitimate" may be AV software, corporate/school/prison environment, etc)

we suggest that this property is dropped

Who's "we"?

pes10k commented 4 years ago

The different protocols that are being reported can vastly differ in their performance characteristics, so I'd imagine that a proxy that wants to be undetected would mirror the protocols that it establishes with the server to the client.

I'm not sure I understand the argument here (if you could rephrase I would appreciate it), but even if the functionality just makes it more difficult for privacy-preserving software to be privacy preserving, that's harm in and of itself.

Do we have cases where we see this happening? Can those geo-restrictions distinguish "legitimate" proxies from "illegitimate" ones? ("legitimate" may be AV software, corporate/school/prison environment, etc)

If you're interested in research in this area, you might look at some of the anti-censorship work out of ICSI and Vern Paxson's lab. I'm not aware of any papers that consider proxies specifically (they may exist, I sincerely don't know), but I'm sure we are both aware of regimes that attempt to detect VPN use, bridge-node Tor use, and other privacy-focused proxies. It's also easy to find people interested in finding ways to detect proxy use, and people concerned about their proxy use being detected. I'm not claiming that these are sophisticated approaches or replies, only that they demonstrate that some sites want to detect proxy use, and lots of users don't want that to happen.

yoavweiss commented 4 years ago

The different protocols that are being reported can vastly differ in their performance characteristics, so I'd imagine that a proxy that wants to be undetected would mirror the protocols that it establishes with the server to the client.

I'm not sure I understand the argument here (if you could rephrase I would appreciate it), but even if the functionality just makes it more difficult for privacy-preserving software to be privacy preserving, that's harm in and of itself.

Let me rephrase: A proxy that is trying to make sure its usage is undetectable through the protocols it supports would need to support all the protocols that origins do. Otherwise, usage of the proxy will be detectable, regardless of the extra signal that nextHopProtocol provides.

If you're interested in research in this area, you might look at some of the anti-censorship work out of ICSI and Vern Paxson's lab. I'm not aware of any papers that consider proxies specifically (they may exist, I sincerely don't know), but I'm sure we are both aware of regimes that attempt to detect VPN use, bridge-node Tor use, and other privacy-focused proxies. It's also easy to find people interested in finding ways to detect proxy use, and people concerned about their proxy use being detected. I'm not claiming that these are sophisticated approaches or replies, only that they demonstrate that some sites want to detect proxy use, and lots of users don't want that to happen.

I have no doubt that some sites want to detect VPNs/proxies (e.g. to enforce geo restrictions). VPNs may not be related here, as they could support the same protocols (at least in some architectures). For proxies, I'd expect most detection to rely on e.g. IP addresses.

Relying on nextHopProtocol would be shaky as it could give you the "wrong" protocol in many non-proxy cases: when QUIC gets downgraded, when AV local proxies are involved, when resources are cached or proxied through a Service Worker, etc.

kdzwinel commented 4 years ago

1.

Who's "we"?

Sorry, I should have provided more context. Those three issues (#221, #222, #223) were opened as a result of a privacy review that we (@dharb, @jdorweiler and myself) did at PING.

2.

Do we have cases where we see this happening?

For reference, here is a recent case of prosecuting VPN users: https://techcrunch.com/2020/02/18/indian-police-open-case-against-hundreds-in-kashmir-for-using-vpn/ .

3.

Relying on nextHopProtocol would be shaky as it could give you the "wrong" protocol in many non-proxy cases: when QUIC gets downgraded, when AV local proxies are involved, when resources are cached or proxied through a Service Worker, etc.

Possibly, but I think that many of those have side effects of their own that can rule them out (e.g. AV blocking certain resources, SW caching being visible via workerStart).
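As a rough sketch of the workerStart point above: a script could drop entries that a Service Worker handled before looking at nextHopProtocol. The `TimingEntryLike` shape and the `excludeServiceWorkerEntries` helper are hypothetical names, and the sketch assumes workerStart is non-zero whenever a Service Worker handled the fetch.

```typescript
// Hypothetical shape: a subset of PerformanceResourceTiming fields.
interface TimingEntryLike {
  name: string;
  nextHopProtocol: string;
  workerStart: number; // assumed to be 0 when no Service Worker handled the request
}

// Keeps only entries that were not handled by a Service Worker, so their
// nextHopProtocol value is not explained by Service Worker caching or proxying.
function excludeServiceWorkerEntries<T extends TimingEntryLike>(entries: T[]): T[] {
  return entries.filter((entry) => entry.workerStart === 0);
}
```

This only rules out one of the non-proxy explanations listed above. QUIC downgrades and local AV proxies would need separate signals.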

4. I think it's important for us to understand the privacy tradeoff and, since the spec doesn't explain that, is anyone here willing to explain why we need to share nextHopProtocol information in the first place?

5. We also observed that nextHopProtocol is used as an input by the reCAPTCHA script. It'd be interesting to learn what it's being used for, if anyone here is able to provide that information.

sleevi commented 3 years ago

Some relevant past discussion about this when Blink was shipping this. This general flow (of disclosing potential proxies) has come up in other specifications too.

As mentioned back in 2016, it would be good to document the threat model and boundary here, since there are various ways in which a server can passively or actively probe for the existence of a proxy. For example, mTLS is an explicit probe, which proxies generally cannot terminate (as would be HTTP/2 Secondary Authenticators), while H/2 capabilities or, as mentioned in https://github.com/whatwg/fetch/issues/1007 , HTTP capabilities can offer implicit probes.

LPardue commented 2 years ago

I'm late to the issue and have been trying to read up on the context but may have missed things, so apologies if that is the case.

The OP states concern over the tunneled protocol information. But the true intent of tunnelling HTTP is to maintain the end-to-end protocol, whatever that might be. For instance, if a browser had configured either a SOCKS proxy, an HTTP proxy (CONNECT over plaintext), or an HTTPS proxy (CONNECT over a secure layer) and it initiates a TCP connection to the target server (the origin), the client would see the information about the HTTP version selected between Client and Target. This would be no different from the information available if the client connected directly.

The only exception to the above is if such a proxy attempted to manipulate the connection e.g. by changing the TLS handshake.

LPardue commented 2 years ago

The case where a MITM proxy is involved is a bit different from tunneling. It terminates and recreates connections by impersonation. In that case the client might not even be aware of the proxy. I don't know whether we should be spending effort to accommodate such proxies. The protocol information is important in understanding the performance numbers being retrieved from resource timing. Putting the burden of generating that information onto the server side is not free and may be hard to deploy at the same kind of scale that this API already operates at.

sleevi commented 2 years ago

But the true intent

I’m not sure of the relevance here, to the spec or implementations? If I use a SOCKS proxy, implementations expose that as the next-hop. That is, they expose the outer encapsulation, not the inner.

The only exception to the above

Or if this spec exposes the outer encapsulation, which it does. Then, with no manipulation, the properties of the connection are exposed.

It would need to be specified - and implementations updated (in ways that tend not to be readily implementable with today’s code bases - which isn’t to say it’s not doable, but that it would need both consensus and will/effort to do) - to only expose the inner.

But that still wouldn’t address what to do when connecting to an HTTP site over an HTTP proxy - the inner encapsulation information is unavailable.

sleevi commented 2 years ago

The case where a MITM proxy is involved is a bit different from tunneling.

Hopefully the above clarifies the confusion: this isn’t just about MITM proxies.

That said, however rightfully distasteful MITM proxies are, they still represent a non-trivial number of users, and unless browsers are willing to say users must trade some of their privacy to use those, a tradeoff that could both make legitimate uses problematic (e.g. fiddler/Charles proxy) and be a regression for existing users, this can’t just be ignored.

LPardue commented 2 years ago

The text says

if a tunnel connection is established then this must be the ALPN Protocol ID of the tunneled protocol

The "tunneled protocol" should always be a version of HTTP. What is the scenario where SOCKS carries HTTP requests that are not tunnelled?

I agree this applies not just to MITMs; HTTP over plaintext HTTP is a pretty unfortunate case. So I don't think we're a million miles apart in this respect, just weighing up trade-offs.

This conversation does make me wonder though about whether the domainLookupStart and domainLookupEnd properties could also leak the presence of proxies. For instance, when an HTTP client has an HTTP proxy configured that supports CONNECT, it is expected that the proxy performs the DNS resolution. Do you have any insight how browsers treat that already? (happy to spin that off into a different ticket).

As you said back in https://github.com/w3c/resource-timing/issues/221#issuecomment-838839928 better documentation of the threat model does sound like something good.
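For reference, a minimal sketch of how a script could read the DNS timing mentioned above. The `DnsTimingLike` shape and both helpers are hypothetical names. A zero duration is ambiguous by itself: it could equally mean a cached lookup, a reused connection, or timing data that was not exposed, so this illustrates the question rather than a working proxy detector.

```typescript
// Hypothetical shape: a subset of PerformanceResourceTiming fields.
interface DnsTimingLike {
  name: string;
  domainLookupStart: number; // milliseconds on the performance timeline
  domainLookupEnd: number;
}

// Duration of the DNS phase as reported by the entry, clamped to be non-negative.
function dnsLookupDurationMs(entry: DnsTimingLike): number {
  return Math.max(0, entry.domainLookupEnd - entry.domainLookupStart);
}

// Entries for which no DNS time was observed at all.
function entriesWithoutObservedLookup<T extends DnsTimingLike>(entries: T[]): T[] {
  return entries.filter((entry) => dnsLookupDurationMs(entry) === 0);
}
```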

sleevi commented 2 years ago

Do you have any insight how browsers treat that already? (happy to spin that off into a different ticket).

Chrome, at least, will still perform DNS lookups with proxies (e.g. because of SOCKSv4).

w3c / resource-timing

Consider removing nextHopProtocol as it may expose whether visitor is using VPN / proxy #221