jyasskin opened this issue 4 years ago
I've added a comment https://github.com/WICG/crash-reporting/issues/1#issuecomment-571289525 explaining why I strongly disagree with the premise of Pete's issue. I quite agree that the Threat Model should discuss this sort of API, but based on everything in the model so far, the clear conclusion is that this is not a threat.
Based on @snyderp's https://github.com/WICG/crash-reporting/issues/1#issuecomment-571300343, it sounds like the harm would be something along the lines of "unwanted use of client resources". I do think that's a real harm, but:
1) I don't see how it fits into the high-level privacy threats described in RFC 6973. If it doesn't, is it really a "privacy" harm?
2) What can we say about how to trade off this harm vs. API benefits? Something about the amount of resources should go into the decision, but the reporting APIs mentioned above seem like they'll use much smaller amounts of resources than, say, images, most JavaScript frameworks, or advertisements.

Assuming that server-chosen images are here to stay, I'm having a hard time inventing a coherent target for the threat model that excludes the requests made by the proposed reporting APIs.
I agree that there are huge problems with "unwanted use of client resources" — the current prototypical example is sites or ads that mine bitcoin in your browser. So I'm all in favor of a statement of principles that sees value in reducing client resource usage.
From that point of view, an API like https://github.com/w3c/IntersectionObserver, which saved a measurable percentage of total web browser battery use, is a great win. But I'll note that none of the PING discussion of that API has touched on this aspect; indeed, the discussion has been quite hostile to that point of view when I've raised it.
This reinforces the conclusion that the threat we're discussing is not actually one of "privacy".
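To make the resource-savings point concrete, here's a minimal sketch of the lazy-loading pattern IntersectionObserver enables (the `data-src` convention and the 0.5 threshold are illustrative choices of mine, not part of the API). Before the observer existed, pages did this by re-measuring element geometry inside scroll handlers on every frame; with it, the browser coalesces that work and wakes the page only when visibility actually changes.

```ts
// Lazy-load images without a scroll handler: the browser tells us when an
// element becomes (half) visible, instead of the page polling every frame.
const observer = new IntersectionObserver(
  (entries) => {
    for (const entry of entries) {
      if (entry.isIntersecting) {
        // Swap in the real image once the placeholder is on screen.
        (entry.target as HTMLImageElement).src =
          entry.target.getAttribute("data-src") ?? "";
        observer.unobserve(entry.target); // one-shot: stop watching it
      }
    }
  },
  { threshold: 0.5 } // fire when half the element is visible
);

document.querySelectorAll("img[data-src]").forEach((el) => observer.observe(el));
```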
There are two related, but distinct, concerns. The "unwanted use of client resources" issue is tangential.
@jyasskin RFC 6973 shouldn't have the final say on such things, but this falls clearly in "6.2. User Participation" (among others). A user visits a website to achieve a user goal. None of that is related to "I want to help the site debug its application".
Ah great, glad to get back to the core issue: "other parties can learn new things about me they can't currently learn."
What is it that these APIs allow learning about you?
I think we all agree that they let people who write code learn about how that code fares in the real world, and I guess we can just disagree about whether a user's post-OOM experience is "I couldn't accomplish my goal today, but I would still like to be able to accomplish it tomorrow." But until we relate the API to information about the user, I don't see the privacy angle.
Setting aside the other issues being discussed above and the parallel WICG issue, honest question: have you read sections 6.1 and 6.2 of RFC 6973? They do a good job of explaining (part of) the concern here.
Do you disagree that this proposal is contrary to section 6.1 (and 6.2, among others), or do you disagree that these are useful floors for thinking about privacy?
I like that RFC a lot! My reading of §6.1 in this context is:
> Data minimization can be effectuated in a number of different ways, including by limiting collection, use, disclosure, retention, identifiability, sensitivity, and access to personal data.
The kind of data we're talking about, like the "Did this crash come from an OOM?" bit, is not "personal data". The RFC definition (in §3.2) says that personal data is "Any information relating to an individual who can be identified, directly or indirectly." The reporting here doesn't provide a way to tie the report to an individual. And indeed §6.1 says
> However, the most direct application of data minimization to protocol design is limiting identifiability. Reducing the identifiability of data by using pseudonyms or no identifiers at all helps to weaken the link between an individual and his or her communications.
That's exactly what the monitoring API does, by design.
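For concreteness, here's roughly what one of these reports looks like on the wire, going by the WICG crash-reporting draft (the endpoint URL and the exact values below are made up):

```ts
// The server opts in with a response header such as:
//   Reporting-Endpoints: default="https://example.com/reports"
// and later receives a POST whose JSON body contains reports shaped like:
const exampleReport = {
  type: "crash",
  age: 42,                       // ms between the crash and the report
  url: "https://example.com/",   // the document that crashed
  user_agent: "Mozilla/5.0 ...", // the same UA string any request carries
  body: { reason: "oom" },       // the one new bit: why it crashed
};
// Note what's absent: no user ID, no stack trace, no memory contents --
// nothing that lets the server tie the report to an individual.
```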
I think you skipped over the main point! The very first item in 6.1 is "Data minimization refers to collecting, using, disclosing, and storing the minimal data necessary to perform a task". This text only makes sense if we're discussing tasks the user wants to accomplish, not tasks other parties want to accomplish (reading 6.1 alongside 6.2, which emphasizes user consent, control, and information, makes this even plainer). To understand "task" in this text as "any given task" would render it meaningless (e.g. "we're using the minimal amount of data needed to cross-site track" is not a meaningful privacy protection). Sending crash reports to parties unknown to the user is not related to the task the user intended to perform, and so does not meet the privacy principles in that RFC.
Whether or not the spec sends minimal information for a task unrelated to the goal the user is trying to perform is not relevant (at least to the concepts of privacy described in that RFC).
If the claim is "debugging the site is a task all users intend to perform", that seems… extremely unlikely, and worth explicitly asking about.
The user wants to perform a task. And indeed they tried to do so. And failed!
So the OOM failure is extremely clear evidence that the developer did not already have enough data to enable the task the user just tried to do.
You're conflating things. If I want to drive on a road, and I can't because the road is full of potholes, it's not a sign that I want to help fill potholes; it's a sign that the people maintaining the road are doing a bad job. Likewise, if I visit a site and the site is busted, it's not a sign that I want to help fix the site; it's a sign that the site builders have not finished their task. The way to distinguish the two is to ask.
Let's flesh out the analogy.
You want to drive on a road. The road is full of potholes. For most people in the world, those potholes cause their car to bump up and down, and it's fine. Your car has the precise resonance frequency so that the potholes cause it to fall apart. (Every road has such a car.)
The people who make the road have heard rumors of some cars having problems, so they want to set up a camera that watches for where cars fall apart, so they know what to fix. They want to take reasonable steps to protect privacy: the camera is built so that it cannot record license plates or driver or even car color, just where a car fell apart.
Nobody is asking you to help fill potholes, just to let the road owner look for where they do damage.
You are advocating for a switch in the glove compartment that says "Make my car visible to pothole damage monitors." That is a way to ensure that most potholes remain in place and driving is worse for everyone.
Overall, I'm trying to identify principles that this document can explain, so that API developers can apply those principles to new APIs without the PING's involvement. It's absolutely true that RFC 6973 is not the final word on all privacy principles, and we can add new principles in this document if we think it's missing some.
In RFC 6973, section 6 is about ways to mitigate privacy harms. It doesn't claim that designers need to apply it in cases where there isn't a privacy harm. However, there might be some implicit privacy harms we could extract from section 6 that the authors didn't realize needed to be explicitly listed in section 5.
So, let's look at two things that @snyderp mentioned to see if we need to add a new principle to the threat model's high-level threats section:
I think that information about my system is always potentially information about me. Every time you pick up a smidge of information, you know that much more which could help you recognize me even when I take steps to conceal other identifiable characteristics. Learning about a person's environment always helps you track them.
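To put rough numbers on that "smidge" (standard fingerprinting arithmetic, my own illustration rather than anything in these specs): a signal shared by a fraction p of the population reveals log2(1/p) bits of identifying information, and bits from independent signals add up, so even low-entropy details compound quickly.

```ts
// How many identifying bits does a signal reveal, if a fraction p of
// users share it? (Bits from independent signals are additive.)
const bitsRevealed = (p: number): number => Math.log2(1 / p);

console.log(bitsRevealed(1 / 2)); // 1 bit: one boolean fact about my system
console.log(bitsRevealed(1 / 8)); // 3 bits: a rarer environment detail
console.log(Math.log2(8e9));      // ~32.9 bits single out one of 8 billion people
```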
@michaelkleber I think this analogy might have taken on a life of its own. But if your suggestion is that pervasive car monitoring is privacy-preserving, ya dun goofed. Better to leave road/site maintainers with the responsibility for debugging their stuff, and let others volunteer info if they want to.
@jyasskin I second @tomlowenthal's comment, and I don't think focusing on the "Crash Reporting API" is the best basis to bang out PING privacy principles, but the short of it is that the API shares information about the user's experience and environment (which is unavoidably about the user) w/o user consent, knowledge or expectation, and that's a problem.
The larger issue about whether "all data is fair game for sites to collect unless it's immediately, one-hop useful for identifying the user" is compatible with user-respecting, privacy-by-default system design seems better to hash out in its own issue / in a PING call / etc.
There was some ambiguity in what I wrote about a user's environment, so I want to distinguish a couple of different kinds of facts a server might learn, to see if we can pinpoint where we disagree:
I see these as having different privacy implications:
There may be some implicit assumptions that (3) is unachievable, but it's at least achievable by routing the request through Tor.
@snyderp reported https://github.com/WICG/crash-reporting/issues/1, https://github.com/WICG/deprecation-reporting/issues/1, and https://github.com/WICG/intervention-reporting/issues/1 saying that sending debugging information to websites is a privacy harm. Whether or not that's a consensus position, this document should discuss it.