thegreatfatzby opened 1 year ago
I agree, this is an interesting question. It's definitely one of the areas where this high-level document doesn't get into enough detail to offer an opinion one way or the other. I tried to highlight this sort of grey area when I wrote
there is room to allow sufficiently useful information to flow in a privacy-respecting way. Both "sufficiently useful" and "privacy-respecting" must be evaluated on a case-by-case basis.
In various conversations I've had over the years of Privacy Sandbox development, some people are skeptical of any system which has the ability to build a user profile based on data from many different contexts, even if that pool of data is under the user's control and can only be used for targeting inside an opaque environment:
On the technical side, some people have pointed out that if the output is used for ad selection, then an ad click inherently means some amount of leakage of information out of the environment. This makes it much harder than protecting measurement use cases, for example.
On the philosophical side, combining data across multiple contexts inherently leads to the ability to make inferences that wouldn't be possible based just on behavior in a single context. This possibility of "novel inferences" is itself a line that some people do not want to cross, even if those inferences are "just" used to pick an ad you see on your own screen. (And of course any leakage vector, including the previous bullet point, means that "just" is somewhat suspect.)
All this is a longwinded way of saying that I don't think there is consensus on the question you're asking, and the document's ambiguity reflects that.
Interesting thoughts.
On the technical side, some people have pointed out that if the output is used for ad selection, then an ad click inherently means some amount of leakage of information out of the environment. This makes it much harder than protecting measurement use cases, for example.
Could you please provide an example of when that would be an issue?
A maybe naïve assumption is that when clicking on an ad, the user expects the ad information (the product the user clicked on) to be leaked out of the environment, even if that ad information comes from different contexts.
For example, if a user, in different contexts, shows interest in healthy drinks and sport, and that user is shown an ad for a healthy sport drink, and clicks on that ad, the user would surely expect to land on a site selling healthy sport drinks.
Examples come naturally from the combination of the "novel inferences" and "click-time leakage" risks.
I think the canonical "novel inferences" example is the famous "Target knows you're pregnant" story. If the on-device ad selection model can be based on items browsed or purchases made across many unrelated sites, it facilitates picking an ad which inherently means "the model believes this person is pregnant".
The chosen ad might not be for something obviously pregnancy-related at all. If, as that NYTimes article says, Target thinks it's really valuable to get the bit of information that you are pregnant, then they could show you a good coupon for something unrelated, but with a click-through URL that says &probablyPregnant=true.
[Note that the NYTimes article is paywalled. Click here instead to get the story non-paywalled / "gifted" from my subscription... which, of course, means this link now contains tracking information!]
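To make that leak mechanism concrete, here is a minimal sketch — every name and shape here is made up for illustration, and this is not the real Protected Audience API — of how an on-device selection function could carry an inference out through the click-through URL:

```ts
// Hypothetical on-device ad selection; names and shapes are illustrative
// only, not the actual Protected Audience API.

interface CandidateAd {
  renderUrl: string;       // what the user sees
  clickThroughUrl: string; // where a click navigates to
}

interface CrossSiteSignals {
  // signals accumulated from many different contexts, held on-device
  browsedPrenatalVitamins: boolean;
  boughtUnscentedLotion: boolean;
  boughtCottonBalls: boolean;
}

// The "novel inference": a belief derived only by combining contexts.
function probablyPregnant(s: CrossSiteSignals): boolean {
  const hits = [
    s.browsedPrenatalVitamins,
    s.boughtUnscentedLotion,
    s.boughtCottonBalls,
  ].filter(Boolean).length;
  return hits >= 2;
}

// The leak: the chosen ad looks unrelated, but the inference rides along in
// the click-through URL, so a single click exports it to the advertiser.
function selectAd(signals: CrossSiteSignals): CandidateAd {
  const inferred = probablyPregnant(signals);
  return {
    renderUrl: "https://ads.example/coupon-for-coffee.png",
    clickThroughUrl:
      "https://advertiser.example/landing?campaign=coffee" +
      (inferred ? "&probablyPregnant=true" : ""),
  };
}
```

Nothing on screen hints at the inference; the single click does all the exporting.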
Ok, I see, the ad would be hiding the novel inference, in order to pass the information without the user being in a position to know that the information is being passed...
Some of what I'm chewing on here has to do with the wording of the Attestation, but I'm curious about your thoughts on the model more abstractly. So I'll say explicitly that here I'm not asking for comment on the Attestation, what it requires, etc.
So, thinking about it more the Novel Inferences thing is interesting. I think I think that the idea/concept/threat of "Novel Inference" isn't identical (no pun intended) to the idea/concept/threat of "Re-identification across Contexts". It seems like Novel Inference certainly can include a "complete re-identification" between contexts A and B (max inference), but that B learning something that is tied to your identity in A doesn't imply re-identification.
So, just definitionally, does that seem right?
I can still see how a user (including me) would want to avoid Novel Inference of particular sets of their characteristics from Context A in Context B, like the pregnancy case or if I don't want my insurance company to know about my arthritis. It would be worse if that Novel Inference (arthritis) could be exported and queried in a permanent and clear way attached to my identity in B (insurance industry)...but even if that was transient (say it prevents me from getting a better insurance quote in my stubborn browser) that is bad.
So, then two questions:
To dive in a little bit, I'd like to kick a hackysack around on the quad (or toss a frisbee, your choice) with you and ask your thoughts on what it means to "join identities" across contexts:
On one side of the line, I definitely see Persisting a Graph of Unique Identifiers to an ACID store with RAID 10 disks and global replication via Quantum Entanglement as quite clearly "joining identities": you can see the result of the join at your leisure, use it to do further ID-based lookups in different contexts, and know how to repeat it in the future.
I think on the other side of the line, I can see a transient process mixing data attached to each ID into a single list, assuming that list does not contain the IDs or any deterministic derivative of them, and that list never leaves the strongly gated process...that one is tougher. If the output of the join does not contain unique identifiers I couuulld argue you've not joined IDs, that re-identification can't happen directly (especially given k-anon output gates), and that we've just joined user data.
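Concretely, something like this toy sketch is what I'm picturing for that "other side of the line" — every name and threshold is invented, and the k-anonymity oracle is assumed rather than specified:

```ts
// Toy sketch of the "transient join": per-context records arrive keyed by a
// per-context ID, the join happens inside one short-lived function, and the
// output contains no IDs (or deterministic derivatives of them) and must
// clear a k-anonymity gate. All names and thresholds are invented.

interface PerContextRecord {
  contextId: string;   // per-site identity; never leaves this function
  interests: string[]; // data attached to that identity
}

const K_ANON_THRESHOLD = 50; // minimum crowd size before anything is emitted

function transientJoin(
  records: PerContextRecord[],
  crowdSizeFor: (interest: string) => number, // assumed k-anonymity oracle
): string[] {
  // Mix the data attached to each ID into a single list...
  const combined = new Set<string>();
  for (const r of records) {
    for (const interest of r.interests) combined.add(interest);
  }
  // ...then drop the IDs entirely and release only values that clear the
  // k-anonymity gate. Nothing ID-shaped survives past this return.
  return [...combined].filter((i) => crowdSizeFor(i) >= K_ANON_THRESHOLD);
}
```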
I think that:
Does (1) mean being able to operate on the IDs in a common process in any way? Observing the output of the join, rather than just the input? Can I assume Chrome does not want to take a position on this? :)
OK, this might be going too far from the quad, but like, what does "join", even, like, mean, man? Are we referring to:
It seems like Novel Inference certainly can include a "complete re-identification" between contexts A and B (max inference), but that B learning something that is tied to your identity in A doesn't imply re-identification.
First, I certainly agree that "B learning something that is tied to your identity in A doesn't imply re-identification." This is what I was getting at in my privacy model doc when I wrote "A per-first-party identity can only be associated with small amounts of cross-site information" as one of its big three points. Of course this is where all of the hard questions end up — just to quote my 2019 self a little more:
The fuzziness of "small amounts of information" recognizes the balancing act that browsers need to perform between user privacy and web platform usability. Potential use cases must respect the invariant that it remain hard to join identity across first parties, but subject to that limit, there is room to allow sufficiently useful information to flow in a privacy-respecting way. Both "sufficiently useful" and "privacy-respecting" must be evaluated on a case-by-case basis.
However, in the previous discussion in this issue, I was trying to use the term "novel inference" to mean something a little different: some probabilistic belief about a person that was not being made based on their behavior on any single site, but rather was made only by drawing on information from multiple contexts. That is, it's not about "B learning something that is tied to your identity in A", but rather "B learning something based on your behavior on A1, A2, A3, etc, which was not derivable from your identity on any one of the Ai alone." The fact that it was not previously tied to any of your partitioned-by-site identities is what makes it "novel".
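As a toy illustration (all numbers invented): no single site's behavior clears the belief threshold on its own, but the cross-context combination does.

```ts
// Toy numbers only: evidence contributed by behavior on each single site.
const THRESHOLD = 0.8;

const perSiteEvidence: Record<string, number> = {
  A1: 0.3,  // e.g. browsed one product category
  A2: 0.3,  // e.g. a single purchase
  A3: 0.35, // e.g. a search query
};

// No single Ai supports the belief on its own...
const anySingleSite = Object.values(perSiteEvidence).some((e) => e >= THRESHOLD); // false

// ...but the combined, cross-context evidence does: that is the "novel" part.
const combinedEvidence = Object.values(perSiteEvidence).reduce((a, b) => a + b, 0); // 0.95
const novelInference = combinedEvidence >= THRESHOLD; // true
```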
Again, this is surely a different question from that of "joining identity". We know that we don't want someone to be able to join identities across sites — once that happens, the game is lost, and the browser no longer has any hope of preserving the user's privacy or restricting the flow of information across contexts. But if identity is not joined, then the browser does have a chance to be opinionated about the cross-context flow of information. All these other questions are trying to figure out what that opinion ought to be.
As I've gotten deeper into this I've been pondering something: what would be the impact to this core privacy model if user bidding signals were:
- Kept partitioned in any untrusted or persistent environment
- Viewable and deletable by a user on their browser
- But could be viewed together in a transient process by a function in an opaque environment such as a TEE, provided the output of that process still had differential privacy and k-anonymity enforced.
I haven't had the chance to try to work through the math here (some serious cobwebs to dust off for any proof'ing), but I wonder if this would still meet the privacy model laid out here from a "happy path" perspective (meaning impact to "re-identification across contexts"), with the full understanding that any hacks on that environment would incur a worse privacy loss than if a single-partition process is hacked.
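Roughly the shape I have in mind, as an un-proofed sketch where every name, mechanism, and parameter is hypothetical:

```ts
// Un-proofed sketch of the proposal above; every name, threshold, and
// mechanism is hypothetical. One user's signals stay partitioned at rest,
// are viewed together only inside an assumed opaque/TEE-style transient
// function, and the single value that leaves is randomized for (local)
// differential privacy and then checked against a k-anonymity gate.

interface PartitionedSignals {
  context: string;                        // site the signals came from
  interestScores: Record<string, number>; // data attached to that context
}

const EPSILON = 1.0; // assumed local-DP budget for the released label

// Assumed k-anonymity oracle: "is this output value shared by >= k users?"
type KAnonCheck = (value: string) => boolean;

// Randomized response over a fixed label set: keep the true label with
// probability e^eps / (e^eps + |labels| - 1), otherwise pick another label
// uniformly at random.
function randomizedResponse(label: string, labels: string[], eps: number): string {
  const keepProb = Math.exp(eps) / (Math.exp(eps) + labels.length - 1);
  if (labels.length === 1 || Math.random() < keepProb) return label;
  const others = labels.filter((l) => l !== label);
  return others[Math.floor(Math.random() * others.length)];
}

// Runs only inside the opaque environment; its return value is the sole output.
function transientCrossContextLabel(
  signals: PartitionedSignals[],
  candidateLabels: string[],
  isKAnonymous: KAnonCheck,
): string | null {
  if (candidateLabels.length === 0) return null;

  // Transient cross-context view: combine scores from every partition.
  const combined: Record<string, number> = {};
  for (const p of signals) {
    for (const [interest, score] of Object.entries(p.interestScores)) {
      combined[interest] = (combined[interest] ?? 0) + score;
    }
  }

  // Pick the best-scoring candidate label from the combined view.
  const best = candidateLabels.reduce((a, b) =>
    (combined[a] ?? 0) >= (combined[b] ?? 0) ? a : b,
  );

  // Output gates: DP noise via randomized response, then the k-anonymity check.
  const noised = randomizedResponse(best, candidateLabels, EPSILON);
  return isKAnonymous(noised) ? noised : null;
}
```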