
Interoperable Private Attribution (IPA) - A Private Measurement Proposal

Privacy budgets #3

Open martinthomson opened 2 years ago

martinthomson commented 2 years ago

Any application of differential privacy is dependent on understanding the sensitivity of the system. If we are going to provide some defensible definition of the privacy properties of IPA - at least in terms of our chosen rubric of differential privacy - we need a solid understanding of the scope of information the system ingests so that we might understand what a contribution from a single person might look like.

The goal from the outset is to reach some sort of conclusion about the total information release about each user, over each epoch (read: week), to each site (advertiser/publisher/etc...) that uses the API, for all the queries that might be made (both trigger and source queries). There are multiple dimensions to this, with each having different considerations.

We're looking for input on a general strategy of bounding contributions from individuals so that we might use a more generous $\epsilon$ value when applying noise to results. As we add dimensions along which we cannot limit contributions directly, we want to see if there are some reasonable assumptions we might make regarding total contribution so that we don't have to shave the overall $\epsilon$ too finely and increase noise.

Multiple Queries and Breakdowns

The noise added by DP protections can effectively be reduced if independent noise is added to multiple queries over the same data, because the results can be averaged.

The first major split in IPA queries is between source and trigger queries. Each site is able to make each type of query independently, with source queries only containing source events for that site and trigger queries only containing trigger events for that site. This immediately adds a factor of 2 to the information release or sensitivity of the system.

Multiple queries of the same type are another major feature. Sites need to be able to query the same data in different ways. The current design aims to manage the risk by providing sites with a fixed $\epsilon_e$ budget that can be expended over multiple queries. This is mostly effective in the sense that each query can only expend a finite amount of budget, and sites making queries can manage how much noise they can tolerate on each one.

This budget cannot be infinitely finely sliced. Given the nature of the system, we are probably looking at something less than ideal $(\epsilon, \delta=0)$-differential privacy. We likely need a $\delta$ value that is non-zero. That means that there will need to be finite limits on queries to avoid the overall value of $\delta$ growing very large over multiple queries.
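To make the accounting concrete, here is a minimal sketch (not taken from the IPA design itself) of how a per-site, per-epoch budget might be expended under basic sequential composition, where both $\epsilon$ and $\delta$ add up across queries; the class and parameter values are purely illustrative.

```python
# Minimal sketch: tracking a per-site, per-epoch privacy budget under basic
# sequential composition, where epsilon and delta both accumulate per query.
class EpochBudget:
    def __init__(self, epsilon_e: float, delta_e: float):
        self.epsilon_remaining = epsilon_e
        self.delta_remaining = delta_e

    def try_spend(self, epsilon: float, delta: float) -> bool:
        """Admit a query only if it fits within the remaining budget."""
        if epsilon <= self.epsilon_remaining and delta <= self.delta_remaining:
            self.epsilon_remaining -= epsilon
            self.delta_remaining -= delta
            return True
        return False

budget = EpochBudget(epsilon_e=1.0, delta_e=1e-7)
print(budget.try_spend(0.25, 1e-8))  # True: first breakdown query fits
print(budget.try_spend(0.9, 1e-8))   # False: only 0.75 epsilon remains
```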

Multiple breakdowns over the same events look very much like separate queries from this perspective, so while we might get some greater assurances when it comes to accountability (and accounting!), for analysis purposes we can consider them to be multiple queries.

This approach basically assumes that every user is present in every query. From a privacy perspective, this means that it forces careful limiting of queries and ensures that the total query count is reflective of the privacy cost. There's a big caveat there for the grouping construct, but I'll take that up on a different issue.

We have discussed using geolocation (country/jurisdiction) as a sharding key that would allow a site to effectively split their $\epsilon_e$ budget so that they can make independent queries. That too can be taken up on a different issue.

Overall, the way the design handles multiple queries seems manageable.

Users and Devices

Though IPA attempts to provide cross-device/cross-platform attribution, nothing forces a site to use a functional match key provider that would enable this. In the extreme case, a user might contribute events from multiple devices in a way that is known to a site, but not visible to IPA itself for use within budgeting.

Here, we might simply need to make some assumptions about the number of devices on which a user might be present.

Note here that it is probably not sufficient to assume that activity will be restricted to a subset of a user's devices. If we consider an adversary that is motivated to recover more information about a user than we wish to reveal, then they can use the API to generate information for any device that sees any use. IPA only requires that sites are visited before an event can be generated. So any assumption needs to capture the number of devices that see use in any given epoch.

Sites

A more difficult aspect of this is that each site makes its own queries independently. This is a critical usability feature of the design: we don't want a situation where the queries one site makes cause another site to be able to make fewer queries. So each site operates entirely independently.

This makes sense from a functional perspective, but as a practical matter sites are able to join user identities in a number of ways. The use of primary identifiers (email, phone numbers, federated identity, even just usernames) makes it possible for sites to link user identity across sites. Sites that have a shared understanding of user identity can potentially use that to amplify the information IPA provides. Multiple sites that make queries for the same users relative to a target site might be able to reduce the effective noise in results, allowing them to learn something about those users on the target site. Think of this as being able to average out the differential privacy noise through multiple queries, except that the multiple queries are split between sites and therefore not obviously accountable.
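As a rough illustration of the averaging concern (with made-up numbers), suppose $k$ coordinating sites each obtain an independently noised copy of the same quantity; averaging their results reduces the effective noise standard deviation by a factor of $\sqrt{k}$.

```python
# Illustration with assumed parameters: k coordinating sites each run a query
# over the same users with independent Gaussian noise of stddev sigma.
import random
import statistics

true_value = 1000.0
sigma = 50.0
k = 16  # number of coordinating sites querying the same users

# Each site sees the same true value plus independent noise.
estimates = [true_value + random.gauss(0.0, sigma) for _ in range(k)]
averaged = statistics.mean(estimates)

print(f"single-query noise stddev: {sigma:.1f}")
print(f"effective stddev after averaging {k} queries: {sigma / k ** 0.5:.1f}")
print(f"averaged estimate: {averaged:.1f}")
```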

Time

A site is going to want to make queries over multiple epochs. No useful system can be built on a premise of an absolutely finite bound on the information release.

From a privacy perspective, the continuous release of information has always been the most difficult thing to come to terms with. It means that a site that dedicates its use of the measurement system to learning about particular users can - given sufficient time - overcome any differential privacy protections we apply.

As a practical matter, it might be best to simply accept and acknowledge this limitation. We can then concentrate our efforts on describing the time period over which information is released and on choosing an overall target $\epsilon$ for a period of a year, six months, or whatever time period we deem appropriate. That value can then be translated into an equivalent value for any arbitrary period. Though any time period can be chosen, using a benchmark period will ensure that we can talk about it without constantly qualifying it.

Combining Different Dimensions

This means we have the following factors to consider:

  • Number of devices for the same user $n_d$
  • Number of query types (source and trigger) $n_t$
  • Number of potentially coordinating sites $n_s$
  • Number of epochs over which queries are made $n_e$

As far as I can see, each of these is independent. So - aside from queries, where we have a means of direct control - we might reasonably assign each a value and multiply them out to produce a single scaling factor. Then we can decide a global $\epsilon_G$ target value over which this is split. Then - if we go with Gaussian noise and can scale on $L^2$ sensitivity - the budget for a site over each week might then be something like:

$$ \epsilon_e = \frac{\epsilon_G}{\sqrt{n_d n_t n_s n_e}} $$
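To give a feel for the arithmetic, here is a worked example of this split; all the values below are placeholders for illustration, not proposed parameters.

```python
# Worked example of the split above with purely illustrative values:
# 3 devices per user, 2 query types, 5 coordinating sites, 52 weekly epochs.
from math import sqrt

epsilon_G = 10.0                   # illustrative global target, not a proposal
n_d, n_t, n_s, n_e = 3, 2, 5, 52   # devices, query types, sites, epochs

epsilon_e = epsilon_G / sqrt(n_d * n_t * n_s * n_e)
print(f"per-site, per-epoch budget: epsilon_e = {epsilon_e:.3f}")  # ~0.253
```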

Effect of (Poor) Assumptions

Any assumptions we make will inevitably not hold universally. It is also important to understand how privacy might be degraded if assumptions are violated.

To take some extreme examples, we might assume that users will use 100 devices, which is high enough that very few people would ever find that to be a problem. If we scaled $\epsilon$ accordingly, the extra privacy cost to a user with 101 devices might be insignificant.

On the other hand, an assumption of $n_d=2$ devices might give us a more generous $\epsilon_e$, but this limit might be exceeded very often. Although using a large number of devices is uncommon, some people have quite a few more than two. So, while users with 8 or more devices are a small minority, it might not be acceptable for them to suffer privacy loss that is double (or more) what our baseline assumptions allow for.

Any assumptions we make will inevitably be violated at times, but because the budget accrues over time, we have a means of converting a violation of assumptions into something more meaningful. The increased information exposure that results from exceeding an assumed limit might then be expressed as a reduction in the time it takes to reach the global $\epsilon_G$ target. The inverse is also true: if fewer than the assumed number are used, that time extends.
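A sketch of that conversion, under the simplifying (and not necessarily correct) assumption that privacy loss accrues linearly with both epochs and the ratio of actual to assumed devices:

```python
# Hedged illustration: convert a violated device assumption into a change in
# the time it takes to reach the global target, assuming linear accrual.
def epochs_to_reach_target(nominal_epochs: int, assumed_devices: int,
                           actual_devices: int) -> float:
    """Nominal horizon shrinks when a user exceeds the device assumption."""
    return nominal_epochs * assumed_devices / actual_devices

# Assuming n_d = 2 and a 52-epoch (one year) nominal horizon:
print(epochs_to_reach_target(52, assumed_devices=2, actual_devices=4))  # 26.0
print(epochs_to_reach_target(52, assumed_devices=2, actual_devices=1))  # 104.0
```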

Feedback Needed

Is this a reasonably comprehensive survey of the options? Is it reasonable to make some assumptions about sensitivity rather than describe absolutes?

Is this composition approach reasonable? Are these properly independent and without any means to exploit correlations to improve the per-epoch $\epsilon_e$ value?

Are there values that we might use for these assumptions? What principles might you apply in choosing those numbers?

Are there ways we can deal with the common cases where far fewer than the assumed limit is used? (My sense is that the answer is "no", but no harm in asking.)

benjaminsavage commented 2 years ago
  • Number of devices for the same user $n_d$

I agree with your analysis that in the worst case an adversary can select a match key provider which does not provide any cross-device user linkage. So we cannot count on this happening all the time.

That said, in practice different match key providers will provide differing levels of cross-device user linkage, leading to potentially more conservative per-user contribution capping.

Do you think it's possible that we could attempt to somehow measure the varying protection each match key provider provides, and modulate the amount of random noise we have to add accordingly?

Imagine that somehow we had histograms showing us how many devices the P25, P50, P75, P90, P95 user had according to match key providers 1, 2, 3, 4 and 5. Imagine that match key provider 1 has double the number of "devices per user" compared to match key provider 5, all across that histogram.

Would it not make sense in this case to add less random noise to queries which utilize match key provider 1 than to queries which utilize match key provider 5?
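One way this could be sketched (every number, provider name, and the scaling rule itself are hypothetical): devices that a provider fails to link under one match key behave like additional "users", so they drive the effective $n_d$ used for noise calibration, and a provider with better linkage ends up needing less added noise.

```python
# Hypothetical sketch: derive an effective n_d per match key provider from
# how many of a user's devices that provider is able to link together.
assumed_total_devices = 4.0  # hypothetical P95 devices per user in the population

# Hypothetical P95 "devices per user" each provider links under one match key.
linked_devices_p95 = {"provider_1": 4.0, "provider_5": 2.0}

for name, linked in linked_devices_p95.items():
    # Devices that escape a provider's linkage behave like separate users,
    # so they drive the effective n_d used when calibrating noise.
    effective_n_d = max(1.0, assumed_total_devices - linked + 1.0)
    print(f"{name}: effective n_d = {effective_n_d}")

# provider_1: effective n_d = 1.0  -> least noise needed
# provider_5: effective n_d = 3.0  -> more noise needed
```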

  • Number of potentially coordinating sites $n_s$

In Issue #10, you proposed the following idea:

What we could do is have the delegation to a report collector give the report collector the ability to include more than one site in queries. The result would be that the budget for those sites is pooled.

It seems to me that, to the extent multiple (potentially colluding) sites decide they want to explicitly work together to share a pooled privacy budget, we could (somewhat) relax our estimate of $n_s$ and wind up adding a bit less random noise.

This would provide a nice incentive for sites which potentially can link user identity to transparently operate as a single entity on a single privacy budget. That feels like a good outcome to me.

csharrison commented 1 year ago

“Privacy budgets” is a really broad issue so let me know if this should be separated out. This comment outlines some of the thinking I expressed at TPAC and our PATCG face to face.

I think we should consider enforcing differential privacy at multiple levels simultaneously. This issue is primarily focused so far on what the “tightest” enforcement should look like, but I’d also like to introduce auxiliary enforcement with narrower scopes. For instance, bounding the privacy loss of any one particular conversion. The primary purpose would be to allow making stronger privacy statements regarding our protection of individual user actions, vs. a user’s total aggregate contributions.

If we support different vendors applying different total privacy budgets, this could be used to keep privacy loss tight in the narrow dimension. For example, vendors A and B could apply the same event-level privacy budgets, but B could allow 2x more privacy loss for the holistic privacy unit.

While it doesn’t improve worst case privacy loss, it does protect specific events better when they are generated by an honest caller (e.g., not duplicated), which is meaningful.
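A minimal sketch of what simultaneous enforcement at two scopes might look like, assuming (hypothetically) a fixed per-event cap alongside a holistic per-user budget; the class and values are illustrative only.

```python
# Sketch of dual-scope enforcement: a query must fit within both a per-event
# cap and the remaining holistic per-user budget before it is admitted.
class DualBudget:
    def __init__(self, event_epsilon: float, user_epsilon: float):
        self.event_epsilon = event_epsilon   # narrow, per-event cap
        self.user_remaining = user_epsilon   # holistic per-user budget

    def try_spend(self, query_epsilon: float) -> bool:
        if query_epsilon > self.event_epsilon:
            return False  # violates the event-scoped bound
        if query_epsilon > self.user_remaining:
            return False  # would exhaust the holistic budget
        self.user_remaining -= query_epsilon
        return True

# Vendors A and B apply the same event-level cap, but B allows 2x more
# total privacy loss for the holistic privacy unit.
vendor_a = DualBudget(event_epsilon=0.5, user_epsilon=4.0)
vendor_b = DualBudget(event_epsilon=0.5, user_epsilon=8.0)
print(vendor_a.try_spend(0.4), vendor_b.try_spend(0.4))  # True True
```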