mikewest / privacy-budget

A privacy budget may not prevent identifiability for all clients #5

Open englehardt opened 5 years ago

englehardt commented 5 years ago

Consider a hypothetical API that returns the same value for 95% of users and a unique value for each of the remaining 5% of users. If we use the entropy definition of a privacy budget, calls to this API may fall within budget but still uniquely identify 5% of users. If we use a k-anonymity approach to this API, we would need to either disable it entirely (since a single call is sufficient), force the users who would return an identifying value to return the same value as the 95% of users (which may lead to breakage), or only allow a whitelist of values to be returned (which bundles all of the otherwise unique users in a single group). Have you thought through how to handle such APIs?
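To make the gap concrete, here is a rough sketch of the numbers for that hypothetical API. The population size of one million and the helper names are assumptions for illustration only; nothing here comes from the privacy budget proposal itself.

```python
import math
from collections import Counter

def shannon_entropy_bits(values):
    """Average information (in bits) revealed by one call, over all users."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def min_anonymity_set(values):
    """Size of the smallest group of users sharing a return value (the k in k-anonymity)."""
    return min(Counter(values).values())

# Assumed population: 1,000,000 users, 95% share one value, 5% are each unique.
N = 1_000_000
common = ["common"] * (N * 95 // 100)
unique = [f"unique-{i}" for i in range(N - len(common))]
population = common + unique

entropy = shannon_entropy_bits(population)   # about 1.07 bits on average
k_min = min_anonymity_set(population)        # 1: some users are fully identified
unique_user_bits = math.log2(N)              # about 19.9 bits for each unique user

print(f"average entropy: {entropy:.2f} bits, min k: {k_min}, "
      f"bits revealed for a unique user: {unique_user_bits:.1f}")
```

Under these assumptions the average cost is close to one bit, which would look cheap against most budgets, while the 5% of users in singleton groups each reveal enough to be picked out of the whole population.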

As a real-world example of an API that may fall into this category, consider the ability of a script to retrieve a device's local IP address from the RTCPeerConnection.localDescription property or the RTCPeerConnection.onicecandidate event handler. For most users this will return something low in the private network range, but for a small percentage of users this may return a globally unique IP address (such as those behind a university NAT). For the latter group, that's all that's required to track them across sites.
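As a minimal sketch of that distinction, the check below separates addresses in private or otherwise non-routable ranges from ones that could act as a global identifier. It uses Python's standard `ipaddress` module; the function name and the sample addresses are hypothetical, and the addresses are assumed to have already been extracted from the ICE candidates.

```python
import ipaddress

def is_potentially_identifying(address: str) -> bool:
    """Return True if the address is outside the private, loopback,
    and link-local ranges, i.e. it may be globally unique."""
    ip = ipaddress.ip_address(address)
    return not (ip.is_private or ip.is_loopback or ip.is_link_local)

# Hypothetical local addresses as a script might observe them.
for candidate in ["192.168.1.23", "10.0.0.5", "8.8.8.8"]:
    label = "potentially identifying" if is_potentially_identifying(candidate) else "shared range"
    print(candidate, "->", label)
```

Users in the first group share their value with many others, while a user in the second group is in an anonymity set of one.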

bslassey commented 4 years ago

Absolutely. There are several ways to do this accounting, but probably the two most straightforward are entropy and k-anonymity. Entropy accounting has the nice property of being predictable for web developers if the entropy "prices" are well known and consistent, while k-anonymity, as you have pointed out, would be a better direct measurement of identifiability for a given user. Given the trade-off, I would lean towards predictability, but I think it will be important to see what the data tells us.

A few other quick thoughts on this front:

  1. We might decide to use entropy for the budget but still keep track of k-anonymity and potentially warn identifiable users.
  2. My goal here is to prevent widespread tracking of users, which I suspect will be the case if the solution works for 90% of users. The situation that really concerns me is one where the majority of users have some API that uniquely identifies them, in which case a well-crafted script could stay under budget but still be effective at tracking users. If that were the case, I think accounting by k-anonymity would have to win out over accounting by entropy, and we would have to combat the unpredictability in other ways.
  3. If there are a few APIs with this property, there will be a strong case for mitigating them in some way.
  4. Assuming entropy is "good enough" for the current state of affairs, I suspect we'll want to try as best as possible to make the entropy of new APIs as close to the inverse of min k-anonymity as possible.
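
One way to read items 1 and 4 together is sketched below: charge a fixed entropy price against the budget, but also compute the bits revealed for the specific user from the size of their anonymity set, log2(N / k), and warn when that exceeds the budget. The function names, the 10-bit budget, and the population figures are assumptions for illustration, not part of the proposal.

```python
import math

def surprisal_bits(population_size: int, anonymity_set_size: int) -> float:
    """Bits revealed about one user who shares their value with k - 1 others."""
    return math.log2(population_size / anonymity_set_size)

def audit_call(budget_bits: float, price_bits: float,
               population_size: int, anonymity_set_size: int):
    """Return (within_budget, warn_user) for a single API call.

    within_budget uses the published entropy price (predictable for developers);
    warn_user uses the caller's actual anonymity set (accurate for that user).
    """
    within_budget = price_bits <= budget_bits
    warn_user = surprisal_bits(population_size, anonymity_set_size) > budget_bits
    return within_budget, warn_user

# Hypothetical figures: a 10-bit budget and an API priced at 1.07 bits.
typical = audit_call(10.0, 1.07, 1_000_000, 950_000)   # user in the large group
outlier = audit_call(10.0, 1.07, 1_000_000, 1)          # uniquely identified user

print("typical user:", typical)
print("outlier user:", outlier)
```

In this sketch both calls pass the entropy check, but only the outlier triggers the warning. The closer the published price is to the worst-case value log2(N / k_min), the less often the two checks disagree.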