A privacy budget may not prevent identifiability for all clients

mikewest / privacy-budget


Consider a hypothetical API that returns the same value for 95% of users and a unique value for each of the remaining 5%. Under the entropy definition of a privacy budget, calls to this API may fall within budget yet still uniquely identify that 5% of users. Under a k-anonymity approach, we would need to either disable the API entirely (since a single call is sufficient to identify those users), force the users who would return an identifying value to return the same value as the other 95% (which may lead to breakage), or only allow a whitelist of values to be returned (which bundles all of the otherwise unique users into a single group). Have you thought through how to handle such APIs?
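To make the gap between the two measures concrete, here is a minimal sketch in Python. The population of 1,000 users, the value names, and the helper functions `shannon_entropy_bits` and `min_k_anonymity` are assumptions made up for illustration; they are not part of any proposal in this repository.

```python
import math
from collections import Counter


def shannon_entropy_bits(values):
    """Average information (in bits) revealed by one observation of the API."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def min_k_anonymity(values):
    """Size of the smallest group of users sharing the same return value."""
    return min(Counter(values).values())


# Hypothetical population: 95% share one value, 5% each have a unique value.
population = ["common"] * 950 + [f"unique-{i}" for i in range(50)]

entropy = shannon_entropy_bits(population)
k_min = min_k_anonymity(population)

# Fraction of users whose return value is shared with nobody else.
counts = Counter(population)
unique_fraction = sum(1 for v in population if counts[v] == 1) / len(population)

print(f"entropy: {entropy:.2f} bits, min k: {k_min}, unique: {unique_fraction:.0%}")
```

Under these assumptions the average entropy comes out well under one bit, which could look cheap to a budget, while the minimum k is 1 for the 5% of users with unique values.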

As a real-world example of an API that may fall into this category, consider the ability of a script to retrieve a device's local IP address from the RTCPeerConnection.localDescription property or the RTCPeerConnection.onicecandidate event handler. For most users this will return an address in a private network range, but for a small percentage of users it may return a globally unique IP address (such as those behind a university NAT). For the latter group, that alone is enough to track them across sites.
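The distinction between the two groups can be sketched with Python's standard `ipaddress` module. The helper name `is_globally_routable` and the sample addresses are assumptions for illustration only; how a browser would actually classify or mask these addresses is not specified here.

```python
import ipaddress


def is_globally_routable(addr: str) -> bool:
    """True if the address is outside private and other reserved ranges."""
    return ipaddress.ip_address(addr).is_global


# Hypothetical local addresses a script might observe.
samples = ["192.168.1.23", "10.0.0.5", "8.8.8.8"]

for addr in samples:
    kind = "globally unique" if is_globally_routable(addr) else "private/shared"
    print(f"{addr}: {kind}")
```

Addresses in the private ranges are shared by many devices worldwide, so they reveal relatively little on their own, whereas a globally routable address can act as an identifier by itself.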

Absolutely. There are several ways to do this accounting, but probably the two most straightforward are entropy and k-anonymity. Entropy accounting has the nice property of being predictable for web developers if the entropy "prices" are well known and consistent, while k-anonymity, as you have pointed out, would be a better direct measurement of identifiability for a given user. Given the trade-off, I would lean towards predictability, but I think it will be important to see what the data tells us.

A few other quick thoughts on this front:

We might decide to use entropy for the budget but still keep track of k-anonymity and potentially warn identifiable users.
My goal here is to prevent widespread tracking of users, which I suspect we will achieve if the solution works for 90% of users. The situation that really concerns me is one where the majority of users have some API that uniquely identifies them, in which case a well-crafted script could stay under budget but still be effective at tracking users. If that were the case, I think accounting by k-anonymity would have to win out over accounting by entropy, and we would have to combat the unpredictability in other ways.
If there are a few APIs with this property, there will be a strong case for mitigating them in some way.
Assuming entropy is "good enough" for the current state of affairs, I suspect we'll want to do our best to make the entropy of new APIs as close as possible to the inverse of the minimum k-anonymity.
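As a rough sketch of the hybrid idea above (charge a fixed entropy price, but also track k-anonymity), the following Python compares a price against the worst-case information revealed for the rarest value. Reading "the inverse of min k-anonymity" as log2(N / k_min) is an assumption on my part, and `audit_api`, `price_bits`, and `k_threshold` are hypothetical names, not anything defined in this repository.

```python
import math
from collections import Counter


def audit_api(values, price_bits, k_threshold):
    """Compare a fixed entropy price with k-anonymity statistics for one API."""
    counts = Counter(values)
    n = len(values)
    k_min = min(counts.values())
    # Users whose return value is shared by fewer than k_threshold users.
    below_k = sum(c for c in counts.values() if c < k_threshold)
    return {
        "price_bits": price_bits,
        # Bits revealed about a user holding the rarest value.
        "worst_case_bits": math.log2(n / k_min),
        "k_min": k_min,
        "fraction_below_k": below_k / n,
    }


# Hypothetical population: 950 users share a value, 50 have unique values.
population = ["common"] * 950 + [f"unique-{i}" for i in range(50)]
report = audit_api(population, price_bits=0.57, k_threshold=10)

if report["fraction_below_k"] > 0:
    print(f"warn: {report['fraction_below_k']:.0%} of users fall below k")
print(report)
```

A large gap between `price_bits` and `worst_case_bits` would flag an API whose price understates how identifying it is for some users, which is where a warning or mitigation might apply.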

A privacy budget may not prevent identifiability for all clients #5