patcg-individual-drafts / ipa

Interoperable Private Attribution (IPA) - A Private Measurement Proposal

Threat model: Privacy of data known in the clear to the report collector #77

Open schoppmp opened 1 year ago

schoppmp commented 1 year ago

A question that came up in the context of #60 is whether the threat model of IPA aims to protect data known in the clear to one of the parties (e.g., the report collector). It seems clear that cross-site data (the match key) and anything derived from it must be protected by DP, but the draft is not explicit about whether this also applies to other inputs. So are protocols that assume some of the inputs are known to all helpers (but not necessarily public) within scope? For randomized response, one could for example think of priors derived from first-party data that can then be used to improve accuracy by restricting the domain of the randomized response (as for example done here).
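To make the idea concrete, here is a rough Python sketch (the names and structure are mine, not the exact algorithm from the linked paper) of randomized response restricted to a domain derived from a first-party prior:

```python
import math
import random

def prior_restricted_rr(true_label, prior, epsilon, k=4):
    """Illustrative sketch: randomized response over a restricted domain.

    The first-party prior (a dict mapping labels to probabilities) is used
    only to shrink the output domain to the k most likely labels; the flip
    probabilities depend only on epsilon and the domain size.
    """
    # Restrict the output domain to the top-k labels under the prior.
    domain = sorted(prior, key=prior.get, reverse=True)[:k]
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + len(domain) - 1)

    if true_label in domain:
        # Report the truth with probability e^eps / (e^eps + k - 1),
        # otherwise a uniformly random other label from the domain.
        if random.random() < p_truth:
            return true_label
        return random.choice([x for x in domain if x != true_label])
    # The true label fell outside the restricted domain: respond uniformly.
    return random.choice(domain)
```

A good prior makes the restricted domain small while still containing the true label most of the time, which is where the accuracy gain comes from.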

CC @marianapr @csharrison

csharrison commented 1 year ago

A few other things that (might) be sensitive with respect to the helpers that would be good to know about:

- Query patterns
- Choice of privacy algorithm
- Total # of reports

It seems clear we can't hide the total # of reports from the helper party network, but even that could be fuzzed to some extent.
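As a purely illustrative sketch of what "fuzzed" could mean (this is not in the IPA draft): the report collector could pad each query with dummy reports that are indistinguishable from real encrypted reports, so the helper party network only ever sees a noisy count.

```python
import math
import random

def fuzz_report_count(reports, dummy_report, epsilon, cap=100):
    """Illustrative sketch: hide the exact report count by padding with dummies.

    The number of dummies is `cap` plus discrete-Laplace noise clamped to
    [-cap, cap], so it is never negative. The clamping means this gives
    (epsilon, delta)-DP on the count rather than pure epsilon-DP, and it
    only helps if dummies are indistinguishable from real encrypted reports.
    """
    q = math.exp(-epsilon)

    def geometric():
        # Geometric on {0, 1, 2, ...} with success probability 1 - q.
        return int(math.log(1.0 - random.random()) / math.log(q))

    noise = geometric() - geometric()  # discrete Laplace (difference of geometrics)
    num_dummies = cap + max(-cap, min(cap, noise))
    return reports + [dummy_report] * num_dummies
```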

bmcase commented 1 year ago

A question that came up in the context of https://github.com/patcg-individual-drafts/ipa/issues/60 is whether the threat model of IPA aims to protect data known in the clear to one of the parties (e.g., the report collector). It seems clear that cross-site data (the match key) and anything derived from it must be protected by DP, but the draft is not explicit about whether this also applies to other inputs.

In this PR (images render better here for reading) we write out more of the threat model options we are considering for this sort of non-matchkey data (specifically timestamps, breakdown keys, trigger values, and caching of ciphertexts vs. shares).

So are protocols that assume some of the inputs are known to all helpers (but not necessarily public) within scope?

Can you clarify the difference you're thinking of between "known to all helpers" vs "public"? I think as long as this is about data minimization above and beyond satisfying the core threat model, then such designs are in scope. One example of this may already be the caching of shares instead of ciphertexts, which allows all the Helpers to see the counts of matchkeys from a (user agent, domain) pair while preventing this information from being revealed to other Report Collectors (as would happen if it were public).
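As a rough illustration of why cached shares do not expose the matchkeys themselves (plain additive sharing for simplicity; the actual IPA sharing scheme differs in its details):

```python
import secrets

MODULUS = 2**64  # illustrative share space

def share_matchkey(matchkey: int, num_helpers: int = 3):
    """Split a matchkey into additive shares, one per helper.

    Any single share (or any proper subset of shares) is uniformly random,
    so a helper that caches shares can count how many it received for a
    (user agent, domain) pair without learning the matchkeys themselves.
    """
    shares = [secrets.randbelow(MODULUS) for _ in range(num_helpers - 1)]
    shares.append((matchkey - sum(shares)) % MODULUS)
    return shares

def reconstruct(shares):
    return sum(shares) % MODULUS
```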

For randomized response, one could for example think of priors derived from first-party data that can then be used to improve accuracy by restricting the domain of the randomized response (as for example done here).

Is the concern keeping these priors private from other Report Collectors, or more broadly keeping them from being published by the Helpers for query transparency? One related question I had on this paper's approach: in a malicious threat model, will we need to validate the priors, or does the DP bound still hold if they are given incorrectly and we only lose utility?

For "query patterns" and "choice of privacy algorithm" and "total # of reports", it seems the Helpers would have to see this information. I guess the question is whether to make it further public for query transparency. I don't think we've come down on a preferred approach yet: on one hand it has been suggested that making such information transparent benefits trust in the system overall, on the other hand this sort of data could be business sensitive.

csharrison commented 1 year ago

In this PR (images render better here for reading) we write out more of the threat model options we are considering for this sort of non-matchkey data (specifically timestamps, breakdown keys, trigger values, and caching of ciphertexts vs. shares).

But that PR is about protecting things from the report collectors. This issue is about protecting things owned by the report collectors. I think the threat models could be different here, but maybe you are saying that the principle of "Data minimization where there is no tradeoff" should apply here as well.

Can you clarify the difference you're thinking of between "known to all helpers" vs "public"

There may be data that is somewhat sensitive, but that the report collector trusts all of the helpers not to leak. I think "total # of reports" is the cleanest example here, where the scale of the # of reports may:

Is the concern keeping these priors private from other Report Collectors, or more broadly keeping them from being published by the Helpers for query transparency?

I think both. Note also that we aren't leaking the priors in any of these algorithms, just the output space of randomized response, which is derived from the priors. Still, that could leak sensitive user information depending on how this is implemented and whether the semantics of the buckets are known.

in a malicious threat model, will we need to validate the priors, or does the DP bound still hold if they are given incorrectly and we only lose utility?

No, the priors do not need to be validated for the DP to hold. You can think of the label DP algos that use a prior as just a fancier version of "choice of algorithm", since they only add flexibility in how the RR algorithm is performed.
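To make that concrete, here is a small numerical check (a sketch, assuming the restricted randomized response mechanism sketched earlier in the thread, and that the domain is chosen independently of the true label) that the worst-case probability ratio never exceeds e^epsilon regardless of which domain the prior produced:

```python
import math
from itertools import product

def worst_case_ratio(num_labels, domain_size, epsilon):
    """Max output-probability ratio over all label pairs and outputs.

    Labels 0..domain_size-1 are "in the domain" picked by the prior; the
    remaining labels are outside it and get a uniform response.
    """
    k = domain_size
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    p_other = 1 / (math.exp(epsilon) + k - 1)

    def prob(output, true_label):
        if true_label < k:  # true label inside the restricted domain
            return p_truth if output == true_label else p_other
        return 1 / k        # true label outside the restricted domain

    return max(
        prob(y, v1) / prob(y, v2)
        for y, v1, v2 in product(range(k), range(num_labels), range(num_labels))
    )

# worst_case_ratio(10, 4, 1.0) evaluates to e^1 (up to float error), no matter
# which 4 labels the prior happened to pick, so a bad prior only costs utility.
```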

bmcase commented 1 year ago

In this PR (images render better here for reading) we write out more of the threat model options we are considering for this sort of non-matchkey data (specifically timestamps, breakdown keys, trigger values, and caching of ciphertexts vs. shares).

But that PR is about protecting things from the report collectors. This issue is about protecting things owned by the report collectors.

I think I would describe it more as protecting things known to one report collector from other report collectors who may issue queries with those reports. But I agree you raise a question that is not addressed there about information revealed to the Helpers that Report Collectors may wish to keep confidential. Since there isn't a user-level privacy risk here, I think these would be addressed using contractual means rather than a technical solution.

I suppose long term we could get fancier if necessary and prevent the Helpers from learning more of this data. For instance, we could enable some sort of anonymous query submission, where a Report Collector proves to the Helpers that it has budget to run the query without revealing which site it is, so as to limit what the Helpers learn about the scale of data queried by different parties. But I'm not even sure a state with so little query transparency is the most desirable one.

marianapr commented 1 year ago

Another way I am phrasing this issue is the following. Report collectors (or more specifically the underlying adtechs and advertisers) have some first-party data, for example the priors for RR with prior. This first-party data is not private with respect to the entity that owns it, and there is no user privacy concern there (though it could be the case that this entity has given its users promises about how it will protect and/or share such data). So the question is whether we should be OK with designs that require the adtech/advertiser to reveal such first-party data to other parties involved in the measurement in order to obtain utility from the APIs, or whether we should design in a way that does not require such sharing. In the concrete example of RR with prior, we can choose a design that reveals the priors to the workers or one that hides the priors from the workers (in my opinion the latter is much more desirable).

I will add an agenda item for next PATCG meeting to discuss this.