patcg-individual-drafts / ipa

Interoperable Private Attribution (IPA) - A Private Measurement Proposal

How do Privacy Budgets work in IPA? #78

Open bmcase opened 1 year ago

bmcase commented 1 year ago

How do Privacy Budgets work in IPA?

with Ben Savage and Martin Thomson

Summary:

Private Scope in IPA

Privacy budgets are a new concept for the web. In the world of third-party cookies, sites learn information with full confidence about specific individuals' activity on other sites. Applying differential privacy to what sites learn from IPA queries lets us limit how much information a site can learn about any specific individual.

The goal of IPA is that each site has a budget per epoch on the amount of information that can be learned about people who interacted with the site during that epoch. Our best approximation of a “person” in the IPA system is the matchkey, so more specifically IPA proposes that each site has a budget per epoch on the amount of information that can be learned about a given matchkey’s interaction with the site during that epoch.

At a high level, IPA allows sites to request encrypted matchkeys for source or trigger events occurring on their site. Sites can then add attributes to these reports (e.g. values to trigger events and breakdown keys to source events) and share them with other sites, also called Report Collectors. At any time, a Report Collector can take a batch of source and trigger reports and submit a query to the MPC to get an attribution measurement on these events. With each query, the Report Collector must specify how much of its per-epoch budget it wants to spend on that particular query. The Report Collector also specifies, for each query, the per-matchkey sensitivity cap to be enforced by the MPC. Together, the cap and the budget allocated to the query determine the parameters of the noise applied to the outputs, so that the information released about each matchkey is at most the query’s budget.
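As a rough illustration of how a cap and a budget could together set the noise, here is a minimal sketch that assumes the standard Laplace mechanism, where the noise scale is the sensitivity cap divided by the query's epsilon. The choice of Laplace noise and the names `laplace_scale`, `sample_laplace`, and `noisy_total` are assumptions made for this example; the description above does not fix a particular noise distribution.

```python
import random

def laplace_scale(sensitivity_cap, query_epsilon):
    """Scale b of Laplace noise for a given per-matchkey cap and query budget."""
    if sensitivity_cap <= 0 or query_epsilon <= 0:
        raise ValueError("cap and epsilon must be positive")
    return sensitivity_cap / query_epsilon

def sample_laplace(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is distributed as Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_total(per_matchkey_values, sensitivity_cap, query_epsilon, rng):
    """Sum per-matchkey contributions, each clamped to [0, cap], then add noise."""
    # Clamping means one matchkey can change the sum by at most the cap.
    capped = [min(max(v, 0.0), sensitivity_cap) for v in per_matchkey_values]
    scale = laplace_scale(sensitivity_cap, query_epsilon)
    return sum(capped) + sample_laplace(scale, rng)
```

Under this assumption, spending less budget on a query (a smaller epsilon) or raising the cap both increase the noise scale, which is the trade-off a Report Collector would be making when choosing the two parameters.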

Queries for different types of Report Collectors

There are different types of Report Collectors who will need to submit IPA queries. See our What is a “Report Collector?” explainer for more details, but the main classes we are working to support are Self-Attributing Publishers, Self-Attributing Advertisers, Ad Networks, and MMPs.

Source fan-out queries for Self-Attributing Publishers

A Self-Attributing Publisher is a site that runs its own ads and collects source reports for them. It also collects trigger reports from Advertiser Websites/Apps. IPA enables these sites to submit source fan-out queries, which consist of source reports from only that source site along with trigger reports from any number of Advertiser sites.

The budget to be spent on a source fan-out query is deducted from the source site’s per-epoch budgets but not from the budgets of the trigger sites. More specifically, if a source fan-out query has source reports from multiple epochs, each of those epochs’ budgets for the source site is reduced by the amount to be spent on that query.

The Helper Parties, who run the MPC queries, are also responsible for enforcing the privacy budgets. They check several things about each submitted query. Recall that each encrypted matchkey carries authenticated associated data that records the site that requested the encrypted matchkey, the epoch when it was requested, and whether it was requested for a source or trigger event.

For source fan-out queries, the Helper Parties check that

  1. All the source reports in the query correspond to the Report Collector that is submitting the source fan-out query.
  2. All the source reports in the query come from the set of epochs specified in the source fan-out query.
  3. The Report Collector has available budget for every epoch that was specified in the query.
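The three checks above, followed by the per-epoch deduction, can be sketched as follows. This is a simplified model under stated assumptions, not the Helper Parties' actual implementation: the `Report` fields mirror the authenticated associated data described earlier, while the `ledger` dictionary and the function name are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Report:
    site: str        # site that requested the encrypted matchkey
    epoch: int       # epoch when it was requested
    event_type: str  # "source" or "trigger"

def check_and_spend_source_fan_out(collector_site, query_epochs, query_budget,
                                   reports, ledger):
    """Return True and deduct the budget if the query passes all checks.

    ledger is a hypothetical dict mapping (site, epoch) -> remaining budget.
    """
    epochs = set(query_epochs)
    for report in reports:
        if report.event_type != "source":
            continue  # trigger reports may come from any number of sites
        if report.site != collector_site:
            return False  # check 1: source reports belong to the collector
        if report.epoch not in epochs:
            return False  # check 2: source reports fall in the stated epochs
    # Check 3: budget must be available in every specified epoch.
    if any(ledger.get((collector_site, e), 0.0) < query_budget for e in epochs):
        return False
    # Deduct the same amount from each epoch's budget for the source site.
    for e in epochs:
        ledger[(collector_site, e)] -= query_budget
    return True
```

Note that the ledger is only modified once every check has passed, so a rejected query spends nothing.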

[Figure: source-fan-out]

Trigger fan-out queries for Self-Attributing Advertisers

A Self-Attributing Advertiser is an advertiser site that is large enough to perform its own ad-measurement in-house. They collect source reports from the publishers they buy ads from. IPA enables these trigger sites to submit trigger fan-out queries, which consist of trigger reports from only that trigger site along with source reports from any number of publisher sites or ad networks.

The budget to be spent on a trigger fan-out query is deducted from the trigger site’s per-epoch budgets but not from the budgets of the source sites. If a trigger fan-out query has trigger reports from multiple epochs, each of those epochs’ budgets for the trigger site is reduced by the amount to be spent on that query. In practice, trigger fan-out queries will likely include reports from only the most recent one or two epochs; source fan-out queries might look back several epochs for longer attribution windows.

For trigger fan-out queries, the Helper Parties check that

  1. All the trigger reports in the query correspond to the Report Collector that is submitting the trigger fan-out query.
  2. All the trigger reports in the query come from the set of epochs specified in the trigger fan-out query.
  3. The Report Collector has available budget for every epoch that was specified in the query.

[Figure: trigger-fan-out]

Queries for MMPs

“Mobile Measurement Partners” (MMPs) are another example of a current “Report Collector”. They help advertisers perform conversion attribution queries across multiple publishers / ad networks, and can perform cross-publisher attribution (including multi-touch attribution). In IPA, MMPs run trigger fan-out queries on behalf of Advertiser Apps / Websites. This is nearly identical to the case of Self-Attributing Advertisers, with the only difference being that the responsibility for running queries has been delegated to the MMP. The Advertiser App / Website authorizes the MMP to submit IPA queries on its behalf and spend its privacy budget.

An MMP that is a service provider for many Advertisers won’t be able to combine budgets from multiple advertisers. It will see and spend from the budget of each trigger site separately, with a separate trigger fan-out query for each.
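To make the "no combining" point concrete, here is a small sketch of delegated spending in which each trigger site's budget is tracked separately. The `ledger` and `delegations` structures and the function name `spend_delegated` are hypothetical and only illustrate the accounting described above.

```python
def spend_delegated(ledger, delegations, submitter, trigger_site, epoch, amount):
    """Spend `amount` from a single trigger site's budget for one query.

    ledger: dict mapping (site, epoch) -> remaining budget
    delegations: dict mapping site -> set of parties allowed to query for it
    """
    allowed = delegations.get(trigger_site, set())
    if submitter != trigger_site and submitter not in allowed:
        return False  # submitter was never authorized by this trigger site
    remaining = ledger.get((trigger_site, epoch), 0.0)
    if remaining < amount:
        return False  # other advertisers' budgets are never pooled in
    ledger[(trigger_site, epoch)] = remaining - amount
    return True
```

In this model an MMP serving two advertisers that each have a budget of 1.0 can run a 1.0 query for each of them, but cannot run a single 1.5 query against either one.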

[Figure: MMP]

Queries for Ad Networks

Ad Networks show ads across a large number of publisher apps / websites on behalf of many Advertiser apps / websites. They will need to collect reports about source and trigger events in order to submit IPA queries.

[Figure: AdNetworks]

We are still exploring what the best options are for supporting privacy budgets for Ad Networks. We are considering two design proposals right now but would be open to additional constructions that would give good privacy protections for end-users.

  1. Design Proposal 1: Ad Networks have to work within the above structure of budgets being given to sites but have the additional ability to run source fan-out queries that contain source reports from many source sites. In order to run with multiple source sites in a query, budgets of all the included source sites would need to be spent.
  2. Design Proposal 2: Ad Networks themselves get budgets allocated to them and sites make commitments to only allow a certain set of Ad Networks to run queries with reports generated on their site.

Design Proposal 1 (no custom support added for Ad Networks)

In this proposal, Ad Networks have to work (for the most part) within the earlier constraints of running source and trigger fan-out queries on behalf of the websites they work with. However, in the earlier Self-Attributing Publisher setting, the helpers verified that all source events originated from the same source site. This is not possible for Ad Networks, as their impressions are shown across many sites. To support source fan-out queries involving source events shown across many source sites, we could add support for source queries across multiple sites and simply deduct from the privacy budgets of all included sites.

Since sites generally work with many Ad Networks, this would lead to sites needing to delegate partial amounts of their budget to the different Ad Networks they work with. How might sites delegate their budgets?

  1. One idea would be to delegate in proportion to the number of ads shown from a particular Ad Network. However, an Ad Network that showed only one ad on a site would get a tiny fraction of that site's budget, and it wouldn't be able to include that report in a query with reports from sites that delegated larger budgets. (Since the query deducts equally from all budgets, the minimum budget available becomes the query budget.)
  2. A second idea would be to standardize a maximum number of Ad Networks a site would work with and then assign them all the same budget from that site. This would make it easier for Ad Networks to run source fan-out queries across many sites.
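The difference between the two delegation ideas can be seen with a little arithmetic. The sketch below is illustrative only; the function names and the example numbers are made up, and it assumes budgets are simple scalars that a multi-site query deducts equally from every included site.

```python
def delegate_proportionally(site_budget, ads_shown_by_network):
    """Idea 1: split a site's budget in proportion to ads shown per network."""
    total_ads = sum(ads_shown_by_network.values())
    return {network: site_budget * count / total_ads
            for network, count in ads_shown_by_network.items()}

def delegate_equally(site_budget, networks, max_networks):
    """Idea 2: every network gets the same share, fixed by a standard maximum."""
    if len(networks) > max_networks:
        raise ValueError("site works with more networks than the standard allows")
    return {network: site_budget / max_networks for network in networks}

def max_query_budget(delegated_by_site):
    """A multi-site query deducts equally from all included sites, so the
    smallest delegated budget caps what the query can spend."""
    return min(delegated_by_site.values())
```

For example, if one site shows 99 ads from network A and 1 ad from network B, proportional delegation of a budget of 1.0 gives B only 0.01 on that site, and any query by B that includes that site is capped at 0.01 regardless of how much other sites delegated. With equal shares and a standard maximum of four networks, every site delegates 0.25 and the cap is the same everywhere.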

[Figure: Design1]

In summary, partitioning privacy budgets across multiple Ad Networks would be very complex to manage. In the worst case, it could push the ecosystem towards consolidation.

Design Proposal 2 (separate budgets for Ad Networks)

We consider an additional way of supporting Ad Network budgets. Instead of fixing the budget for a site and letting that be delegated towards Ad Networks, we have considered the idea of allowing each source site to delegate to a limited number of ad networks who would each have a constant-sized, cross-web privacy budget.

In this design, the total privacy loss is proportional to the number of ad networks that the user is exposed to rather than the number of sites they visit. For a user who visits relatively few sites, this could be worse, but for users who visit a modest number of sites, the set of ad networks they are exposed to could be smaller than the number of sites. The privacy loss in that case would be reduced and might not increase further as the user visits more sites (assuming those sites have delegated to the same set of ad networks the user previously encountered).
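The comparison in this paragraph can be sketched numerically. The following is a simplified worst-case model, assuming losses simply add up across sites (Design 1) or across ad networks (Design 2); the function names and the per-site and per-network epsilon values are hypothetical.

```python
def loss_design_1(sites_visited, eps_per_site):
    """Worst-case loss when every visited site has its own budget."""
    return len(set(sites_visited)) * eps_per_site

def loss_design_2(sites_visited, networks_by_site, eps_per_network):
    """Worst-case loss when each ad network has one cross-web budget."""
    exposed = set()
    for site in set(sites_visited):
        exposed |= set(networks_by_site[site])
    return len(exposed) * eps_per_network

# Ten sites that all delegate to the same three ad networks.
networks_by_site = {f"site{i}": {"netA", "netB", "netC"} for i in range(10)}
few_sites = ["site0"]
many_sites = list(networks_by_site)
```

With an epsilon of 1.0 for both kinds of budget, a user visiting only `site0` loses 1.0 under Design 1 but 3.0 under Design 2, while a user visiting all ten sites loses 10.0 under Design 1 and still 3.0 under Design 2.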

Assumptions:

This proposal would essentially reduce the Ad Network case to the same situation as the Self-Attributing publisher case:

In order to implement this second design, we would need the browser to bind the source reports to the ad network that is displaying the ad on the publisher’s website, in addition to the publisher’s site. To do this we would need the following:

  1. Sites would commit to using a particular set of Ad Networks in a way the Helper Parties can verify.
  2. The getencryptedmatchkey() API will have an additional boolean parameter, delegated, which if false will tell the browser to bind this report to only the top-level domain. If true, the report will be bound to both the top-level domain as well as the current (frame) context (here we assume the ads being shown by Ad Networks are in iframes that correspond to a domain operated by that Ad Network).
  3. When the Ad Network calls the getencryptedmatchkey() API in the iframe of the ad they will supply the delegated parameter as true and the browser will create the report and bind it to both the site and the Ad Network.
    1. If the Ad Network calls the API with false, then they get back a report bound to the top-level site. Since this top-level site is one which has decided to delegate queries, any report bound only to this site will be rejected by the Helpers and never leak any information about the user.
  4. When the Ad Network submits a query with this report, the Helper Parties will verify that the site has committed to delegating to this Ad Network.
    1. This way if reports are transmitted out of band to an Ad Network the site has not committed to delegate to, that network will not be able to use it.
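The binding and verification steps above can be modeled roughly as follows. This is a sketch under assumptions: `BoundReport`, `bind_report`, and `helper_accepts` are hypothetical names, the commitment registry is modeled as a plain dictionary, and the handling of sites that have made no commitment is a guess that simply falls back to the per-site rule.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class BoundReport:
    top_level_site: str
    ad_network: Optional[str]  # None when the API was called with delegated=False

def bind_report(top_level_site, frame_site, delegated):
    """Model of the browser binding a report when the matchkey API is called."""
    return BoundReport(top_level_site, frame_site if delegated else None)

def helper_accepts(report, submitter, commitments):
    """Model of the Helper Parties' check when `submitter` uses `report`.

    commitments: dict mapping site -> set of Ad Networks it committed to.
    """
    committed = commitments.get(report.top_level_site, set())
    if committed:
        # The site delegates: only reports bound to a committed Ad Network,
        # submitted by that same Ad Network, are usable.
        return (report.ad_network is not None
                and report.ad_network == submitter
                and submitter in committed)
    # Assumption: a site with no commitments only queries its own
    # site-bound reports.
    return report.ad_network is None and submitter == report.top_level_site
```

In this model a report created with `delegated` set to false on a delegating site is rejected for every submitter, and a report handed out of band to an uncommitted network is rejected because the submitter does not match the binding.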

Comparison of Ad Network Designs

The following figure illustrates a comparison of privacy budgets between the two designs.

[Figure: design_comparison]

Here is a table that compares the two main designs considered so far.

[Table image: table2]

Open Questions for discussion:

bmcase commented 1 year ago

Here are the slides we presented on this issue at the June PATCG.

bmcase commented 1 year ago

I'd like to follow up on the discussion we had a couple weeks ago on this issue; we got good feedback that we'll need to map out Ad Networks in another level of detail with the sorts of queries that each party (e.g. SSPs, DSPs) would care to run. Several folks @alextcone, @AramZS, @tgreasby, @csharrison had some thoughts on how to do this and at least a couple were willing volunteers to help.

Can I suggest that we start async and that anyone willing to take a stab at writing out their understanding of what different parties would want can post to this issue? If we need to iterate with much feedback we can also get into a doc, but more likely we can put this on the next PATCG agenda for end of July to continue discussing.

A couple considerations towards the solutions I'd like to mention:

  1. For queries that involve only one Advertiser's data (maybe a DSP needing to measure ad performance?), these are probably fairly straightforward if we can spend just the trigger site's budget in a (delegated) trigger fan-out query.
  2. For queries that involve some intermediary party that can't be verified by the browser as having been a domain involved in showing the ad (a DSP needing to do bid optimization?), it might be tricky to support them having their own budget (as in design option 2), as the device won't have a way to bind the report to their budget on-device.

tgreasby commented 1 year ago

I think these are the main parties using the data today for measurement and optimization (loosely based on the lumascape). I am sure I missed some of the players that need the data as well so everyone should feel free to chime in.

Buy side: Advertisers, Ad Agencies (advertisers likely use more than one), Ad Servers, Measurement companies (e.g., MTA vendors), DSPs

Sell side: SSPs, Publishers

I am less familiar with the sell side. What did I miss?

alextcone-google commented 1 year ago

On the buy side there's also the agency (or independent) trading desk (and trader).

On the sell side there are publisher ad servers. That said, like SSPs, often they are not set up to optimize to trigger events coming from advertiser sites.