patcg-individual-drafts / private-ad-measurement

Privacy preserving advertising attribution
8 stars 1 forks source link

Support for multiple impression-side keys, and naturally support both count + value queries #5

Open csharrison opened 1 year ago

csharrison commented 1 year ago

Currently the design supports an impression registering a single “key” / ad id that gets mapped to a single histogram bucket. This makes certain use-cases impossible to satisfy, like supporting querying a hierarchy of impression features (see this paper which explains how Google Ads is using hierarchical queries in ARA: https://arxiv.org/pdf/2308.13510.pdf).

A simple example: you might want to measure a hierarchy of {campaign, location, impression day}. You could combine these into a single key, but this effectively “spends” all of your privacy budget on the leaves of the tree, when there are more optimal methods if you care about reporting aggregates at every level of the tree (i.e. getting an accurate campaign count, campaign x location, and campaign x location x day}.

Another use-case is if you just want two different views of the data that don’t form a hierarchy:

In ARA we support this by giving every impression an abstract “contribution budget”, which they can use to contribute across many keys. The underlying constraint is that the L1 sensitivity of the total contribution for a given impression is bounded by some large constant which noise in the aggregation layer is tuned to. Each key can get a separate fraction of the contribution budget, and so get different amounts of effective noise after aggregation.

This is tough to integrate into PAM since it requires listing out all of the Ad IDs that can convert at conversion time. An alternative is to have the conversion specify key labels along with a domain size. This allows you to more succinctly map into the aggregation key domain, especially useful if you want a “one to many” mapping.

Let me propose an alignment with ARA here, to show how this could be done (super simplified):

// Impression registration lists out label -> key mappings
{
  "aggregation_keys": {
    "geo": 0x5,
    "campaign": 0x159
  }
}

// Conversion registration allocates budget consumption to labels. Imagine the total
// budget is 100 and you want to spend more budget on the per-campaign queries.
// Internally, the labels are mapped to the keys listed out at impression reg
{
  aggregatable_values: {
    "geo": 5,
    "campaign": 95,
  },
  // Imagine we have ~100 geos and a domain that fits 400 simultaneous campaigns
  // If any keys land outside of this domain they are ignored.
  histogram_size: 500
}

The above configuration would generate a 500-element vector where index 0x5 has value 5 and index 0x159 has value 95. This is much more succinct than pre-specifying 500 different Ad IDs at conversion time.

Note that this approach is related to the approach mentioned in https://github.com/patcg-individual-drafts/private-ad-measurement/issues/3 which avoids the need for impressions to specify their maximum # of conversions. It also allows for supporting measuring both counts and values, whereas currently IIUC the current PAM design only supports either count measurement or value measurement for a single conversion. Counts in this model can be thought of as just a fixed fraction of the contribution budget (determined by the API user).

Also: this strawman does not support many features that ARA has for specifying histogram buckets (namely, that the bucket can be a function of both impression and conversion-side information). If you are generally OK exploring more use-cases we can iterate on other features along these lines.

winstrom commented 1 year ago

Thanks for the discussion. Since the comment is quite long I will attempt to answer in-line:

Currently the design supports an impression registering a single “key” / ad id that gets mapped to a single histogram bucket.

We propose mapping a single Ad ID to a single bucket. We propose supporting multiple Ad IDs being mapped to the same bucket.

This makes certain use-cases impossible to satisfy, like supporting querying a hierarchy of impression features (see this paper which explains how Google Ads is using hierarchical queries in ARA: https://arxiv.org/pdf/2308.13510.pdf).

A simple example: you might want to measure a hierarchy of {campaign, location, impression day}. You could combine these into a single key, but this effectively “spends” all of your privacy budget on the leaves of the tree, when there are more optimal methods if you care about reporting aggregates at every level of the tree (i.e. getting an accurate campaign count, campaign x location, and campaign x location x day}.

Here I think that we naturally support this use case. As I understand that paper, you would like to construct multiple histograms filled from the same data. In the example, there are 3 histograms:

  1. A histogram binned by campaign

  2. A histogram binned by the outer product of campaigns and locations

  3. A histogram binned by the outer product of campaigns, locations, and days of the week (conversion day).

The Ad ID space can span the most granular key space desired, but each individual histogram can specify its own mapping from the atomic space into its own condensed space. Histograms (1) and (2) would be identically defined for every conversion, but (3) would need slightly more logic. For example, it could be constructed by conversions on particular days of the week delivering their particular Conversion ID so that days of the week are naturally aggregated together. Histogram (1) maps every Ad ID that has the same campaign to the same bin while histograms (2) and (3) map Ad IDs that have the same campaign and location to the same bin.

Each conversion specifies the amount of noise that will be added to the final aggregation independently and can be allocated as the advertiser directs (as you point out in #3, there may be a more elegant way to do this than in our proposal).

Aggregating this over some time period will result in 9 histograms, 7 of which are restricted to one day of the week, which should support the use case described above.

Another use-case is if you just want two different views of the data that don’t form a hierarchy:

  • conversion counts by campaign
  • conversion counts by location

In ARA we support this by giving every impression an abstract “contribution budget”, which they can use to contribute across many keys. The underlying constraint is that the L1 sensitivity of the total contribution for a given impression is bounded by some large constant which noise in the aggregation layer is tuned to. Each key can get a separate fraction of the contribution budget, and so get different amounts of effective noise after aggregation.

Our proposal considers the different bins in each conversion histogram to be used for attributing credit of the conversion rather than a fraction of the overall privacy budget. This is an explicit way to allow multi-touch attribution reporting. We support splitting the budget by using multiple histograms per conversion. In the example above one would construct two histograms at conversion time. One for campaign and one for location.

This is tough to integrate into PAM since it requires listing out all of the Ad IDs that can convert at conversion time. An alternative is to have the conversion specify key labels along with a domain size. This allows you to more succinctly map into the aggregation key domain, especially useful if you want a “one to many” mapping.

There is a brief discussion in our proposal of mapping many Ad IDs per bin. Finding the best ways to support this seems like a good place for discussion. Our initial choice is to limit the information committed to at impression time. We did this for the same reasoning you articulate in #3 — that during “strategy” investigations there are potentially many impressions waiting for conversions and one may want to have different behaviors for different conversions or group advertisements into different bins.

Let me propose an alignment with ARA here, to show how this could be done (super simplified):

// Impression registration lists out label -> key mappings
{
  "aggregation_keys": {
    "geo": 0x5,
    "campaign": 0x159
  }
}

// Conversion registration allocates budget consumption to labels. Imagine the total
// budget is 100 and you want to spend more budget on the per-campaign queries.
// Internally, the labels are mapped to the keys listed out at impression reg
{
  aggregatable_values: {
    "geo": 5,
    "campaign": 95,
  },
  // Imagine we have ~100 geos and a domain that fits 400 simultaneous campaigns
  // If any keys land outside of this domain they are ignored.
  histogram_size: 500
}

The above configuration would generate a 500-element vector where index 0x5 has value 5 and index 0x159 has value 95. This is much more succinct than pre-specifying 500 different Ad IDs at conversion time.

For this particular example, the pre-specification of bin number does reduce the information required at converison time, but also requires a commitment to that information at impression time. Our choice was to allow more flexibility. I will come back to this in a moment to address a subsequent point.

Note that this approach is related to the approach mentioned in #3 which avoids the need for impressions to specify their maximum # of conversions. It also allows for supporting measuring both counts and values, whereas currently IIUC the current PAM design only supports either count measurement or value measurement for a single conversion. Counts in this model can be thought of as just a fixed fraction of the contribution budget (determined by the API user).

In our model, you are right that each histogram only supports either a count measurement or a sum measurement. However specifying multiple histograms for a particular conversion allows sums and counts to be measured and accounted for in the overall privacy budget.

Also: this strawman does not support many features that ARA has for specifying histogram buckets (namely, that the bucket can be a function of both impression and conversion-side information). If you are generally OK exploring more use-cases we can iterate on other features along these lines.

While this strawman may not support combining information between impression time and conversion time, our proposal does. Histogram definition at the time of conversion allows joining of conversion time and impression time information. This choice was made to provide flexibility. New Ad IDs need to be considered along with those already registered in browsers, an advertiser may want to have different measurement logic for repeat customers vs new customers, different products or store fronts may need to use impression data differently, etc.

csharrison commented 1 year ago

Thanks for taking the time to respond in detail Luke. I think what I was missing was that defining multiple histograms across multiple "conversion" events is a "first class" part of the PAM API and we expect that many use-cases will require this. ARA makes a different choice, and I think this is due to our API surface being HTTP-based: A single HTTP response generates a single vector of histogram contributions, and multiple invocations would requires multiple HTTP round-trips (although this is not fundamental).

I am still concerned about the scale of IDs being sent, so agree we can spend some time discussing the mitigations.