patcg-individual-drafts / ipa

Interoperable Private Attribution (IPA) - A Private Measurement Proposal

Add a mechanism to support breakdowns by trigger data #55

Open · csharrison opened this issue 1 year ago

csharrison commented 1 year ago

It is a very common use case to query conversion aggregates broken down by information embedded in the trigger/conversion event associated with a given attributed impression. For example, an advertiser may want purchase counts broken down by which product was purchased, or by which stage of a conversion funnel (add to cart, initiate checkout, purchase) was reached.

Supporting this use case will bring more value to IPA, and we can do it without impacting the privacy bar, so we should consider adding support for it in IPA.

We have support for this in ARA in two places:

  1. In event-level reports, we have a “trigger data cardinality” of 2 for views and 8 for clicks. This allows event-level reports to embed 1-3 bits of trigger-side metadata.
  2. In aggregatable reports, we allow the aggregation key (similar to IPA’s breakdown key) to embed an arbitrary amount of trigger-side data.

One caveat here is that this kind of “conversion label” is sometimes a function of both source-side and trigger-side data. For example, Google Ads has a “selective optimization” feature that allows for labeling a conversion as eligible for bidding optimization based on the campaign. This is a useful technique for reducing the dimensionality of the output (and thus saving privacy budget).

There are many techniques to explore to satisfy this use case, but at a high level, this requires two things:

  1. Some mechanism to "mix" breakdown keys across contexts (e.g. binary OR, etc.)
  2. Some mechanism to query mixed breakdown keys (e.g. automatically querying the Cartesian product, callers directly specifying the output domain, some thresholding mechanism for sparse histograms, etc.); see the sketch after this list.
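
As a rough illustration (not taken from the IPA or ARA specs), here is a minimal sketch of those two pieces, assuming breakdown keys are small integers, the trigger key is shifted into its own bit range before OR-ing, and the caller pre-declares the output domain; all names here are hypothetical:

```python
# Hypothetical sketch: OR-mix source/trigger breakdown keys, then query a
# pre-declared output domain. Not taken from any spec.
from itertools import product

SOURCE_KEY_BITS = 8  # assumed split of the mixed key's bit space

def mix_keys(source_key: int, trigger_key: int) -> int:
    """Binary-OR the two keys after shifting the trigger key out of the
    source key's bit range."""
    return source_key | (trigger_key << SOURCE_KEY_BITS)

def query(attributed_rows, declared_keys):
    """Aggregate only over keys declared up front; undeclared mixed keys
    are dropped here (a thresholding mechanism could keep them instead)."""
    histogram = {k: 0 for k in declared_keys}
    for source_key, trigger_key, value in attributed_rows:
        key = mix_keys(source_key, trigger_key)
        if key in histogram:
            histogram[key] += value
    return histogram

# Example: declare the full Cartesian product of two small key spaces.
declared = {mix_keys(s, t) for s, t in product(range(4), range(2))}
rows = [(1, 0, 5), (3, 1, 2)]
print(query(rows, declared))
```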

More work would be needed to generate trigger breakdown keys as a function of source-side data. For reference on how ARA solves this, see our event explainer, whose mechanism works similarly for generating aggregatable reports. At a high level, we designed a simple DSL that allows selecting trigger data / aggregation keys based on source-side information.
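
To make that last point concrete, here is a conceptual sketch (deliberately not the actual ARA registration syntax) of selecting trigger data as a function of source-side filter data, in the spirit of the "selective optimization" example above; everything in it is hypothetical:

```python
from typing import Optional

# Conceptual sketch only; this is not the ARA DSL, just an illustration of
# picking trigger-side data based on source-side information.
def select_trigger_data(source_filter_data: dict, candidates: list) -> Optional[int]:
    """Return the trigger_data of the first candidate whose filters all
    match the source-side filter data; None means no label is recorded."""
    for candidate in candidates:
        filters = candidate.get("filters", {})
        if all(source_filter_data.get(key) in allowed for key, allowed in filters.items()):
            return candidate["trigger_data"]
    return None

# e.g. only assign the "eligible for bidding optimization" label when the
# campaign (source side) opted in; everything else falls through to 0.
candidates = [
    {"filters": {"campaign_type": ["optimized"]}, "trigger_data": 3},
    {"filters": {}, "trigger_data": 0},
]
print(select_trigger_data({"campaign_type": "optimized"}, candidates))  # -> 3
```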

benjaminsavage commented 1 year ago

Thanks for filing this issue @csharrison. I completely agree with you that there are multiple important and valuable use cases that a trigger-side “breakdown key” could satisfy. One that immediately comes to mind is publishers who log a “conversion funnel” such as “add to cart”, followed by “initiate checkout”, followed by “purchase”. Aggregate statistics about the full funnel would be very useful.

I would love to hear suggestions for how you’d imagine this being integrated. For example, do you see this as a Cartesian product (breakdown by source-breakdown crossed with trigger-breakdown) or something else?

csharrison commented 1 year ago

> I would love to hear suggestions for how you’d imagine this being integrated. For example, do you see this as a Cartesian product (breakdown by source-breakdown crossed with trigger-breakdown) or something else?

Unfortunately I don't think there's necessarily a one-size-fits-all solution here. In ARA we allow some binary OR mixing plus a query with pre-specified keys. However, we are thinking about some general enhancements to this to support use cases like key discovery, where it's really not practical to enumerate all possible keys; this typically involves a thresholding step.

We don't have anything published on this (yet), but the high-level idea is to have a "threshold" parameter which allows you to interpolate between the case where you always output the pre-declared keys (no thresholding; 100% recall assuming your pre-declared space covers everything relevant) and a thresholding mechanism like this one where you have no false positives / 100% precision, but reduced recall due to thresholding.

It's also worth noting that introducing a thresholding mechanism like this one pushes us into the "approximate DP" regime with a delta term.
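
For readers unfamiliar with that style of mechanism, here is a rough sketch of a thresholded histogram release, assuming per-user-capped contributions, Laplace noise, and a caller-chosen threshold `tau` (which, together with the noise scale, drives the delta term); parameter names are illustrative, not from any published design:

```python
import random

def laplace(scale: float) -> float:
    # Difference of two exponentials with the same rate is Laplace(0, scale).
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def release(histogram: dict, declared_keys: set, epsilon: float, cap: float, tau: float) -> dict:
    """Always release declared keys; release an undeclared key only if its
    noisy total clears the threshold tau (this is where delta comes from)."""
    scale = cap / epsilon  # sensitivity is the per-user contribution cap
    out = {}
    for key, total in histogram.items():
        noisy = total + laplace(scale)
        if key in declared_keys or noisy > tau:
            out[key] = noisy
    for key in declared_keys - histogram.keys():
        out[key] = laplace(scale)  # declared-but-empty buckets still appear
    return out
```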

bmcase commented 1 year ago

I was thinking about the privacy of outputting the Cartesian product of trigger and source breakdown keys. It seems to me that since in IPA we have per-user (matchkey) capping, we don’t need to limit the number of breakdown keys on either the trigger or source side. But let me know if this seems correct.

If $T$ is the set of trigger breakdown keys and $S$ is the set of source breakdown keys, then in attribution we can create the Cartesian product $T \times S$ and use those as the breakdown keys for aggregation. Additionally, we could use a public mapping $T \times S \to B$ to combine some of the cross terms and then aggregate by the breakdowns in $B$. For instance:

| $T$ | $S$ | $B$ |
| --- | --- | --- |
| $t_1$ | $s_1$ | $b_1$ |
| $t_1$ | $s_2$ | $b_1$ |
| $t_2$ | $s_1$ | $b_1$ |
| $t_2$ | $s_2$ | $b_2$ |

The natural privacy concern that arises is that a value in $T$ could be unique to a user and a value in $S$ unique to a user, so the cross term would tell you if these were the same user. But as long as we are applying DP noise to the sum for each breakdown key based on a per user cap and enforcing per user contribution capping, this should be fine. Even if only one user occurs in a final breakdown, they still have the intended DP protections. This means we shouldn’t need to limit the size of $T$ or $S$.
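
A small sketch of the scheme described above, assuming a per-user contribution cap, a public output domain $B$, and Laplace noise on every bucket in $B$; the helper names are illustrative:

```python
import random
from collections import defaultdict

def laplace(scale: float) -> float:
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def aggregate(attributed, mapping, output_keys, user_cap, epsilon):
    """attributed: iterable of (matchkey, trigger_key, source_key, value).
    mapping: public dict (trigger_key, source_key) -> bucket in B."""
    per_user = defaultdict(float)
    sums = {b: 0.0 for b in output_keys}  # every bucket in the public domain B is output
    for matchkey, t, s, value in attributed:
        remaining = max(0.0, user_cap - per_user[matchkey])  # per-user contribution capping
        contribution = min(value, remaining)
        per_user[matchkey] += contribution
        sums[mapping[(t, s)]] += contribution
    scale = user_cap / epsilon  # sensitivity equals the per-user cap
    return {b: total + laplace(scale) for b, total in sums.items()}
```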

This is unlike the ARA event-level reporting which doesn’t cap across all events of a user and thus needs to limit the size of the trigger breakdown set. Otherwise, you could have several events for the same user using a unique trigger breakdown and several events with a unique source breakdown and have no way to cap that user’s contribution to the final breakdown and protect them with the DP noise.

csharrison commented 1 year ago

@bmcase I agree with you. The sizes of $T$ and $S$ in this model are not concerns for privacy if the output space is a priori public (using $T \times S$ or $B$). The main concern is utility / computation, e.g. if the Cartesian product is huge. Imagine an e-commerce site which has ~1M campaigns and ~1M trigger values because it wants to annotate which product was purchased from which campaign. A ~1T-sized output space is very tough to deal with if every bucket needs to be output.

Side note: for ARA event-level, the privacy mechanism is k-ary randomized response, which is technically robust to large domains; it just causes the noise to dominate: $\lim_{k \to \infty} k/(k - 1 + e^\epsilon) = 1$.
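
For concreteness, a quick numeric illustration of that limit, assuming an example epsilon of $\ln 3$:

```python
import math

eps = math.log(3)  # assumed example epsilon
for k in (2, 8, 64, 1024, 10**6):
    p_random = k / (k - 1 + math.exp(eps))  # probability the response is uniform noise
    print(f"k={k:>7}: P[random output] = {p_random:.4f}")
# As k grows this probability approaches 1, so a huge trigger-data domain is
# still private under k-ary randomized response but the output is nearly all noise.
```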

bmcase commented 1 year ago

> Side note: for ARA event-level, the privacy mechanism is k-ary randomized response, which is technically robust to large domains; it just causes the noise to dominate: $\lim_{k \to \infty} k/(k - 1 + e^\epsilon) = 1$.

Thanks for clarifying. That makes sense for event-level reports. What about aggregatable reports in ARA? You mention the trigger breakdown space can be large there; is this a concern without user-level capping?

csharrison commented 1 year ago

> Thanks for clarifying. That makes sense for event-level reports. What about aggregatable reports in ARA? You mention the trigger breakdown space can be large there; is this a concern without user-level capping?

No, the output space is not a concern for aggregate ARA, for similar reasons to IPA. We ask the developer to pre-specify all possible keys prior to aggregation and only consider those keys in the privacy mechanism: https://github.com/privacysandbox/aggregation-service#collect-and-batch-aggregatable-reports

This ensures the key space is public. If we didn't ask the developer to pre-specify the keys, we would have to output all $2^{128}$ buckets to be private, which is not feasible.

csharrison commented 1 year ago

FYI: we recently published https://github.com/WICG/attribution-reporting-api/blob/main/aggregate_key_discovery.md which is relevant for this topic. It proposes a flexible solution between "declare every possible output bucket" and "use thresholding to ensure 100% precision without declaring any buckets".