csharrison opened this issue 1 year ago
Thanks for filing this issue @csharrison. I completely agree with you that there are multiple important and valuable use cases that a trigger-side “breakdown key” could satisfy. One that immediately comes to mind is publishers who log a “conversion funnel”, such as “add to cart”, followed by “initiate checkout”, followed by “purchase”. Aggregate statistics about the full funnel would be very useful.
I would love to hear suggestions for how you’d imagine this integrated, for example do you see this as a Cartesian product (breakdown by source-breakdown cross trigger-breakdown) or something else?
> I would love to hear suggestions for how you’d imagine this integrated, for example do you see this as a Cartesian product (breakdown by source-breakdown cross trigger-breakdown) or something else?
Unfortunately, I don't think there's necessarily a one-size-fits-all solution here. In ARA we allow some binary OR mixing plus a query with pre-specified keys. However, we are thinking about some general enhancements to this to support use cases like key discovery, where it's really not practical to enumerate all possible keys; that typically involves a thresholding step.
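For concreteness, the OR mixing mentioned above looks roughly like this (hypothetical values; in ARA each side contributes a key piece that is bitwise-ORed into the final 128-bit aggregation key):

```python
# Hypothetical key pieces for illustration only.
source_key_piece = 0x159   # e.g. encodes the campaign (source side)
trigger_key_piece = 0x400  # e.g. encodes the conversion type (trigger side)

# The final aggregation key mixes both sides with a bitwise OR.
aggregation_key = source_key_piece | trigger_key_piece  # 0x559
```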
We don't have anything published on those enhancements (yet), but the high-level idea is to have a "threshold" parameter which allows you to interpolate between the case where you always output the pre-declared keys (no thresholding; 100% recall, assuming your pre-declared space covers everything relevant) and a thresholding mechanism like this one, where you have no false positives (100% precision) but reduced recall due to thresholding.
It's also worth noting that introducing a thresholding mechanism like this one pushes us into the "approximate DP" regime with a delta term.
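A minimal sketch of that interpolation, under my own assumptions about how it might look (all names here are hypothetical; the actual mechanism is unpublished):

```python
import numpy as np

rng = np.random.default_rng()

def release_histogram(true_sums, declared_keys, threshold, noise_scale):
    # Sketch only: interpolates between "always output declared keys"
    # and "threshold undeclared keys".
    released = {}
    for key, value in true_sums.items():
        noisy = value + rng.laplace(scale=noise_scale)
        if key in declared_keys:
            # Pre-declared keys are always released: 100% recall over
            # the declared space, no thresholding needed.
            released[key] = noisy
        elif noisy >= threshold:
            # "Discovered" keys must clear the threshold; suppressing
            # the small ones is what introduces the approximate-DP delta.
            released[key] = noisy
    return released
```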
I was thinking about the privacy of the Cartesian product output of trigger and source breakdown keys. It seems to me that since in IPA we have per-user (matchkey) capping, we don’t need to limit the number of breakdown keys on either the trigger or source side. But let me know if this seems correct.
If $T$ is the set of trigger breakdown keys and $S$ is the set of source breakdown keys, then in attribution we can create the Cartesian product $T \times S$ and use those pairs as the breakdown keys for aggregation. Additionally, we could use a public mapping $T \times S \to B$ to combine some of the cross terms and then aggregate by the breakdowns in $B$. For instance:
| $T$ | $S$ | $B$ |
|---|---|---|
| $t_1$ | $s_1$ | $b_1$ |
| $t_1$ | $s_2$ | $b_1$ |
| $t_2$ | $s_1$ | $b_1$ |
| $t_2$ | $s_2$ | $b_2$ |
The natural privacy concern that arises is that a value in $T$ could be unique to a user and a value in $S$ unique to a user, so the cross term would tell you whether these were the same user. But as long as we are applying DP noise to the sum for each breakdown key, sized to a per-user cap, and enforcing per-user contribution capping, this should be fine. Even if only one user occurs in a final breakdown, they still have the intended DP protections. This means we shouldn’t need to limit the sizes of $T$ or $S$.
This is unlike ARA event-level reporting, which doesn’t cap across all of a user’s events and thus needs to limit the size of the trigger breakdown set. Otherwise, a user could have several events with a unique trigger breakdown and several events with a unique source breakdown, and there would be no way to cap that user’s contribution to the final breakdown and protect them with the DP noise.
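A minimal sketch of the capped, mapped aggregation I have in mind (illustrative names and cleartext logic only; in IPA this would run under MPC):

```python
from collections import defaultdict
import numpy as np

rng = np.random.default_rng()

# Public mapping from cross terms to final breakdowns, as in the table above.
B_MAP = {("t1", "s1"): "b1", ("t1", "s2"): "b1",
         ("t2", "s1"): "b1", ("t2", "s2"): "b2"}

def aggregate(rows, user_cap, epsilon):
    # rows: (matchkey, trigger_key, source_key, value) after attribution,
    # with nonnegative values.
    spent = defaultdict(float)  # per-user contribution so far
    sums = defaultdict(float)   # per-breakdown capped sums
    for matchkey, t, s, value in rows:
        allowed = min(value, user_cap - spent[matchkey])
        if allowed <= 0:
            continue
        spent[matchkey] += allowed
        sums[B_MAP[(t, s)]] += allowed
    # Each user contributes at most user_cap across ALL breakdowns, so the
    # L1 sensitivity is user_cap and Laplace(user_cap / epsilon) protects
    # even a breakdown containing a single user.
    return {b: v + rng.laplace(scale=user_cap / epsilon)
            for b, v in sums.items()}
```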
@bmcase I agree w/ you. The sizes of $T$ and $S$ in this model are not privacy concerns if the output space is a priori public (using $T \times S$ or $B$). The main concern is utility / computation, e.g. if the Cartesian product is huge. Imagine e.g. an ecommerce site which has ~1M campaigns and ~1M trigger values because they want to annotate which product was purchased from which campaign. A 1T-sized output space is very tough to deal with if every bucket needs to be output.
Side note: for ARA event-level, the privacy mechanism is k-ary randomized response, which is technically robust to large domains; it just causes the noise to dominate: $\lim_{k \to \infty} k/(k - 1 + e^\epsilon) = 1$.
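To make that limit concrete (the epsilon is arbitrary, chosen just for this example):

```python
import math

def flip_probability(k: int, epsilon: float) -> float:
    # Probability that k-ary randomized response ignores the true value
    # and instead samples uniformly from all k possible outputs.
    return k / (k - 1 + math.exp(epsilon))

# Noise dominates as the domain grows.
for k in (8, 2048, 10**6, 10**9):
    print(f"k = {k:>10}: flip probability = {flip_probability(k, 14.0):.4f}")
```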
> Side note: for ARA event-level, the privacy mechanism is k-ary randomized response, which is technically robust to large domains; it just causes the noise to dominate: $\lim_{k \to \infty} k/(k - 1 + e^\epsilon) = 1$.
Thanks for clarifying. That makes sense for event-level reports. What about aggregate reports in ARA? You mention the trigger breakdown space can be large there; is this a concern without user-level capping?
> Thanks for clarifying. That makes sense for event-level reports. What about aggregate reports in ARA? You mention the trigger breakdown space can be large there; is this a concern without user-level capping?
No, the output space is not a concern for aggregate ARA, for similar reasons to IPA. We ask the developer to pre-specify all possible keys prior to aggregation and only consider those keys in the privacy mechanism: https://github.com/privacysandbox/aggregation-service#collect-and-batch-aggregatable-reports
This ensures the key space is public. If we didn't ask the developer to pre-specify, we would have to output $2^{128}$ buckets to be private, which is not feasible.
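In other words, the query ships with an explicit, pre-declared output domain. A rough sketch with hypothetical key pieces (the aggregation service consumes this domain in its own file format; this just shows the idea):

```python
from itertools import product

# Pre-declared key space, enumerated up front so it is public. Bucket
# keys outside this domain are simply never released.
campaign_pieces = [0x1, 0x2, 0x3]    # source-side key pieces
conversion_pieces = [0x100, 0x200]   # trigger-side key pieces
output_domain = sorted(c | t for c, t in product(campaign_pieces, conversion_pieces))
# 6 buckets total: every query output is one of these, never 2**128.
```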
FYI: we recently published https://github.com/WICG/attribution-reporting-api/blob/main/aggregate_key_discovery.md which is relevant for this topic. It proposes a flexible solution between "declare every possible output bucket" and "use thresholding to ensure 100% precision without declaring any buckets".
It is a very common use-case to query conversion aggregates broken down by information embedded in the trigger/conversion associated with a given attributed impression. For example:
Supporting this use case will bring more value to IPA, and we can do it without impacting the privacy bar, so we should consider adding support for it in IPA.
We support this in ARA in two places:
One caveat here is that this kind of “conversion label” is sometimes a function of both source- and trigger-side data. For example, Google Ads has a “selective optimization” feature that allows for labeling a conversion as eligible for bidding optimization based on the campaign. This is a useful technique for reducing the dimensionality of the output (and thus saving privacy budget).
There are many techniques to explore to satisfy this use case, but at a high level, this requires two things:
More work would be needed to generate trigger breakdown keys as a function of source-side data. For reference on how ARA solves this, see our event explainer, whose mechanism works similarly for generating aggregatable reports. At a high level, we designed a simple DSL that allows selecting trigger data / aggregation keys based on source-side information.
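For illustration, here is roughly what that looks like in ARA's registration (shown as Python dicts rather than the actual header JSON; all values hypothetical), where the trigger contributes a key piece only when the attributed source's filter data matches:

```python
# Source registration: declares a named key piece and filter data.
source_registration = {
    "aggregation_keys": {"campaignCounts": "0x159"},
    "filter_data": {"product": ["shoes"]},
}

# Trigger registration: ORs its key piece into the matching source keys,
# gated on the source's filter data, i.e. the trigger-side breakdown is
# a function of source-side information.
trigger_registration = {
    "aggregatable_trigger_data": [{
        "key_piece": "0x400",
        "source_keys": ["campaignCounts"],
        "filters": {"product": ["shoes"]},
    }],
    "aggregatable_values": {"campaignCounts": 32768},
}
```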