bmcase commented 1 year ago

Clarifying Google’s sparse histogram use case for PAM

I’ll open this issue here on PAM, but it is more of a question directed to @csharrison. Charlie, I’d like to clarify my understanding of the use case you are trying to solve with the sparse histograms you talked about in the PAM ad hoc call, where you expressed the need to send a very large number of adIDs to the device for the mapping table used to generate Advertiser reports.

In showing ads on the open web, there are three use cases that seem like they could be related to what you’re looking for (informed by Ben Savage’s experience with Meta’s Audience Network):

1. We want to provide the Advertiser with measurement of conversions across their ads. What we don’t think Advs want to see is a breakdown of how many conversions resulted from ads shown on each of the 10,000s of sites their ad was shown on. Rather, what they want to see is the results grouped by more actionable breakdowns, like the different creatives they have used. In this case, we don’t see the need for the mapping table sent to the device to be huge; it can just be on the order of the actual distinct creatives or ads the Adv wants to measure.
2. An Adv may actually want to see a breakdown of how many times their ad was shown on each of the 10,000s of sites for the purpose of brand safety (the Adv wants to know their ad isn’t appearing on sites they don’t want their brand associated with). But this problem can be solved without any cross-site measurement, because we don’t need to consider whether there was a conversion: the ad network can just count how many of the Adv’s impressions are shown on each site and give the Adv this report.
3. The third use case, which is where it really gets interesting, is to support an ad network that needs to understand the post-click conversion rate for different site_ad pairs. It often happens that the same ad shown on different sites leads to very different rates of deep-funnel conversions, because some sites are poorly designed and produce a lot of accidental clicks. The ad network needs to know this about each site so it can feed it into the bid model that tells the Adv how much to bid to show their ad on different sites. To calibrate this, we want a breakdown of how many conversions result from every site_ad pair. That may be too many breakdowns to get a good signal/noise ratio, so we could coarsen ad to ad_set, or even put similar sites together, so that we have enough traffic to measure.
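To make the signal/noise point in the third use case concrete, here is a rough sketch of how coarsening the breakdown changes the expected signal per bin. All numbers are hypothetical, and the sketch assumes conversions spread evenly across bins and that each bin receives noise with the same standard deviation, which is a simplification and not something PAM specifies here.

```python
def signal_to_noise(total_conversions: float, num_bins: int, noise_std: float) -> float:
    """Expected conversions per bin divided by the per-bin noise standard deviation.

    Assumes an even spread of conversions over bins and identical,
    independent noise per bin (both are simplifying assumptions).
    """
    return (total_conversions / num_bins) / noise_std


# Hypothetical numbers, chosen only for illustration.
total = 50_000      # conversions in the reporting period
noise_std = 10.0    # per-bin noise standard deviation

# 10,000 sites x 20 ads: one bin per site_ad pair.
per_site_ad = signal_to_noise(total, num_bins=10_000 * 20, noise_std=noise_std)
# Coarsen ad -> ad_set (2 ad sets).
per_site_ad_set = signal_to_noise(total, num_bins=10_000 * 2, noise_std=noise_std)
# Also group similar sites together (500 site groups).
per_group_ad_set = signal_to_noise(total, num_bins=500 * 2, noise_std=noise_std)

for name, snr in [("site x ad", per_site_ad),
                  ("site x ad_set", per_site_ad_set),
                  ("site_group x ad_set", per_group_ad_set)]:
    print(f"{name}: expected signal/noise = {snr:.3f}")
```

Under these assumptions the per-bin signal only rises above the noise once both ads and sites are grouped, which is the motivation for coarsening.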

My understanding is that you’re trying to solve something like this 3rd use case using Advertiser reports, which is why you need to ship down a set of adIDs roughly the size of the number of sites. I’ve been thinking about how you might solve this 3rd use case using PAM publisher reports instead, keeping this huge mapping off the device. Luke clarified that what we usually call “late binding” of breakdown keys to publisher reports seems reasonable in PAM. In fact, PPM has an issue about potentially supporting just this in PRIO by letting shares come with labels, so that the query simply aggregates all shares with the same label.
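As a rough illustration of "shares come with labels", the sketch below uses plain two-party additive secret sharing and sums shares per label. It is not the Prio protocol (there are no validity proofs and no noise), and the modulus, label format, and function names are all hypothetical. It also assumes the labels are visible in the clear to both helpers, which the discussion above does not settle.

```python
import secrets
from collections import defaultdict

MODULUS = 2**61 - 1  # hypothetical modulus for this sketch


def split(value: int) -> tuple[int, int]:
    """Split a value into two additive shares modulo MODULUS."""
    share_a = secrets.randbelow(MODULUS)
    share_b = (value - share_a) % MODULUS
    return share_a, share_b


def aggregate_by_label(labeled_shares):
    """Sum the shares held by one helper, grouped by label."""
    sums = defaultdict(int)
    for label, share in labeled_shares:
        sums[label] = (sums[label] + share) % MODULUS
    return dict(sums)


# Each report is (label, conversion value). The label is bound late,
# i.e. chosen by the querier rather than fixed on the device.
reports = [("siteA|set1", 1), ("siteA|set1", 0), ("siteB|set1", 1), ("siteA|set1", 1)]

helper_a, helper_b = [], []
for label, value in reports:
    a, b = split(value)
    helper_a.append((label, a))
    helper_b.append((label, b))

sums_a = aggregate_by_label(helper_a)
sums_b = aggregate_by_label(helper_b)

# Combining the two helpers' per-label sums reveals only the per-label totals.
totals = {label: (sums_a[label] + sums_b[label]) % MODULUS for label in sums_a}
print(totals)
```

Neither helper's individual sums reveal anything about a single report; only the combined per-label totals are meaningful.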

I think this can let us solve the 3rd use case in the following way:

1. Suppose an Adv is running an ad campaign with an ad network. The ad network puts the same adID on every impression, even though the impressions are displayed on many 10,000s of sites. The Adv’s conversion mapping then contains just one adID.
2. Attribution happens on the device; let’s say everything is 1-day click-through. Then, 1 day after each impression, its report is returned to the ad network. To support the late binding, each ad impression has a nonce that was placed in the impression and is returned in the report (this doesn’t leak anything cross-site). The ad network can then group these impression reports by site_ad_set, where ad_set can be coarser than the ad_id sent to the device (maybe up to the campaign level). They could also combine multiple sites into the same bin. Once they have received enough traffic for a bin, they can supply all of its reports to the MPC to aggregate and get the noisy total of conversions for that site_ad_set.
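The ad-network side of the steps above could look roughly like the following. This is only a sketch under assumptions: the report payload is treated as an opaque blob, and the function names, the site-grouping table, and the minimum-traffic threshold are hypothetical rather than part of PAM.

```python
import secrets
from collections import defaultdict

MIN_REPORTS_PER_BIN = 3  # hypothetical "enough traffic" threshold


def new_impression(impression_log: dict, site: str, ad_set: str) -> str:
    """Record which (site, ad_set) an impression belongs to and return its nonce."""
    nonce = secrets.token_hex(16)
    impression_log[nonce] = (site, ad_set)
    return nonce


def bin_reports(impression_log: dict, reports, site_groups: dict) -> dict:
    """Late-bind each returned report to a (site_group, ad_set) bin via its nonce."""
    bins = defaultdict(list)
    for nonce, opaque_payload in reports:
        if nonce not in impression_log:
            continue  # unknown nonce: not one of our impressions
        site, ad_set = impression_log[nonce]
        # Sites without an explicit group form their own bin.
        key = (site_groups.get(site, site), ad_set)
        bins[key].append(opaque_payload)
    return dict(bins)


def ready_bins(bins: dict) -> dict:
    """Keep only bins with enough reports to be worth sending for aggregation."""
    return {k: v for k, v in bins.items() if len(v) >= MIN_REPORTS_PER_BIN}


log = {}
site_groups = {"news.example": "news", "daily.example": "news"}
sites = ["news.example", "daily.example", "news.example", "blog.example"]
nonces = [new_impression(log, site, "set1") for site in sites]

# Reports come back from devices carrying the nonce and an opaque payload.
reports = [(nonce, b"opaque-report") for nonce in nonces]
bins = bin_reports(log, reports, site_groups)
ready = ready_bins(bins)
print({key: len(payloads) for key, payloads in ready.items()})
```

Because the nonce-to-site mapping stays with the ad network, the binning can be changed after the fact without shipping a larger mapping table to the device.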

Charlie, can you clarify if these are the use cases you’re trying to support or if there is a further complex use case? Luke, if you see something about this construction PAM couldn’t support please let me know.

simon-friedberger commented 1 year ago

AFAIU, most proposals intentionally disallow large numbers of IDs to prevent their use for tracking. If they are necessary for some use case, their leakage should be analyzed.

csharrison commented 1 year ago

Thanks for filing this issue, @bmcase. For the "sparse" histogram case, the prototypical use-case of publisher breakdowns (documented in https://github.com/WICG/attribution-reporting-api/issues/583) can be solved with publisher reports as you describe.

I want to emphasize two things though:

1. My more major concern in the F2F meeting was around "dense + large" histograms, rather than the truly sparse case (where I believe sketching techniques will also work). We can use this issue to discuss sparse histograms in PAM, but it is not my primary concern.
2. Publisher reports are not always an ideal solution for reporting because they may have unacceptable loss/delay trade-offs. We've heard this feedback about ARA event-level reports, which have a similar architecture: the delay is very hard to manage for the reporting use-case.

bmcase commented 1 year ago

@csharrison thanks for clarifying. I agree that delays for publisher reports are a concern.

> My more major concern in the F2F meeting was around "dense + large" histograms, rather than the truly sparse case (where I believe sketching techniques will also work).

I would think that "dense + large" should also be able to be supported through publisher reports as described above. Was there a reason besides delays that you were thinking we'd need to use Advertiser reports for the "dense + large" case?

csharrison commented 1 year ago

> I would think that "dense + large" should also be able to be supported through publisher reports as described above. Was there a reason besides delays that you were thinking we'd need to use Advertiser reports for the "dense + large" case?

Hm, that's a good question. It's kind of hard to answer given that delays are so important for the publisher reports. Even if delays were reduced and you could requery across multiple windows (as ARA event-level reports support), it might require composition / more noise. Maybe there could be a better solution here though!
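To illustrate why requerying across several windows can mean more noise: the sketch below assumes a Laplace mechanism and an even split of a fixed privacy budget under basic sequential composition. Neither assumption is stated in this thread, and the parameter values are hypothetical.

```python
def laplace_scale(sensitivity: float, total_epsilon: float, num_queries: int) -> float:
    """Laplace noise scale per query when total_epsilon is split evenly
    across num_queries (basic sequential composition)."""
    per_query_epsilon = total_epsilon / num_queries
    return sensitivity / per_query_epsilon


# Hypothetical budget and sensitivity.
one_window = laplace_scale(sensitivity=1.0, total_epsilon=1.0, num_queries=1)
three_windows = laplace_scale(sensitivity=1.0, total_epsilon=1.0, num_queries=3)

print(f"noise scale, 1 query:   {one_window:.2f}")
print(f"noise scale, 3 queries: {three_windows:.2f}")
```

Under these assumptions the noise scale per query grows linearly with the number of windows queried against the same reports.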

My impression is that for non-optimization use-cases, advertiser reports are more natural so that's what I was focusing on.

patcg-individual-drafts / private-ad-measurement

Clarifying Google’s sparse histogram use case for PAM #9
