Better understanding IPA "report collector" coordination needs

alextcone commented 2 years ago

I want to deepen my understanding of report collector set up coordination requirements when there are 2 or more report collectors or report collector delegates. I suspect I'm not the only one who may have these questions. I hope to use what I learn to make suggestions for updated documentation and potentially design.

Scenario

For the sake of clarity, I'm going to use real companies names in my hopefully somewhat real-life scenario. I'm going to interchange between saying "it is my understanding" and ending sentences with "correct?" to indicate areas where IPA authors may feel obliged to affirm or correct me.

Let's say Dyson is running a campaign using GCM360 as its general advertiser ad server and Sizmek Ad Suite where it must. Dyson's agency divides the total campaign budget evenly between three DSPs:

The Trade Desk (to buy video + banner inventory)
DV360 (to buy video inventory)
Amazon (to buy product ads inventory)

The Trade Desk serves Dyson video and banner creatives (all hosted/measured in GCM360) on nytimes.com, cnn.com, wsj.com, weather.com and howtocleanstuff.net via SSPs representing those sites. It is my understanding The Trade Desk and GCM360 can call get_encrypted_match_key() just by nature of delivering ad code to the page for these sites. However given VAST is delivered as XML, there will be no way to call it on those creatives. Likely the site would need to pass the value through the video player in a macro. I don't see or expect anything stopping sites from doing that.

DV360 serves video creatives (hosted/measured in GCM360) youtube.com and cnn.com. GCM 360 and DV360 will run into the same VAST XML issues above calling get_encrypted_match_key() so will need youtube.com and cnn.com to pass its value through the video players on those sites.

Amazon serves product ad creatives (hosted/measured in Sizmek by Amazon) on amazon.com and howtocleanstuff.net. via Amazon's SSP. Again, it is my understanding Amazon can call get_encrypted_match_key() just by nature of delivering ad code to the page for these sites.

GCM360, The Trade Desk, DV360 and Amazon (including Sizmek) all have code on the Dyson landing page to log potential conversion events and therefore can call get_encrypted_match_key(), correct?

The following table lays out source and trigger report collector opportunities based on the understanding above.

	Source Collector Delegate for	Trigger Collector Delegate for
GCM360	nytimes.com, cnn.com, wsj.com, weather.com, youtube.com, howtocleanstuff.net	dyson.com
The Trade Desk	nytimes.com, cnn.com, wsj.com, weather.com, howtocleanstuff.net	dyson.com
DV360	cnn.com, youtube.com	dyson.com
Amazon	howtocleanstuff.net, amazon.com	dyson.com

Each of these entities needs to run both source and trigger fan out queries (where they can based on the table above) to understand campaign performance to understand campaign performance for Dyson, correct? Where each of these entities has both source and trigger data, they can individually select whatever helper party network for the epoch, correct?

Suppose nytimes.com, youtube.com and amazon.com are interested in running source fanout queries. There might be several entities wanting to run them. In the case of nytimes.com I suspect the SSPs who delivered ads? Presumably Google runs source fanout queries for youtube.com and Amazon for amazon.com. Do each of those entities need to be running their source reports into the same helper party network GCM360 used, who saw all of what happened on dyson.com? Additionally, whichever of these entities wants to query will also need to align on the attribution constraint ID with at least GCM360 (the advertiser's ad server for all instances outside of amazon.com), correct?

There are clearly a lot of actors wanting to understand what is happening:

Dyson and its ad buying/measuring technology providers
Each of the websites listed and their monetization technology providers

My understanding of the current design (July 29, 2022) is that pre-campaign coordination may be necessary to optimize privacy budgets, helper party network selection and Attribution Constraint IDs. I may be wrong, but I hope this somewhat detailed and reality-based scenario will help us all understand a bit more clearly where scale/adoption/utility issues exist and might be addressed.

benjaminsavage commented 2 years ago

Thanks @alextcone for filing this issue. I think working through a specific example is a great way to get more clarity about how this is intended to work =).

The Trade Desk serves Dyson video and banner creatives (all hosted/measured in GCM360) on nytimes.com, cnn.com, wsj.com, weather.com and howtocleanstuff.net via SSPs representing those sites. It is my understanding The Trade Desk and GCM360 can call get_encrypted_match_key() just by nature of delivering ad code to the page for these sites.

Basically yes. However, as @michaelkleber pointed out on the last PAT-CG call, we should probably have some kind of a "permission policy" so that nytimes.com, cnn.com, wsj.com, weather.com and howtocleanstuff.net can specify which scripts are, and are not allowed to run get_encrypted_match_key() on their respective sites. I think the way this would work, would be to run The Trade Desk's ad-tag inside of an iframe, and use an attribute like this: https://www.w3.org/TR/permissions-policy-1/#iframe-allow-attribute

However given VAST is delivered as XML, there will be no way to call it on those creatives. Likely the site would need to pass the value through the video player in a macro. I don't see or expect anything stopping sites from doing that.

I agree. When there is no way for The Trade Desk to execute scripts (e.g. we are talking about video ads), it should be possible for the app/website rendering the ad to just call get_encrypted_match_key('dyson.com') and pass this value to Dyson (or The Trade Desk, if Dyson chose to delegate to them) in one way or another. I am definitely not a VAST expert - so I don't know much about macros. Maybe that's the way this would work?

DV360 serves video creatives (hosted/measured in GCM360) youtube.com and cnn.com. GCM 360 and DV360 will run into the same VAST XML issues above calling get_encrypted_match_key() so will need youtube.com and cnn.com to pass its value through the video players on those sites.

Yup. Should be the same as above. I assume youtube.com would just call get_encrypted_match_key('dyson.com') and would send a feed of information to Dyson (or to The Trade Desk, if Dyson chose to delegate to them) - that basically told it:

I served 100,000 video-views on youtube.com. Here are the encrypted match keys for all of them. And perhaps it would send along a "breakdown_key" for each one as well - perhaps this would indicate the creative that was used - to help Dyson understand how each creative was performing.

Amazon serves product ad creatives (hosted/measured in Sizmek by Amazon) on amazon.com and howtocleanstuff.net. via Amazon's SSP. Again, it is my understanding Amazon can call get_encrypted_match_key() just by nature of delivering ad code to the page for these sites.

Yeah, Amazon would do the same thing as youtube.com, showing ad impressions, and collecting ad clicks. It would call get_encrypted_match_key('dyson.com') and send a feed of information to Dyson (or to The Trade Desk, if Dyson chose to delegate to them) - that basically tells it:

Here's all of the ads we served. For each one here's a bit indicating if it was an impression or a click, and here's the encrypted match key for each. Again - perhaps Amazon might want to send some kind of "breakdown_key" along with these as well, perhaps to distinguish between creatives, or countries or something.

GCM360, The Trade Desk, DV360 and Amazon (including Sizmek) all have code on the Dyson landing page to log potential conversion events and therefore can call get_encrypted_match_key(), correct?

Yes, that's what I'm imagining. There are two options:

Option 1: They all have a "pixel" on dyson.com (potentially running in an iframe with an "iframe-allow-attribute" that allows dyson.com to control who is / isn't able to call get_encrypted_match_key()), which allows them to make calls like:

get_encrypted_match_key('nytimes.com')
get_encrypted_match_key('cnn.com')
get_encrypted_match_key('wsj.com')
get_encrypted_match_key('weather.com')
get_encrypted_match_key('howtocleanstuff.net')
get_encrypted_match_key('youtube.com')

Option 2: Dyson just executes all of those calls to "get_encrypted_match_key()" themselves - with their own script. Then it sends a feed of information to each of these publisher websites with the following information:

There was a trigger_event on dyson.com
It had a trigger_value of X
It happened at timestamp Y
I called get_encrypted_match_key() with your app/website name, and here's the value I got.

Each of these entities needs to run both source and trigger fan out queries (where they can based on the table above) to understand campaign performance to understand campaign performance for Dyson, correct?

No, I don't think so.

I think it is only Dyson who runs a "trigger fanout query". That query would include all of the conversions on Dyson.com, and ALL of the source events across ALL of the places their ads ran. Dyson wouldn't run any "source fanout queries".

All of the other apps/websites would only run "source fanout queries". These queries would relate ads shown on their site to ALL trigger events across ALL the advertisers they work with (including Dyson, but also thousands more).

So youtube.com would run "Source Fanout Queries" to answer questions like:

Yesterday, in the United States, across all of the advertisers who paid for ads on youtube.com, how many "website purchase events" did we drive? Dyson's trigger events would comprise but a small portion of the trigger events in that query.

Note that nobody is running queries that strictly relate JUST one source-website to JUST one trigger-website. They could - but that would be an inefficient use of their privacy budget.

Where each of these entities has both source and trigger data, they can individually select whatever helper party network for the epoch, correct?

Yes, each one of these businesses has full independence to select whichever helper party network they want. They also have full independence to select whichever match key provider they want. And they are free to change these values every epoch (i.e. week).

So Dyson will just "commit" to some value for the week, and when anybody calls "get_encrypted_match_key('dyson.com')", whether that call was made by youtube.com, or amazon.com or by a script written by The Trade Desk, it'll just use whatever value Dyson committed to for that week.

Similarly, youtube.com will "commit" to some value for the week, and when their pixel that's running on dyson.com calls "get_encrypted_match_key('youtube.com')" it'll just use whatever they committed to.

Suppose nytimes.com, youtube.com and amazon.com are interested in running source fanout queries. There might be several entities wanting to run them. In the case of nytimes.com I suspect the SSPs who delivered ads? Presumably Google runs source fanout queries for youtube.com and Amazon for amazon.com. Do each of those entities need to be running their source reports into the same helper party network GCM360 used, who saw all of what happened on dyson.com? Additionally, whichever of these entities wants to query will also need to align on the attribution constraint ID with at least GCM360 (the advertiser's ad server for all instances outside of amazon.com), correct?

OK, now we are getting into territory where I have much less well formed thoughts. I would be happy to talk more about what the use-cases are in this area, and open to suggestions to extend / alter IPA to support such use-cases.

At this time, the IPA proposal is very simplistic. Here's what the current "End to end doc" says:

Each of nytimes.com, youtube.com and amazon.com has an independent privacy budget. They can decide what they want to do with it. They can either run their queries themselves - or they can delegate this responsibility to another party to do it for them.

I'm assuming youtube.com will want to manage their privacy budget themselves. In this case, its up to them to collect source and trigger events, make queries, and decide how to spend their privacy budget. It's kind of irrelvant if ads were put on youtube.com via DV360 or GCM360, neither of those entities are important here. They do not have separate privacy budgets. They do not run queries.

Let's consider howtocleanstuff.net. Let's assume that they do not want to deal with the hassle of collecting source and trigger events themselves, making IPA queries, and understanding how to optimally spend their privacy budget. They want help. So they want to delegate this responsibility to someone else. They can do that. Let's imagine there is some business who is happy to provide this service to them. Let's imagine a business who currently offers an SSP decides to specialize in "Delegated IPA support" - and howtocleanstuff.net contracts with them. Cool. They basically register somewhere that the helpers and browsers and folks can easily see that "howtocleanstuff.net currently is contracting with SSP foo". SSP foo basically just acts as a "Service Provider" here. They can receive trigger events from advertisers (like Dyson) who send these events to them instead of to howtocleanstuff.net directly. They can run "source fanout queries" which relate source events from howtocleanstuff.net to trigger events all over the internet, from thousands of advertisers. They can use "breakdown_keys" to run A/B tests. For example, perhaps howtocleanstuff.net is experimenting with a few different types of ad formats. They can use breakdown_keys to understand things like how many total conversions are being driven per format.

But note that (as of the current "IPA end-to-end" document, and perhaps we need to extend this here...) "SSP foo" is NOT able to run "source fanout queries" which combine source events from multiple apps/websites. There is some discussion about why not, and for potential future extensions in this issue https://github.com/patcg-individual-drafts/ipa/issues/10

Similarly, Dyson has a choice about how they want to run IPA queries. Either they do it themselves, or they delegate this responsibility to someone else. Let's imagine they don't want to deal with all of these details, so they delegate. Perhaps "The Trade Desk" decides to offer some new "IPA delegation service" to ad buyers. Dyson contracts with them and formally commits to delegate their IPA responsibilities to them for this epoch.

Now youtube.com and amazon.com would send information about the source events Dyson paid for to "The Trade Desk" instead of sending them directly to Dyson. The Trade Desk would batch events up, run queries on behalf of Dyson, help them decide how to optimally spend their privacy budget, maybe give them a nice UI to help them understand the relative performance of their ads across all of these sites, etc.

But again, "The Trade Desk" would just be acting as a "Service Provider" (under the current proposal), with each "trigger fanout query" it runs on behalf of Dyson only containing trigger events from dyson.com and no other websites. These queries would consume dyson.com's privacy budget.

The Trade Desk would not be running some kind of network-wide IPA queries to measure the overall performance across all the advertisers they work with, across all the sites where they display ads. If this is a problem - let's talk about it. I'm open to learn about additional use-cases, and open to suggestions on how to extend IPA. The main thing we are going to run-up-against here is the total information leakage from the system. We need to ensure there is a total limit on the information that is released about an individual each epoch.

eriktaubeneck commented 2 years ago

Thanks @alextcone. This table is super helpful, let me dig into some of the questions from it:

	Source Collector Delegate for	Trigger Collector Delegate for
GCM360	nytimes.com, cnn.com, wsj.com, weather.com, youtube.com, howtocleanstuff.net	dyson.com
The Trade Desk	nytimes.com, cnn.com, wsj.com, weather.com, howtocleanstuff.net	dyson.com
DV360	cnn.com, youtube.com	dyson.com
Amazon	howtocleanstuff.net, amazon.com	dyson.com

There are a few things that emerge from this:

dyson.com needs someway to allocate budget to GCM360 , The Trade Desk, DV360, and Amazon. This is discussed a bit in #10, and likely deserves it's own issue.
One you split your budget, we have to talk about if that can happen across different helper party networks. If dyson.com has a single budget allocated to a single report collector, it's very simple: that report collector uses a single helper party network that manages the budget. a. Our proposal here is that dyson.com will make some sort of commitment to a single helper party network, which the helper party network will validate before running queries. We are explicitly trusting all helper party networks to not run queries for report collectors who don't have that proper commitment. b. One option would be to extend that commitment beyond a single helper party network to also include their budget allocation to report collectors, and the corresponding helper party network. For example, dyson.com's commitment could look like the code block below. c. This seems most aligned with the business incentives, as it's likely the report collector (e.g., GCM360 or The Trade Desk) who will primarily handle the business relationship with the helper party network, and not the site/app (e.g., dyson.com.)
```
[
{helper_party_network: hpn1,
 report_collector: GCM360,
budget: 0.25},
{helper_party_network: hpn2,
 report_collector: The Trade Desk,
budget: 0.25},
{helper_party_network: hpn1,
 report_collector: DV360,
budget: 0.25},
{helper_party_network: hpn3,
 report_collector: Amazon,
budget: 0.25},
]
```
All of this also applies exactly the same to the source side. IPA treats all sites exactly the same, and it's up to them if they want to run (or have a delegate run for them) trigger queries, source queries, or maybe both.
Some of these report collectors may want to run queries that include many source sites and many trigger sites. #10 discusses why this is currently problematic, but how we might be able to support it (but using budgets from multiple sites.) a. The above commitment scheme works nicely here, as it wouldn't require multiple source sites to all align on a single helper party network, but instead a report collector could use a singular helper party network, which has some budget allocated from many sites.

patcg-individual-drafts / ipa

Better understanding IPA "report collector" coordination needs #6

Scenario