Cross-channel measurement risks

martinthomson commented 2 years ago

At the recent meeting @csharrison mentioned a deliberate choice in Google's proposal that has attribution only occur for the third party that generates the events. This was done to prevent some sort of attack. This was not clear to me. Can we get a clearer articulation of the risks involved with cross-channel attribution?

(Not sure which template applies here, so I erased it.)

martinthomson commented 2 years ago

My understanding of the position @csharrison espoused is approximately this.

First, assume that sites will not do a whole lot in order to adapt to any changes in how measurement operates. The cost of adaptation will be borne by those to whom sites delegate responsibilities (DSPs, SSPs, and other intermediaries). This means that sites will rely on active content from third parties (either executing with first party privileges ...ugh... or in frames) for managing the creation of information that will be used in any measurement system. It also means that the processing and maybe even interpretation of that information will be the responsibility of the same intermediaries.

The concern is that - without sites taking action to specifically privilege their chosen delegate - any actor that is able to execute content in these contexts is in a position to generate measurement events. Sure, we might build policy mechanisms that control this, but those mechanisms will not be consistently used, especially in early deployments. This includes possibly active content in creatives, which means that there are lots of people who might be in a position to "attack" the system.

Attacks here include dumping lots of fake impressions or clicks with a goal of falsely claiming credit for conversions. If events are generated through clicks on annotated links, an attacker might annotate every link with their own identity rather than an honest one. The proposed mitigation is to have events bound to the entity that caused them to be generated.

Without intervention at the site level, this seems like a losing proposition. If we assume that sites do not act and that there are adversaries with the ability to run script in the origin of the site, the game is already up.

If events are generated in frames, where they can be attributed to the owner of the frame, that defense might be more effective, at least for clicks, because the browser is in a position to ensure that clicks are genuine. Impressions and opportunities (for lift) are fundamentally much harder to defend. Note here that while the incentive structure for opportunities is different, fake events are still worth considering, as the goal of the attacker isn't necessarily to claim credit.

IPA takes a different approach in that events can be generated by any party that is present on the page, but the choice of whether to accept an event into the computation is made by the entity that is performing the analysis. Events that you generate yourself are somewhat easier to digest than those created by others, and you could certainly structure your system so that you only consider your own events, if you choose. This is not something that you can do if the browser is matching ad events with conversion events as the browser is responsible for enacting any such policy, so that would appear to be a structural advantage to IPA.
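To make the "choose which events to accept" point concrete, here is a minimal sketch of analyst-side filtering. It is not IPA's actual data format: the `Event` fields, the `reporter` label, and the `select_events` helper are all hypothetical, and it ignores the encryption and secret sharing that a real deployment would involve.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str       # "source" (impression/click) or "trigger" (conversion)
    reporter: str   # hypothetical label: the party that recorded the event
    match_key: str  # hypothetical stand-in for whatever links events together


def select_events(events, trusted_reporters):
    """Keep only events recorded by parties the analyst has chosen to trust."""
    return [e for e in events if e.reporter in trusted_reporters]


events = [
    Event("source", "dsp-a.example", "k1"),
    Event("source", "attacker.example", "k1"),   # injected by an untrusted party
    Event("trigger", "advertiser.example", "k1"),
]

# The party running the analysis decides, after the fact, whose events count.
accepted = select_events(events, {"dsp-a.example", "advertiser.example"})
```

The design point is only that the policy lives with whoever runs the query, so it can be changed after the events were collected; an in-browser design would need the equivalent list available at attribution time.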

For in-browser designs, it seems like you could have the browser apply some sort of policy, if it were possible to be confident in the provenance of events, but I'm not sure that this is always doable. Or maybe it just pushes toward using frames for every ad, which wouldn't necessarily be a bad thing (performance issues aside, that is).

csharrison commented 2 years ago

@martinthomson I think you got the gist. A few quick comments:

1. Events being generated in frames does not really solve the "configuration issue" with advertisers needing to come up with lists of delegates they trust. We don't really need frames to solve the "event attribution" problem either, we just need a trusted channel to the delegate e.g. via an HTTPS request. These kinds of requests are how the Attribution Reporting API works.
2. I need to understand more about how IPA works with the delegate system. From reading the doc, we expect sites to delegate creation of source / trigger events to third parties. It seems that, without explicit coordination between those third parties, we wouldn't necessarily get cross channel attribution, because every third party is just ingesting their own events. If we are relying on server-to-server coordination after the fact (merging data from multiple 3ps) then you also introduce another threat vector of intermediaries crafting fake data.

I agree IPA structurally makes some of this easier because you can make decisions after the fact without needing to encode it in browser policy available at attribution time. @btsavage also mentioned in the meeting that it could be possible to mitigate these issues also via after-the-fact detection (maybe at the cost of privacy budget), which is certainly better than nothing too.

martinthomson commented 2 years ago

Thanks @csharrison,

On 1, my intent was to address the provenance issue more than the configuration issue. (I agree that configuration is the best way to address this, even if we can't rely on it being uniformly implemented by sites.) If you operate a reporting system and have some confidence that the information (or at least the information generated by honest clients) was generated by a certain actor, then you are able to apply the sorts of controls you have proposed. I don't think that HTTPS is sufficient here (though it is certainly necessary as a means of ensuring authenticity when loading content into a context in the browser, and for ensuring that data is only sent where it is intended).

On 2, the intent is to have events recorded by any party present, then shared. Of course, if you are present, you can use the events you created yourself. (As we get more into the design, it seems like the events we are talking about might not change between API invocations, with the same information being presented to all parties for the same site over the course of each epoch, so having different people ask doesn't really make any difference.) As you say, if you aren't present, you don't get cross-channel attribution, so you need to rely on others to collect those events. But as noted, you choose who presents those events.

csharrison commented 2 years ago

Can you elaborate why an HTTPS response does not adequately address the provenance issue? Are you discussing here a use-case where actor A is trying to generate reports based on events caused by actor B? I think if the platform is the one which needs to know provenance, having a secure connection to the actor creating the event seems sufficient.

On 2, this makes sense to me; that was my understanding of the system from reading the doc. It seems like we will need substantial coordination and centralization of reports to get cross-channel attribution, but maybe that's OK.

martinthomson commented 2 years ago

A secure connection to an origin doesn't naturally mean that actions can be attributed to that origin. In the case where events are generated by script, there is generally only one relevant origin, which is the document origin. Even if the script originated on a different origin - using HTTPS, of course - it's the origin of the page (or frame) that matters.

It is precisely this setting - where third parties are entrusted with the ability to execute code in the page origin - that is most challenging here. If we assume that sites don't change - either to apply any policy controls we might build, or to change how third party content is isolated - then third-party content will be executed in their origin and we'll lose any provenance information.
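A toy model of that rule, as a sketch only: the function names are hypothetical and this is not browser code. It simply shows that where a script was fetched from plays no part in deciding which origin a script-generated event belongs to.

```python
from urllib.parse import urlsplit


def origin_of(url: str) -> str:
    """Return the scheme://host[:port] origin of a URL."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


def event_origin_for_script(document_url: str, script_url: str) -> str:
    """Origin that a script-generated event would be attributed to.

    script_url is deliberately unused: a script runs with the authority of
    the document (page or frame) that included it, not of its own source.
    """
    return origin_of(document_url)


# Third-party tag included directly in the page: provenance collapses to the page.
in_page = event_origin_for_script(
    "https://publisher.example/article",
    "https://cdn.adtech.example/tag.js",
)

# The same tag loaded inside the ad tech's own frame keeps its provenance.
in_frame = event_origin_for_script(
    "https://adtech.example/frame.html",
    "https://cdn.adtech.example/tag.js",
)
```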

csharrison commented 2 years ago

> In the case where events are generated by script, there is generally only one relevant origin, which is the document origin. Even if the script originated on a different origin - using HTTPS, of course - it's the origin of the page (or frame) that matters.

I agree if script is the thing generating the event (e.g. in IPA). In the Attribution Reporting API, the HTTP response itself is the "event", where the event is configured in HTTP headers. This is precisely because the existing web bundles so many third parties together in one "security context" aka frame.
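As a rough sketch of the header-based model (not the browser's actual implementation): the event is taken from the `Attribution-Reporting-Register-Source` response header, and it is bound to the origin that served the response rather than to the page that triggered the request. The `register_source` helper and its return shape are hypothetical, the header lookup is simplified to an exact-case dictionary key, and the JSON fields shown are only examples.

```python
import json
from urllib.parse import urlsplit

HEADER = "Attribution-Reporting-Register-Source"


def register_source(response_url: str, headers: dict):
    """Bind a source event to the origin that served the response.

    Returns None when the header is absent or the response was not served
    over HTTPS, since an insecure channel gives no provenance.
    """
    raw = headers.get(HEADER)
    if raw is None:
        return None
    parts = urlsplit(response_url)
    if parts.scheme != "https":
        return None
    return {
        # Provenance comes from the responding server, not the embedding page.
        "reporting_origin": f"https://{parts.netloc}",
        "config": json.loads(raw),
    }


event = register_source(
    "https://adtech.example/register-click",
    {HEADER: '{"source_event_id": "123", "destination": "https://advertiser.example"}'},
)
```

Because the page's own origin never enters the function, a script running in the publisher's origin cannot claim to be `adtech.example` without getting `adtech.example` to send the header.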

patcg / private-measurement

Cross-channel measurement risks #14