privacycg / private-click-measurement

Private Click Measurement
https://privacycg.github.io/private-click-measurement/

Ads / No-ads measurement (a.k.a. lift measurement) #64

Open benjaminsavage opened 3 years ago

benjaminsavage commented 3 years ago

Hi @johnwilander,

Just filing an issue to further discuss the idea we were just chatting about in the web-adv business group.

To recap: in the PCM spec there is a key-value pair: "source_engagement_type" : "click",

You mentioned this is there to enable possible future extensions.

I'd like to propose a new feature (that would utilize much of the PCM infrastructure), to enable an important ads use-case: "Ads / no-ads measurement"

I'd propose adding a new value for source_engagement_type, that would be opportunity.

The report JSON would look extremely similar:

{
  "source_engagement_type" : "opportunity",
  "source_site" : "social.example",
  "source_id" : [1-bit source ID],
  "attributed_on_site" : "shop.example",
  "attribution_trigger_data" : [4-bit trigger data],
  "version": 1
}

This particular use-case only requires 1 bit of source_id: 1 would indicate "ads were shown" and 0 would indicate "ads were not shown".

I imagine it would be similar to Extended SKAdNetwork in that the website showing ads would call an API to indicate that an "opportunity" had happened. As with Extended SKAdNetwork, I wouldn't expect the browser to validate anything (e.g. whether there was a viewable impression). It would simply register that this call had been made.

function registerOpportunity(attributed_on_site, source_id)

For opportunities where an ad was actually shown, the website would call: registerOpportunity("shop.example", 1)

And for the "no ads" group the website would call: registerOpportunity("shop.example", 0)
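A minimal sketch of the bookkeeping a browser might do behind such a call. Everything here is hypothetical and not part of the PCM spec: the `OpportunityStore` class, the explicit `sourceSite` argument (which a real browser would take from the calling context), and the choice to keep one opportunity per source/destination pair with later calls overwriting earlier ones.

```javascript
// Hypothetical browser-side bookkeeping; not part of the PCM spec.
class OpportunityStore {
  constructor() {
    // Key: "<sourceSite>|<attributedOnSite>" -> { sourceId, registeredAt }
    this.opportunities = new Map();
  }

  // Mirrors the proposed registerOpportunity(attributed_on_site, source_id).
  // sourceSite is passed explicitly here only to keep the sketch self-contained.
  registerOpportunity(sourceSite, attributedOnSite, sourceId, now = Date.now()) {
    if (sourceId !== 0 && sourceId !== 1) {
      throw new RangeError("source_id must be a single bit: 0 or 1");
    }
    // No viewability check: the store only records that the call was made.
    // Assumption: a later call for the same pair overwrites the earlier one.
    this.opportunities.set(`${sourceSite}|${attributedOnSite}`, {
      sourceId,
      registeredAt: now,
    });
  }

  lookup(sourceSite, attributedOnSite) {
    return this.opportunities.get(`${sourceSite}|${attributedOnSite}`);
  }
}

const store = new OpportunityStore();
store.registerOpportunity("social.example", "shop.example", 1); // "ads" group
store.registerOpportunity("social.example", "news.example", 0); // "no ads" group
```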

When a conversion event is fired by a website, in a similar way to Extended SKAdNetwork, if the conversion was not attributed to a click, the browser could fall back to checking if there was an "opportunity" to which it could be attributed. If there was an opportunity, it would match the conversion to that opportunity and schedule a delayed, anonymized report.
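The fallback order described above could be sketched roughly as follows. The function name and the shape of the `clicks` and `opportunities` arrays are assumptions made for illustration, and the delayed, anonymized sending of the report is not modelled here; the returned object just follows the report JSON proposed earlier.

```javascript
// Hypothetical sketch: attribute a conversion to a click first,
// and only fall back to an opportunity when no click matches.
function attributeConversion(conversion, clicks, opportunities) {
  const { site, triggerData } = conversion;
  if (!Number.isInteger(triggerData) || triggerData < 0 || triggerData > 15) {
    throw new RangeError("attribution_trigger_data must fit in 4 bits");
  }
  const matches = (entry) => entry.attributedOnSite === site;
  const click = clicks.find(matches);
  const source = click || opportunities.find(matches);
  if (!source) return null; // nothing to attribute, so no report

  return {
    source_engagement_type: click ? "click" : "opportunity",
    source_site: source.sourceSite,
    source_id: source.sourceId,
    attributed_on_site: site,
    attribution_trigger_data: triggerData,
    version: 1,
  };
}

// Usage: no click on record, so the conversion falls back to the opportunity.
const exampleReport = attributeConversion(
  { site: "shop.example", triggerData: 3 },
  [],
  [{ sourceSite: "social.example", sourceId: 0, attributedOnSite: "shop.example" }]
);
```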

So the website could then generate a report for advertisers to help them measure the difference in the number of conversions between the "ads" and "no ads" groups.

  1. For the "ads" group, they would need to add together the click-through and opportunity-only reports.
  2. For the "no-ads" group, they would only have opportunity-only reports.
  3. The 4-bit trigger data could be used to measure the (coarsely bucketed) purchase value. The sum of these values would also be reported for the "ads" and "no ads" group. The sum-of-squares of these purchase values is also important for computing the statistical significance of the report.
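To make the three points above concrete, here is a rough sketch of how a source_site might aggregate the reports it receives. The function names, the `bucketValue` mapping from 4-bit trigger data to a purchase value, and the use of a simple z-score are all assumptions for illustration. The group sizes (everyone assigned to each group, converters or not) are taken to be known to the source_site, and any noise the browser adds to reports is ignored.

```javascript
// Hypothetical aggregation on the source_site's side; names are illustrative.
function summarizeReports(reports, bucketValue) {
  const groups = {
    ads: { conversions: 0, sum: 0, sumSq: 0 },
    noAds: { conversions: 0, sum: 0, sumSq: 0 },
  };
  for (const report of reports) {
    // A click implies an ad was shown, so click-through reports count as "ads".
    const sawAds =
      report.source_engagement_type === "click" || report.source_id === 1;
    const group = sawAds ? groups.ads : groups.noAds;
    const value = bucketValue(report.attribution_trigger_data);
    group.conversions += 1;
    group.sum += value;
    group.sumSq += value * value;
  }
  return groups;
}

// Mean and sample variance of purchase value per person in a group.
// Non-converters contribute a value of 0, so only sum and sumSq are needed.
function meanAndVariance(group, groupSize) {
  const mean = group.sum / groupSize;
  const variance = (group.sumSq - groupSize * mean * mean) / (groupSize - 1);
  return { mean, variance };
}

// z-score for the difference in mean purchase value between the two groups.
function liftZScore(summary, adsSize, noAdsSize) {
  const a = meanAndVariance(summary.ads, adsSize);
  const b = meanAndVariance(summary.noAds, noAdsSize);
  const standardError = Math.sqrt(a.variance / adsSize + b.variance / noAdsSize);
  return (a.mean - b.mean) / standardError;
}
```

This is why the sum-of-squares matters: without it the variance term, and therefore the standard error of the difference, cannot be computed from aggregate totals alone.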

Since the browser is not validating anything about these "opportunities" it should be possible to use this system to validate that the "ads" and "no-ads" groups are being selected in a balanced, unbiased way. The website operator could just show ads to neither group, and validate that there is no statistically significant difference in the number of conversions. Any consistent difference would indicate some error on their part.
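One way to run that sanity check is a pooled two-proportion z-test on conversion rates. This is a sketch under assumptions: the function names and the 1.96 threshold (roughly a 5% two-sided significance level) are illustrative choices, not anything proposed in this thread, and the group sizes are assumed to be known to the website operator.

```javascript
// Hypothetical A/A check: with no ads shown to either group,
// conversion rates should not differ significantly.
function conversionRateZScore(conversionsA, sizeA, conversionsB, sizeB) {
  const rateA = conversionsA / sizeA;
  const rateB = conversionsB / sizeB;
  // Pooled rate under the null hypothesis that both groups behave the same.
  const pooled = (conversionsA + conversionsB) / (sizeA + sizeB);
  const standardError = Math.sqrt(
    pooled * (1 - pooled) * (1 / sizeA + 1 / sizeB)
  );
  return (rateA - rateB) / standardError;
}

// Assumption: |z| < 1.96 is treated as "no significant difference".
function looksBalanced(z, threshold = 1.96) {
  return Math.abs(z) < threshold;
}
```

A single run can cross the threshold by chance, so it is a consistent difference across repeated runs that would point to a biased split.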

The reason this "ads / no-ads measurement" use-case is important is that the heuristics we all use to attribute conversions are imperfect. Sometimes they undercount the true value of an ad campaign (for example, on a smart TV, where it's not really possible to click). Sometimes they overcount the true value of an ad campaign (for example, search ads often get the "last click" when someone is searching for something they learned about from an ad campaign they previously saw elsewhere).

Similarly to Extended SKAdNetwork, I think we could add some rate-limiting to prevent abuse. I'm just guessing here, but I suspect that your median browser would not be participating in more than a dozen "ads / no-ads" reports at the same time. If we were to allow for some headroom and cap the total number of "opportunities" that the browser would store for a given "source_site" at, say, 100, I imagine that would be fine. So long as those limits were clearly published, the "source_site" could make sure not to exceed them.
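A sketch of what such a cap might look like in the browser. The constant uses the 100 figure guessed at above, not a value from any spec, and the function name and the policy of rejecting new registrations once the cap is hit (rather than evicting old ones) are assumptions.

```javascript
// Guess floated in this comment, not a spec value.
const MAX_OPPORTUNITIES_PER_SOURCE_SITE = 100;

// Hypothetical per-source_site counter: returns true if the opportunity
// may be stored, false once the cap for that source_site is reached.
function tryStoreOpportunity(countsBySourceSite, sourceSite) {
  const count = countsBySourceSite.get(sourceSite) || 0;
  if (count >= MAX_OPPORTUNITIES_PER_SOURCE_SITE) {
    return false; // over the published limit: ignore the registration
  }
  countsBySourceSite.set(sourceSite, count + 1);
  return true;
}
```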

johnwilander commented 3 years ago

I think this is a really interesting idea.

We need to think about what the 4 bits would mean in terms of tracking capabilities, since there is no user gesture or navigation limiting registrations of impressions. The obvious cross-site data leakage is website X asking "Does UserY use website Z?", which would be possible without UserY ever interacting with websites X and Z in any joint or shared form. Knowledge of usage of website Z may leak information on all kinds of things, like sexual preferences, faith, political opinions, etc.

In our toolbox we have the regular things like local noise including dropped or bogus reports.
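As a rough illustration of what local noise could look like, the sketch below drops a report with some probability and replaces its data bits with random values with another. The function name, the probabilities, and the exact mechanics are assumptions for illustration only, not how PCM or any browser actually behaves.

```javascript
// Hypothetical local-noise step applied before a report is scheduled.
// `random` is injectable so the behaviour can be checked deterministically.
function applyLocalNoise(report, { dropProbability, bogusProbability }, random = Math.random) {
  const roll = random();
  if (roll < dropProbability) {
    return null; // real report silently dropped
  }
  if (roll < dropProbability + bogusProbability) {
    // Bogus report: keep the sites, randomize the 1-bit and 4-bit payloads.
    return {
      ...report,
      source_id: Math.floor(random() * 2),
      attribution_trigger_data: Math.floor(random() * 16),
    };
  }
  return report; // report passes through unchanged
}
```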

benjaminsavage commented 3 years ago

> "Does UserY use website Z?"

How would one figure this out using this API? With just 1-bit of source_id and 4-bits of attribution_trigger_data it seems like you couldn't resolve this to a particular user.

Am I missing something?

johnwilander commented 3 years ago

> "Does UserY use website Z?"
>
> How would one figure this out using this API? With just 1-bit of source_id and 4-bits of attribution_trigger_data it seems like you couldn't resolve this to a particular user.
>
> Am I missing something?

The big challenge with view-through on the web is that there is nothing gating or limiting a website from covertly pushing impressions and conversions for the purpose of tracking users. Here's how they would do it:

Week 1: trackerSocial.example decides it wants to figure out if John is Sikh. His activity on trackerSocial.example is pointing in that direction. So trackerSocial.example pushes impressions for sikhSiteA.example through sikhSiteN.example in John's browser and for no one else. This requires zero action by John other than visiting trackerSocial.example. Then trackerSocial.example converts on sikhSiteA.example through sikhSiteN.example and waits for the signal. It will now know that any attribution reports coming back will be for John and will reveal which Sikh websites he visits.

Week 2: trackerSocial.example decides it wants to figure out if Jenny is into manga …

benjaminsavage commented 3 years ago

> So trackerSocial.example pushes impressions for sikhSiteA.example through sikhSiteN.example in John's browser and for no one else.

Yes, in principle the source_site could do this. However, the information leakage is limited by:

  1. The retention period of the "opportunity" signal. If the browser stores this for 1 week, as with PCM, then there is a "once per week" limit. If we extend the retention to, say, a month, there is a "once per month" limit.
  2. The number of [eTLD + 1]s which receive a meaningful amount of traffic from the visitors of source_site.

So for a source_site like Facebook, which has literally billions of users, it would technically be possible to associate each high-traffic [eTLD + 1] with one of those users per week (or month).

This strikes me as an economically uninteresting attack vector since it just doesn't scale. In that sense, it's similar to the PCM attack vector of allocating a specific source_id to a specific person, so as to definitively track them using PCM. Yes, it's technically possible, but there are only 256 source_ids available to use for measurement, and the average ad campaign reaches 10^5? 10^6? 10^7? more? people. Using each source_id to measure a cohort seems like it would produce more economic value for the source_site, making this attack similarly economically uninteresting.

Another tool available is policy. I noticed in the WebKit blog post the paragraph entitled "Misuse or Use Together With Tracking May Lead To Blocking". This is yet another tool in our toolbox. Policies that prohibit these economically uninteresting but technically possible behaviors are something else worth considering.