privacycg / private-click-measurement

Private Click Measurement
https://privacycg.github.io/private-click-measurement/

The "Etsy Issue" #60

Open benjaminsavage opened 3 years ago

benjaminsavage commented 3 years ago

There are many things about how digital ads are run today that are going to change. It's an interesting question to ask: "What will not change?"

One thing that will not change is the existence of small businesses; in particular, small merchants who do not have their own eTLD+1 registered. Registering an eTLD+1 and hosting a website specific to your business is a pretty high bar to demand of all businesses.

Etsy is a great example of a platform small businesses use today to easily get a presence on the internet. Furthermore, Etsy also offers merchants the ability to configure their own Facebook pixel. This makes it possible for Etsy sellers to run ads on Facebook and measure the number of resulting conversions.

We can't support these merchants using "Private Click Measurement" right now. The way the spec is currently written, ALL ads that run on facebook.com and direct to ANY part of etsy.com would be eligible to take credit for ANY conversion fired from ANY part of etsy.com. Unfortunately, this is not a particularly useful statistic for the individual merchants who sell their wares on etsy.com.

I would love to work together to find some solution that enables these merchants to continue counting the total number of conversions their Facebook ads drive. I believe a privacy-preserving solution should exist that enables this functionality. All we need is a way to compute a smaller intersection set.

For example, here is a shop on Etsy: www.etsy.com/uk/shop/LaurenAstonDesigns

Let's imagine that this Shop ran an ad on facebook.com that linked to one of their products, perhaps this one (www.etsy.com/uk/listing/386566588/pink-chunky-knit-cushion-bright-pink). We need some way for that ad to say: "Clicks on this ad should NOT take credit for ANY conversion on etsy.com - they should ONLY take credit for conversions that happen on items in the shop: www.etsy.com/uk/listing/386566588"

Then, within the internals of WebKit, when we are processing a conversion event and trying to decide which of the clicks requesting attribution are eligible to take credit, we would need to apply some additional checks. Assuming there are some pending clicks, in addition to checking that those clicks originally directed to some resource on "etsy.com", we would need to check whether they were MORE specific about the set of pages on that site for which they would like to take credit. If so, we might find that some of the clicks are not eligible for this specific conversion and skip over them.
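To make the extra check concrete, here is a minimal sketch in Python. It is not how WebKit implements attribution; the `PendingClick` type, its `path_prefix` field, and the `eligible_clicks` function are all hypothetical names invented for illustration. The conversion's registrable domain is passed in explicitly, because deriving it properly requires the Public Suffix List.

```python
from dataclasses import dataclass
from typing import List, Optional
from urllib.parse import urlsplit


@dataclass
class PendingClick:
    source_site: str       # site where the ad was clicked, e.g. "facebook.com"
    destination_site: str  # registrable domain the ad linked to, e.g. "etsy.com"
    # Hypothetical filter: the click only wants credit for conversions
    # on pages under this path. None means "any page on the site".
    path_prefix: Optional[str] = None


def path_matches(path: str, prefix: str) -> bool:
    """Segment-aware prefix match, so /listing/12 does not match /listing/123."""
    prefix = prefix.rstrip("/")
    return path == prefix or path.startswith(prefix + "/")


def eligible_clicks(clicks: List[PendingClick],
                    conversion_url: str,
                    conversion_site: str) -> List[PendingClick]:
    """Return the pending clicks allowed to take credit for this conversion."""
    path = urlsplit(conversion_url).path
    eligible = []
    for click in clicks:
        # Existing check: the click must have directed to the converting site.
        if click.destination_site != conversion_site:
            continue
        # Additional check: skip clicks that asked for a narrower set of pages.
        if click.path_prefix is not None and not path_matches(path, click.path_prefix):
            continue
        eligible.append(click)
    return eligible
```

With this shape, a click that declared the prefix `/uk/listing/386566588` would be skipped for a conversion fired on any other listing, while a click with no filter would behave as it does today.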

I'm not particular about how we solve this problem - I'd just like to see some solution. Previously, we've suggested the concept of "conversion filters" (Issue #36). Happy to continue iterating on that idea, or discuss alternatives.

benjaminsavage commented 3 years ago

Hi @johnwilander - just checking in here to see if you've had any thoughts on this topic. I'd be happy to discuss at the Privacy CG as well. Should we add this issue to the agenda?

A few more domains that present similar issues for you to consider:

I was served an ad on Instagram for this star-tracker kickstarter project yesterday: https://www.kickstarter.com/projects/benropolaris/polaris-smart-electric-tripod-head

...as well as an ad for this hydroponic planter on indiegogo: https://www.indiegogo.com/projects/terraplanter--3

Both of these websites provide hosting for many businesses and are similarly affected by this issue.

At the last Privacy CG you mentioned in passing that it seemed similar to the "public suffix list". I've been thinking more about that comment, and I think there might be something there.

What if there were a publicly accessible list that mapped subdomains/paths to businesses? Internally, for the purposes of PCM, we could just convert full URLs into "virtual [eTLD + 1]s" by following the mappings from that public list.

e.g.

| Input | Virtual [eTLD + 1] |
| --- | --- |
| www.etsy.com/uk/listing/386566588/pink-chunky-knit-cushion-bright-pink | 386566588.etsy.virtual |
| https://www.etsy.com/uk/listing/748794925/wooden-parallettes-handstand-non-slip | 748794925.etsy.virtual |
| www.kickstarter.com/projects/benropolaris/polaris-smart-electric-tripod-head | benropolaris.kickstarter.virtual |
| https://www.kickstarter.com/projects/popoca/from-pop-up-to-popoca | popoca.kickstarter.virtual |
| https://www.indiegogo.com/projects/terraplanter--3 | terraplanter--3.indiegogo.virtual |
| https://www.indiegogo.com/projects/evo-shaver-world-s-smallest-shaver-ever | evo-shaver-world-s-smallest-shaver-ever.indiegogo.virtual |

If PCM just thought of all resources that started with these prefixes as existing on these "virtual domains", it seems like it would solve for this use-case while offering the same privacy protections as if these businesses had actually all set up their own websites.
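A rough sketch of that conversion in Python follows. The `RULES` table stands in for the public list, and its regular expressions are guesses at the URL shapes in the examples above, not anything these sites publish; `virtual_site` is likewise a hypothetical name.

```python
import re
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical stand-in for the public list: host -> (path pattern, label).
# The first capture group of each pattern becomes the virtual subdomain.
RULES = {
    "www.etsy.com": (re.compile(r"^/(?:[a-z]{2}/)?listing/(\d+)(?:/|$)"), "etsy"),
    "www.kickstarter.com": (re.compile(r"^/projects/([^/]+)(?:/|$)"), "kickstarter"),
    "www.indiegogo.com": (re.compile(r"^/projects/([^/]+)(?:/|$)"), "indiegogo"),
}


def virtual_site(url: str) -> Optional[str]:
    """Map a full URL to a "virtual [eTLD + 1]", or None if no rule applies."""
    if "://" not in url:
        url = "https://" + url  # allow scheme-less inputs such as "www.etsy.com/..."
    parts = urlsplit(url)
    rule = RULES.get(parts.hostname or "")
    if rule is None:
        return None
    pattern, label = rule
    match = pattern.match(parts.path)
    if match is None:
        return None
    return f"{match.group(1)}.{label}.virtual"
```

A caller would fall back to the real registrable domain whenever `virtual_site` returns `None`, so sites that are not on the list keep today's behavior.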

johnwilander commented 3 years ago

Thanks for filing, Ben! This is on my mind and I'm open to suggestions on how to support these kinds of use cases.

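Even a static list akin to the PSL would allow for bucketing to boost the entropy. A small merchant could for instance set up their attribute-on website domains like this:

1234356789.merchantMarketplace.example = Merchant A for customers on the US East Coast
987654321.merchantMarketplace.example = Merchant A for customers in Western Europe
987612345.merchantMarketplace.example = Merchant A for customers in the Middle East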

Then mirror their site for each of those subdomains and when the report comes in, they'll know much more about the user than intended. Note that there is nothing inherently wrong with wanting rough geo location to be part of the report but the intention is that you'd have to use your 8+4 bits to encode it, i.e. it should be part of the tradeoff.
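The tradeoff can be put in numbers. The short Python sketch below assumes the "8+4 bits" are PCM's 8-bit source ID and 4-bit trigger data, and treats each extra attribute-on domain per merchant as one more bucket; the function name is made up for illustration.

```python
import math

SOURCE_ID_BITS = 8     # assumed meaning of the "8" in "8+4 bits"
TRIGGER_DATA_BITS = 4  # assumed meaning of the "4"


def report_bits(buckets_per_merchant: int) -> float:
    """Upper bound on bits a report can carry when one merchant
    splits itself across several attribute-on domains."""
    return SOURCE_ID_BITS + TRIGGER_DATA_BITS + math.log2(buckets_per_merchant)
```

With a single domain the bound is the intended 12 bits. Three regional domains, as in the example above, add log2(3), roughly 1.58 bits, on top of that budget rather than inside it.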

You could also have rigs like:

We'd have to be able to guarantee that there's a 1-to-1 relationship between subdomains and merchants.

benjaminsavage commented 3 years ago

> We'd have to be able to guarantee that there's a 1-to-1 relationship between subdomains and merchants.

Totally agree that this is the goal. I was thinking that by making the mapping public, it would provide the level of transparency which could help identify it being used for things other than just merchants. I recognize that this would only provide retroactive identification of mis-use, but it could be a step.

Perhaps there could be some sort of an approval / application process by which a given user-agent accepts an updated version of this public list. Perhaps a merchant marketplace could agree to some kind of set of guidelines for how this should and should not be used, and only once a given UA has confirmed their acceptance of these terms respect the mapping for that merchant marketplace?

> A small merchant could for instance set up their attribute-on website domains like this:
> 1234356789.merchantMarketplace.example = Merchant A for customers on the US East Coast
> 987654321.merchantMarketplace.example = Merchant A for customers in Western Europe
> 987612345.merchantMarketplace.example = Merchant A for customers in the Middle East

Sure, but this is already a risk with vanilla PCM is it not? I could just register a number of [eTLD + 1]s for my business, and when running ads for them, ask the publisher to select the [eTLD + 1] that maps to the geographical region of the IP-address of the user to whom they are showing the ad, could I not?

Basically, there is no limit in PCM today on the number of [eTLD + 1] which can be measured. In principle, if you really wanted to, one could register a unique [eTLD + 1] per visitor to a website! (Although this might be costly and not realistic).

johnwilander commented 3 years ago

> > We'd have to be able to guarantee that there's a 1-to-1 relationship between subdomains and merchants.

> Totally agree that this is the goal. I was thinking that by making the mapping public, it would provide the level of transparency which could help identify it being used for things other than just merchants. I recognize that this would only provide retroactive identification of mis-use, but it could be a step.

> Perhaps there could be some sort of an approval / application process by which a given user-agent accepts an updated version of this public list. Perhaps a merchant marketplace could agree to some kind of set of guidelines for how this should and should not be used, and only once a given UA has confirmed their acceptance of these terms respect the mapping for that merchant marketplace?

Maybe. I know there's significant resistance against static lists, but Google is pursuing First Party Sets, which is another such case, so the door is not completely closed.

> > A small merchant could for instance set up their attribute-on website domains like this:
> > 1234356789.merchantMarketplace.example = Merchant A for customers on the US East Coast
> > 987654321.merchantMarketplace.example = Merchant A for customers in Western Europe
> > 987612345.merchantMarketplace.example = Merchant A for customers in the Middle East

> Sure, but this is already a risk with vanilla PCM is it not? I could just register a number of [eTLD + 1]s for my business, and when running ads for them, ask the publisher to select the [eTLD + 1] that maps to the geographical region of the IP-address of the user to whom they are showing the ad, could I not?

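Two things make that case different:

  1. Different registrable domains cannot share cookies or other storage. Hence, they have no way of directing the user to the "right" registrable domain for conversion later.
  2. In such a case, the user at least has a chance of seeing what's going on in the URL bar, i.e. websites' domains looking like 14635merchant.example.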

> Basically, there is no limit in PCM today on the number of [eTLD + 1] which can be measured. In principle, if you really wanted to, one could register a unique [eTLD + 1] per visitor to a website! (Although this might be costly and not realistic).

Correct. But how do you get the user back to the custom registrable domain? If they search for your product, they'll find your canonical website. You would have to bounce track the user dispatch style which means you're back to cross-site tracking.

benjaminsavage commented 3 years ago

> But how do you get the user back to the custom registrable domain? If they search for your product, they'll find your canonical website.

I was mostly thinking of the case where the website doesn't receive much organic traffic from search, but is receiving most of the traffic from source_site. In that case, if the source_site has a concept of user identity, it can perform this mapping.

> Different registrable domains cannot share cookies or other storage. Hence, they have no way of directing the user to the "right" registrable domain for conversion later

They could use a deterministic algorithm to map IP address blocks onto registrable domains. This deterministic algorithm could be utilized by both source and destination site.
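As an illustration of the threat being described (not a recommendation), such a mapping could be as simple as hashing the IP block. The domain names, the /16 block size, and the function name in this Python sketch are all assumptions made for the example.

```python
import hashlib
import ipaddress

# Hypothetical registrable domains owned by one business.
DOMAINS = [
    "merchant-a.example",
    "merchant-b.example",
    "merchant-c.example",
    "merchant-d.example",
]


def domain_for_ip(ip: str, prefix_len: int = 16) -> str:
    """Deterministically map the IP block containing `ip` to one domain."""
    # strict=False lets us pass a host address and get its enclosing network.
    block = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
    digest = hashlib.sha256(str(block).encode("ascii")).digest()
    return DOMAINS[int.from_bytes(digest[:4], "big") % len(DOMAINS)]
```

Because the function depends only on the IP block, the source site and the destination site can each run it independently and land on the same domain without sharing cookies or storage.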

benjaminsavage commented 3 years ago

> In such a case, the user at least has a chance of seeing what's going on in the URL bar, i.e. websites' domains looking like 14635merchant.example

This is a great point. I think we may be able to leverage some type of transparency tool like this here as well.

I know we've discussed a UI which shows you information about the ads you've interacted with and the attribution reports the browser has sent out. I think that's a great idea. Could we leverage that idea here? Could we show you the "Merchant" to which the browser believes an ad / conversion maps? Could we supply metadata about that Merchant (e.g. name, website prefix, contact info, etc.) in a similar way to how you've suggested we display ad metadata?

dveditz commented 3 years ago

> Let's imagine that this Shop [www.etsy.com/uk/shop/LaurenAstonDesigns] ran an ad on facebook.com that linked to one of their products, perhaps this one (www.etsy.com/uk/listing/386566588/pink-chunky-knit-cushion-bright-pink). We need some way for that ad to say: "Clicks on this ad should NOT take credit for ANY conversion on etsy.com - they should ONLY take credit for conversions that happen on items in the shop: www.etsy.com/uk/listing/386566588"

This example illustrates one complication trying to retrofit this proposal into existing sites. "386566588" is the listing, not the shop (as the url path says). Any random bits after the slash (or nothing at all) will get you to the same single listing from that shop. That shop has other listings with numbers that have no relation to that listing number. For ad-click attribution we'd want to credit the shop, which appears nowhere in the URL.

There's no way browsers could ship with a PSL-equivalent mapping for all "shops" on similar sites -- it'd be huge and constantly out of date. I'm not inclined to trust some /.well-known/-type mapping published by the sites themselves. Maybe such a mapping provided by an enumerated set of trusted "store hosters" (I'd trust one hosted on Etsy, for instance), but that gives incumbents an unfair advantage.

johnwilander commented 3 years ago

I tried looking at this from new angles to see if there are potential solutions we are missing. One idea popped up.

The main reason why we can’t support subdomains without enabling bucketing or even user identifying domains is that there is website-controlled, covert sharing across subdomains. The two main sharing mechanisms are cookies and document.domain but WebKit for instance partitions storage and cache based on registrable domain so that’s another sharing vector.

The idea that came to mind was that multi merchant sites could willfully give up those capabilities, forcing all of their own storage and state to be tied to origins. There is no such mechanism today but we could create one. There was work in W3C WebAppSec on such a thing as a security measure. Dan knows what I’m referring to.

Note that this would completely isolate each merchant and there would be no SSO or joint user account spanning multiple merchants.

Would multi merchant sites be interested in such a setup? Too disruptive? Maybe their whole business model relies on knowing what the user does at all the merchants’ mini sites?

(Note that this would not solve all the problems but it could be a starting point. I do also worry about bounce tracking that effectively joins all these subdomains back together so there may need to be restrictions on popups, top frame redirects, and navigations too.)
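The difference between the two keying schemes can be sketched in a few lines of Python. This is only an illustration of the idea: the `origin_isolated` flag is a hypothetical opt-in (as noted above, no such mechanism exists today), the hostnames are invented, and `registrable_domain` naively takes the last two labels instead of consulting the Public Suffix List.

```python
from urllib.parse import urlsplit


def registrable_domain(host: str) -> str:
    # Naive assumption for the sketch: the last two labels.
    # A real implementation would consult the Public Suffix List.
    return ".".join(host.split(".")[-2:])


def partition_key(url: str, origin_isolated: bool) -> str:
    """Key under which a page's storage and state would be grouped."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if origin_isolated:
        # Hypothetical opt-in: state is tied to the full origin.
        port = f":{parts.port}" if parts.port else ""
        return f"{parts.scheme}://{host}{port}"
    # Default described above: state is shared per registrable domain.
    return registrable_domain(host)
```

Without the opt-in, `shop-a.marketplace.example` and `shop-b.marketplace.example` share one key; with it, each merchant subdomain gets its own, which is what would rule out joint accounts across merchants.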

benjaminsavage commented 3 years ago

> I tried looking at this from new angles to see if there are potential solutions we are missing.

Thank you John, I really appreciate you taking the time to think through this one!

> The idea that came to mind was that multi merchant sites could willfully give up those capabilities, forcing all of their own storage and state to be tied to origins.

Well, from a technology perspective that definitely makes sense. It would then effectively be separate websites! This would definitely unify the threat model with that for multiple registrable domains... but at what cost?

> Would multi merchant sites be interested in such a setup? Too disruptive? Maybe their whole business model relies on knowing what the user does at all the merchants’ mini sites?

Now we are out of my area of expertise. I am aware of multi-merchant sites having unified cross-merchant checkout capabilities. But when it comes to how important such capabilities are and their willingness to go without, I cannot possibly comment on their behalf.

Let's try to get some engineers on this thread who work at such companies. I'll try to reach out to folks. If you know engineers who fit the bill, by all means ask too. I'd love to broaden this conversation to include these parties directly.

benjaminsavage commented 3 years ago

> This example illustrates one complication trying to retrofit this proposal into existing sites. "386566588" is the listing, not the shop (as the url path says).

Ah... you're absolutely right @dveditz - good catch!

> There's no way browsers could ship with a PSL-equivalent mapping for all "shops" on similar sites -- it'd be huge and constantly out of date. I'm not inclined to trust some /.well-known/-type mapping published by the sites themselves.

Yeah, I have similar concerns.

As you point out, this really is a question of how to "retrofit" existing websites.

Once again, I'm really not in a place to comment here as I don't work at such a company. I'll try to see if I can invite engineers from these companies to join this discussion. But here are a few high-level observations I can make:

  1. One option is for such companies to make no modifications to their site layouts / capabilities. At the moment, this seems like it's headed towards a world in which individual merchants cannot measure the results of advertisements they run for their mini-shops.
  2. Another option is what John proposed. These websites could make the (major) change to fully separate the mini-sites with technological guardrails. This has the possibility to re-enable ads measurement but at the cost of potentially critical site functionality (like x-merchant checkout and recommendations).
  3. One option could be a variant of what I've suggested. These websites could re-structure the way they use URLs. If they are currently using the path to show the listing, this might require a change to ensure the initial prefix indicates the merchant. There is a path here to maintain ads measurement. I'm not sure if this is enough for John to be happy to NOT partition storage. Would this still require them to give up cross-merchant capabilities? A predictable and rigid structure might at least help us avoid the need to bake a PSL-equivalent into browsers.
  4. Another option is for these companies to instead offer something more like wordpress / shopify, where merchants actually DO have separate URLs per site. This would likely be a costly change, it loses some x-merchant functionality, and it might require each individual merchant to pay for registering a domain.
  5. Other ideas we haven't thought of yet?

This is a pretty grim set of options... I really hope we find alternatives! In the meantime, let's try to bring more engineers to this discussion.