johnwilander commented 3 years ago

Two requests were brought up at a recent Privacy CG call, and I said I'd write up the privacy analysis of why we think attribution reports cannot go to third parties or to anything other than the registrable domain (eTLD+1).

Why Not Attribution Reports To Third Parties?

Some have requested that the click source site should be able to assign a reporting URL/domain other than its own. Others have requested that a third-party such as the host of an iframe where the click happens should be the one receiving the report.

Neither of these meets our privacy requirements. In both cases, the domains can be chosen to convey further information about the click.

Imagine for instance social.example where the ad click happens saying they want reports to go to johnwilander-social.example when I'm logged in there and to janedoe-social.example when Jane Doe is logged in. That would take us back to cross-site tracking in the subsequent report.

Similarly, ad links can be made to be served in iframes from johnwilander-social.example or janedoe-social.example to achieve the same level of cross-site tracking.

Even Worse With Custom eTLDs

This issue becomes worse with tracking companies owning their own eTLDs, under which it's virtually free for them to register new domains. They could simply put a unique event ID in the domain, such as 487f90aa469c6234.customTLD, and be back to web-scale, event-level cross-site tracking.

Why Not Attribution Reports To Subdomains?

Some have requested that attribution reports be sent to the full domain of the site where the click happens and similarly the full domain of the site where the conversion happens.

Neither of these meets our privacy requirements. In both cases, subdomains can be chosen to convey further information about the click or conversion.

Imagine for instance social.example where the ad click happens making sure the site is loaded from the subdomain johnwilander.social.example when I'm logged in there and from the subdomain janedoe.social.example when Jane Doe is logged in. That would take us back to cross-site tracking in the subsequent report.

The reason for restricting PCM reports to registrable domains is that the scheme+registrable domain, a.k.a. schemeful site, is the only part of a URL that is free from link decoration. All other parts can be made user specific, including subdomains.
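To make the restriction concrete, here is a minimal sketch of reducing a URL to its schemeful site. The `PUBLIC_SUFFIXES` set is a hypothetical stand-in for the real Public Suffix List, and `schemeful_site` is an illustrative name, not part of PCM or any browser API.

```python
from urllib.parse import urlsplit

# Hypothetical, tiny stand-in for the Public Suffix List (assumption).
# A real implementation would consult the full list.
PUBLIC_SUFFIXES = {"example", "com", "co.uk"}


def schemeful_site(url: str) -> str:
    """Reduce a URL to scheme + registrable domain (eTLD+1)."""
    parts = urlsplit(url)
    labels = (parts.hostname or "").split(".")
    # Try the longest candidate suffix first, then keep one extra label.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in PUBLIC_SUFFIXES:
            if i == 0:
                raise ValueError("host is itself a public suffix")
            return f"{parts.scheme}://{'.'.join(labels[i - 1:])}"
    raise ValueError("no known public suffix in host")


print(schemeful_site("https://johnwilander.social.example/feed?id=1"))
# https://social.example
```

Both johnwilander.social.example and janedoe.social.example collapse to the same value, so the user-specific subdomain, path, and query never reach the report.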

You could of course imagine social.example setting up a registrable domain per user, such as johnwilander-social.example, and load the whole website from that domain when I'm logged in to get back to cross-site tracking of clicks. If that happens, we'd have to deal with it but at least the user has a chance to see that a personalized domain is used through the URL bar.

johnwilander commented 3 years ago

Ping @csharrison and @erik-anderson.

johnivdel commented 3 years ago

Imagine for instance social.example where the ad click happens saying they want reports to go to johnwilander-social.example when I'm logged in there and to janedoe-social.example when Jane Doe is logged in.

In the Click Through Conversion Measurement Event-Level API, we address this issue by requiring the destination site to trigger attribution using the same origin that reports will be sent to.

Without 3P cookies, in order to properly trigger the attribution, you would need to fire an attribution redirect for every possible identifying origin which will be sent reports. On larger sites this becomes impossible due to the number of potential origins, and browsers can actively limit sites triggering a large number of attribution redirects in an attempt to find the correct identifying origin. This is covered in this section of the Event-level explainer.

A different motivating example for the use of third party reporting origins (or domains):

Ads are served on https://smallblog.example by third party https://adtech.example. One of the ads served links to https://bigadvertiser.example and receives a click and subsequent purchase. How does https://bigadvertiser.example properly trigger attribution to https://smallblog.example? There could be thousands of sites which are showing ads for https://bigadvertiser.example. And if so, it would require thousands of separate GET requests to properly attribute the click.

Is this a valid use-case for PCM?

With the reporting_origin approach in the Event-level explainer, bigadvertiser only needs to fire a GET request to https://adtech.example (and to each of the other adtechs which serve its ads).

johnwilander commented 3 years ago

Imagine for instance social.example where the ad click happens saying they want reports to go to johnwilander-social.example when I'm logged in there and to janedoe-social.example when Jane Doe is logged in.

In the Click Through Conversion Measurement Event-Level API, we address this issue by requiring the destination site to trigger attribution using the same origin that reports will be sent to.

The same goes for PCM, but that only partially mitigates the problem. The click source can iterate through a set of domains on the attribution side. That won't get them to full user IDs on its own, but iterating through just 64 different domains adds another 6 bits of entropy to the resulting report, which allows them to categorize their user base into 64 buckets. That foils the purpose of limiting bits of entropy in source ID and trigger data.
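The arithmetic behind the 6 bits can be checked with a short sketch. The function name `extra_bits` is illustrative only.

```python
import math


def extra_bits(num_domains: int) -> float:
    """Bits of entropy conveyed by choosing one of num_domains reporting domains."""
    return math.log2(num_domains)


print(extra_bits(64))  # 6.0
```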

Without 3P cookies, in order to properly trigger the attribution, you would need to fire an attribution redirect for every possible identifying origin which will be sent reports. On larger sites this becomes impossible due to the number of potential origins, and browsers can actively limit sites triggering a large number of attribution redirects in an attempt to find the correct identifying origin. This is covered in this section of the Event-level explainer.

A different motivating example for the use of third party reporting origins (or domains):

Ads are served on https://smallblog.example by third party https://adtech.example. One of the ads served links to https://bigadvertiser.example and receives a click and subsequent purchase. How does https://bigadvertiser.example properly trigger attribution to https://smallblog.example? There could be thousands of sites which are showing ads for https://bigadvertiser.example. And if so, it would require thousands of separate GET requests to properly attribute the click.

Is this a valid use-case for PCM?

It'll eventually get to the modern JS API way of signaling a conversion. With that, you can imagine wildcard conversion. The tracking pixel redirect was never intended as a scalable, good solution for the future. It's a legacy support measure.

Obviously, supporting wildcard conversion signaling puts further pressure on the reporting URLs because now there is no scalability issue.

With the reporting_origin approach in the Event-level explainer, bigadvertiser only needs to fire a GET request to https://adtech.example (and to each of the other adtechs which serve its ads).

johannhof commented 3 years ago

I'm very sympathetic to these concerns, but I'm also a bit worried that this might lead to a situation where setting up an effective reporting structure including a third-party ad tech provider is technically challenging for developers and thus makes it harder for smaller parties (on all sides) to effectively run their business.

I also think that there's actually a chance for additional transparency to the user when the reporting origin is clearly revealed in browser UI instead of hidden behind a server-side redirect by the click source.

Did you consider letting the ad click destination list all possible reporting domains in some .well-known location in advance instead? That feels like it would resolve the issue of custom eTLDs without sacrificing too much flexibility for developers.

johnwilander commented 3 years ago

I'm very sympathetic to these concerns, but I'm also a bit worried that this might lead to a situation where setting up an effective reporting structure including a third-party ad tech provider is technically challenging for developers and thus makes it harder for smaller parties (on all sides) to effectively run their business.

The reports go to a very specific location. I imagine there will be services offered to listen to incoming requests on that endpoint, parse the data, and communicate it according to how the business is set up. By always sending data to first parties, we align with user expectations and there cannot be any doubt in who is in control of that data.

I also think that there's actually a chance for additional transparency to the user when the reporting origin is clearly revealed in browser UI instead of hidden behind a server-side redirect by the click source.

I don't follow this. Where would this browser UI be and what would it show?

Did you consider letting the ad click destination list all possible reporting domains in some .well-known location in advance instead? That feels like it would resolve the issue of custom eTLDs without sacrificing too much flexibility for developers.

First of all, there would have to be a limit on the size of that list and I'm not sure we can come up with a tradeoff between usefulness and opportunity for misuse.

Second, there is no way for browsers to know that all users are being served the same list when calling that .well-known location for a particular site. The list could change based on incoming cookies if such are sent or on network properties if not.

Finally, there is no way for a browser to know if e.g. 16 domains on that list are 16 distinct and legitimate reporting domains or if they are 16 domains owned by the same tracking company, allowing that company to categorize users into 16 buckets, effectively granting them 4 extra bits of entropy.

csharrison commented 3 years ago

Did you consider letting the ad click destination list all possible reporting domains in some .well-known location in advance instead? That feels like it would resolve the issue of custom eTLDs without sacrificing too much flexibility for developers.

I think this is a great idea.

First of all, there would have to be a limit on the size of that list and I'm not sure we can come up with a tradeoff between usefulness and opportunity for misuse.

I wouldn't jump to such a conclusion too soon. I wouldn't be surprised if many advertisers only have a very small number of 3rd parties they want looped into their measurements.

Second, there is no way for browsers to know that all users are being served the same list when calling that .well-known location for a particular site. The list could change based on incoming cookies if such are sent or on network properties if not.

I believe there are technical enforcements to make sure this is the case. A few quick ideas:

Query the list in two different contexts (i.e. on publisher / advertiser sites), without cookies, and ensure consistency before proceeding. This was the technique we originally used for Trust Tokens key consistency. With this technique, you can make the guarantee that being able to fingerprint using the list implies that the user has already been tracked via something like network fingerprinting (in which case there is no added benefit of the list).
Use a trusted server side component to fetch the lists from advertisers, and have browsers query that component for the complete set of lists. This is the basic approach Chrome ended up using for Trust Tokens using our Component Updater.
Server audit + blocklist (i.e. have a server query the list to ensure consistency, put cheaters on a static block list)
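The first idea, querying the list from more than one context and requiring identical answers, could be sketched roughly as follows. The `fetch` callable, the context names, and the two fake servers are all hypothetical; this is not how any browser implements the check.

```python
from typing import Callable

# fetch(site, context) returns the endpoint list that `site` serves when
# queried without cookies from `context` (hypothetical interface).
Fetcher = Callable[[str, str], list[str]]


def endpoints_consistent(fetch: Fetcher, site: str, contexts: list[str]) -> bool:
    """True if every context sees the same set of reporting endpoints."""
    seen = {frozenset(fetch(site, context)) for context in contexts}
    return len(seen) <= 1


def honest(site: str, context: str) -> list[str]:
    # Serves the same list no matter who asks.
    return ["adtech.example"]


def cheating(site: str, context: str) -> list[str]:
    # Varies the list per context to smuggle information.
    return ["adtech.example", f"bucket-{context}.example"]


contexts = ["publisher", "advertiser"]
print(endpoints_consistent(honest, "example.shop", contexts))    # True
print(endpoints_consistent(cheating, "example.shop", contexts))  # False
```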

I am sure we could come up with other ideas in this space.

Finally, there is no way for a browser to know if e.g. 16 domains on that list are 16 distinct and legitimate reporting domains or if they are 16 domains owned by the same tracking company, allowing that company to categorize users into 16 buckets, effectively granting them 4 extra bits of entropy.

Agree this should be in our threat model.

johannhof commented 3 years ago

I don't follow this. Where would this browser UI be and what would it show?

Oh, I was referring to the (very hypothetical) UI we thought up in #54 and related discussions. The user could have a way to see pending attributions with the tuple (click source, ad source, reporting origin) instead of just (click source, ad source) which might give more transparency over who will end up handling the data.

First of all, there would have to be a limit on the size of that list and I'm not sure we can come up with a tradeoff between usefulness and opportunity for misuse.

As @csharrison said, keeping this list very small (4 slots?) might already be enough. We could still allow reporting to the click origin to ensure that we don't disadvantage models that don't centralize data collection through an ad tech service (if those exist).

Second, there is no way for browsers to know that all users are being served the same list when calling that .well-known location for a particular site. The list could change based on incoming cookies if such are sent or on network properties if not.

What's the difference between that and the final attribution request, meaning couldn't the same measures be applied to both requests, e.g. not sending cookies and delaying the request by 24h? (Without the other measures Charlie described) the advertiser may encode additional data based on IP address but won't the receiver observe the same address then, making additional identifiers unnecessary?

Finally, there is no way for a browser to know if e.g. 16 domains on that list are 16 distinct and legitimate reporting domains or if they are 16 domains owned by the same tracking company, allowing that company to categorize users into 16 buckets, effectively granting them 4 extra bits of entropy.

I agree that enumerating reporting origins is another potential source of bits for a dedicated attacker, by grouping users into n buckets using custom eTLDs. Leaving this unchecked is essentially destroying all privacy guarantees the spec otherwise tries to enforce. Hence I'm trying to suggest a way to at least control n so that we can reason about how much entropy we're adding in the worst case and find a compromise based on that.

johnwilander commented 3 years ago

I don't follow this. Where would this browser UI be and what would it show?

Oh, I was referring to the (very hypothetical) UI we thought up in #54 and related discussions. The user could have a way to see pending attributions with the tuple (click source, ad source, reporting origin) instead of just (click source, ad source) which might give more transparency over who will end up handling the data.

Although I'm in favor of the idea of transparency UI, I don't think it'll serve as a meaningful defense against misuse, especially not for the majority of users. Users know about first party websites, that's about it.

First of all, there would have to be a limit on the size of that list and I'm not sure we can come up with a tradeoff between usefulness and opportunity for misuse.

As @csharrison said, keeping this list very small (4 slots?) might already be enough. We could still allow reporting to the click origin to ensure that we don't disadvantage models that don't centralize data collection through an ad tech service (if those exist).

There is always a risk of creating barriers to entry with such small limitations. Who do you think will get to be among the 4? 😕 Additionally, PCM has received positive feedback for its support of multiple sources of attribution, including for a single conversion.

Second, there is no way for browsers to know that all users are being served the same list when calling that .well-known location for a particular site. The list could change based on incoming cookies if such are sent or on network properties if not.

What's the difference between that and the final attribution request, meaning couldn't the same measures be applied to both requests, e.g. not sending cookies and delaying the request by 24h? (Without the other measures Charlie described) the advertiser may encode additional data based on IP address but won't the receiver observe the same address then, making additional identifiers unnecessary?

IP address tracking is a separate thing that doesn't improve or worsen any of this. It will have to be dealt with separately. We shouldn't use IP address tracking as an argument as to why we don't need other protections. PCM is designed to be privacy-preserving on the web platform level.

Finally, there is no way for a browser to know if e.g. 16 domains on that list are 16 distinct and legitimate reporting domains or if they are 16 domains owned by the same tracking company, allowing that company to categorize users into 16 buckets, effectively granting them 4 extra bits of entropy.

I agree that enumerating reporting origins is another potential source of bits for a dedicated attacker, by grouping users into n buckets using custom eTLDs. Leaving this unchecked is essentially destroying all privacy guarantees the spec otherwise tries to enforce. Hence I'm trying to suggest a way to at least control n so that we can reason about how much entropy we're adding in the worst case and find a compromise based on that.

There is zero room for additional bits of entropy. If there was any additional room, it should be spent on the specified data values sourceID and triggerData. If we start saying we can allow a few more bits, we've lost the privacy-preserving properties of PCM which is one of the reasons for doing it at all.

erik-anderson commented 3 years ago

I think that Charlie's technical enforcement ideas have merit. Each idea adds some level of additional complexity for the browser vendor, either with standing up a service to cross-check what different clients are seeing (and reasoning through the privacy implications of that service) and/or having additional networking requests from different contexts.

@johnwilander do you have any thoughts on the viability of those mitigations?

johnwilander commented 3 years ago

I think that Charlie's technical enforcement ideas have merit. Each idea adds some level of additional complexity for the browser vendor, either with standing up a service to cross-check what different clients are seeing (and reasoning through the privacy implications of that service) and/or having additional networking requests from different contexts.

@johnwilander do you have any thoughts on the viability of those mitigations?

I think they add too much complexity, not just for browser vendors but also for developers. Complexity often translates to barrier to entry which can lead to only large, well-funded adtech vendors being able to set everything up.

Just to drill into potential complexities, let's say the user clicks an ad on Day 1, converts on Day 4, and the report is supposed to be sent to ThirdParty.example on Day 6. On which day should the browser check the validity of the ThirdParty.example endpoint for this advertiser? If it's not Day 6, things may have changed and there's no way for someone inspecting the advertiser's website on Day 6 to see that reports are allowed to go to ThirdParty.example. If it is Day 6, then we have to support some kind of time stamping of report endpoints so that the advertiser's website can state that "Conversions between these timestamps can go to ThirdParty.example but nowadays I don't use ThirdParty.example anymore because I've switched to OtherThirdParty.example."

In addition, I think too much time and effort is spent on trying to cater for the old ways of doing things. PCM is not trying to be a drop-in replacement for how things worked in the world of cross-site tracking and tons of third-parties collecting and sending data. This is about a new world where it's reasonably clear to users, developers, and advertisers where data is sent. If they want to share that data with a business partner, it's on them to make that clear to their users and also to live up to legal requirements for data sharing in the current jurisdiction.

michaelkleber commented 3 years ago

@johnwilander I'm having a hard time seeing your argument that they "add too much complexity[...] for developers."

For a typical website that wants to use a third-party reporting endpoint, their marginal setup effort could be as low as adding a static text file at a .well-known location. This is orders of magnitude simpler than setting up any server-side proxying approach or splitting off a CNAME'd subdomain.

Sure, there is some complexity for browser vendors, which is appropriate; that's our job. But none of these potential complexities seem particularly challenging.

johnwilander commented 3 years ago

@johnwilander I'm having a hard time seeing your argument that they "add too much complexity[...] for developers."

For a typical website that wants to use a third-party reporting endpoint, their marginal setup effort could be as low as adding a static text file at a .well-known location. This is orders of magnitude simpler than setting up any server-side proxying approach or splitting off a CNAME'd subdomain.

Sure, there is some complexity for browser vendors, which is appropriate; that's our job. But none of these potential complexities seem particularly challenging.

For a simple, benign, "happy path" case, it might work functionally. That would be advertiser.example choosing adtech.example as its one and only reporting endpoint forever.

But unless we allow them to change the endpoint, we will truly create a barrier to entry. So some kind of managed change needs to be supported.

Allowing multiple reporting endpoints could allow for flexibility in a purely additive way as long as the advertiser doesn't reach the limit and perhaps cover most cases. But as mentioned above, a set of reporting endpoints can be doctored to convey cross-site data.

Let's then assume that we restrict it to a single reporting endpoint that can be changed. If we don't add further restrictions, advertiser.example can cycle through reporting domains time01.example, time02.example, …, time24.example by the hour and that way encode when in time the conversion happened. I.e. cross-site data leakage.
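A minimal sketch of that leak, assuming the advertiser maps each UTC hour to one domain (time01.example for 00:00 to 00:59, and so on; the exact mapping is an assumption made for illustration):

```python
from datetime import datetime, timezone


def reporting_domain_for(conversion_time: datetime) -> str:
    """Advertiser side: pick the reporting domain for the current hour."""
    return f"time{conversion_time.hour + 1:02d}.example"


def hour_leaked_by(domain: str) -> int:
    """Receiver side: recover the conversion hour from the domain alone."""
    return int(domain[len("time"):domain.index(".")]) - 1


converted_at = datetime(2021, 1, 14, 17, 30, tzinfo=timezone.utc)
domain = reporting_domain_for(converted_at)
print(domain)                  # time18.example
print(hour_leaked_by(domain))  # 17
```

The report body never carries a timestamp, yet the choice of endpoint reveals the hour of the conversion.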

If we say the change needs to be mirrored in time on the click side, you now have a sync issue where the advertiser needs to tell all publishers that when the clock strikes twelve, all must change some file on their server. And even that would be susceptible to gaming with synchronized changes every eight hours (asiaTracking.example, emeaTracking.example, and americaTracking.example) or every three hours (morningTracking.example, middayTracking.example etc).

To solve for that we'd need browsers checking multiple times at random which will create a risk of data loss during changes to the reporting endpoint or create a barrier to entry because no one ever wants to change their reporting endpoint.

csharrison commented 3 years ago

It seems to me that checking the desired reporting endpoint at report-send time avoids all the timing attacks. It does introduce some complexity if you go with an approach that checks for consistency across contexts, although there are technical solutions beyond that approach. Even in that case though I don't think it's an unreasonable amount of complexity. When you want to add another reporting endpoint you can "pre-declare" it in your configuration and you include a date when you want it in effect.
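One possible reading of "pre-declare it with a date" is sketched here. The configuration shape (a list of effective-date and endpoint pairs) and the function name are assumptions, not a proposed format.

```python
from datetime import date
from typing import Optional


def endpoint_in_effect(config: list[tuple[date, str]], send_date: date) -> Optional[str]:
    """Return the endpoint whose effective date is the latest one not after send_date."""
    current = None
    for effective_from, endpoint in sorted(config):
        if effective_from <= send_date:
            current = endpoint
    return current


# Hypothetical configuration: the switch to a new endpoint is declared ahead of time.
config = [
    (date(2021, 1, 1), "thirdparty.example"),
    (date(2021, 2, 1), "otherthirdparty.example"),
]
print(endpoint_in_effect(config, date(2021, 1, 20)))  # thirdparty.example
print(endpoint_in_effect(config, date(2021, 2, 3)))   # otherthirdparty.example
```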

johnwilander commented 3 years ago

It seems to me that checking the desired reporting endpoint at report-send time avoids all the timing attacks. It does introduce some complexity if you go with an approach that checks for consistency across contexts, although there are technical solutions beyond that approach. Even in that case though I don't think it's an unreasonable amount of complexity. When you want to add another reporting endpoint you can "pre-declare" it in your configuration and you include a date when you want it in effect.

I agree that a single reporting endpoint grabbed at the time of reporting is the safest from a cross-site tracking perspective, simply because it is equivalent to the report going to either the click source or the advertiser, and them either forwarding that info or redirecting the request (if allowed).

What remains then is the user perspective and transparency.

The user perspective. Users don't know about third-parties and we are sending data about their activity. Users won't expect their user agent to send that data to a third-party and it'll be hard to explain to them why.

Transparency. There will be no way to tell the user where their data will go before the time of reporting. This means there will be no way for the user to inspect a website at the time of ad click or time of conversion to see where data about their activity will ultimately go. At a random time 24-48 hours later, a report might be sent to a third-party domain they have never seen and will never see.

dialtone commented 3 years ago

The user perspective. Users don't know about third-parties and we are sending data about their activity. Users won't expect their user agent to send that data to a third-party and it'll be hard to explain to them why.

I think it's a bit simplistic to think that just because the user sees a URL then automatically it will go to just the party owning the URL. Millions of businesses rely on Shopify for putting up their shopping cart solution. In this case it's really Shopify getting the data, will the user not be surprised by it? Shopify then shares this data server side with all of the apps in the app exchange. There are many other equivalent platforms that help you build sites and collect data and share it around.

johnwilander commented 3 years ago

The user perspective. Users don't know about third-parties and we are sending data about their activity. Users won't expect their user agent to send that data to a third-party and it'll be hard to explain to them why.

I think it's a bit simplistic to think that just because the user sees a URL then automatically it will go to just the party owning the URL. Millions of businesses rely on Shopify for putting up their shopping cart solution. In this case it's really Shopify getting the data, will the user not be surprised by it? Shopify then shares this data server side with all of the apps in the app exchange. There are many other equivalent platforms that help you build sites and collect data and share it around.

I'm not saying users will understand that data will go to the first party. I'm just saying they will be almost guaranteed to not understand that the data goes to a third-party straight from their browser.

I think our best chance of getting users' buy-in on measurement of online advertising is making it as easy as possible to explain to them what's happening and align the measurement practice with their mental model of browsing the web.

michaelkleber commented 3 years ago

The user perspective. Users don't know about third-parties and we are sending data about their activity. Users won't expect their user agent to send that data to a third-party and it'll be hard to explain to them why.

Every time a browser renders a web page, it offers a channel by which the first party triggers arbitrary requests to third parties.

Transparency. There will be no way to tell the user where their data will go before the time of reporting. This means there will be no way for the user to inspect a website at the time of ad click or time of conversion to see where data about their activity will ultimately go. At a random time 24-48 hours later, a report might be sent to a third-party domain they have never seen and will never see.

Sure, but it seems reasonable to communicate this appropriately. "This report will be sent in approximately 17 hours, to shoes.example, or to a 3rd-party service that collects data on their behalf. (Shoes.example currently uses nifty-analytics.example for this.)"

dmarti commented 3 years ago

There might be a middle ground between the extremes of same site only and allowing reports to go to arbitrary URLs. Could the browser limit the URLs that can receive reports to a set of developer-friendly but tracking-unfriendly patterns based on the first party URL?

So a click when the first party is example.com could send reports to

serviceprovider.com/privateclick/example.com
privateclick.serviceprovider.com/example.com

but not to 487f90aa469c6234.customTLD?
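A rough sketch of such a pattern check, assuming the two shapes listed above and an https scheme (the function name and the scheme are illustrative assumptions):

```python
from urllib.parse import urlsplit


def is_allowed_report_url(report_url: str, first_party: str) -> bool:
    """Accept only report URLs that embed the first party in a fixed pattern."""
    parts = urlsplit(report_url)
    host = parts.hostname or ""
    path = parts.path.strip("/")
    # Pattern 1: <provider>/privateclick/<first party>
    if path == f"privateclick/{first_party}":
        return True
    # Pattern 2: privateclick.<provider>/<first party>
    if host.startswith("privateclick.") and path == first_party:
        return True
    return False


print(is_allowed_report_url("https://serviceprovider.com/privateclick/example.com", "example.com"))  # True
print(is_allowed_report_url("https://privateclick.serviceprovider.com/example.com", "example.com"))  # True
print(is_allowed_report_url("https://487f90aa469c6234.customTLD", "example.com"))                    # False
```

Note that this sketch checks only the shape of the URL; it does not by itself vet the provider's domain.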

johannhof commented 3 years ago

Just to repeat the current state of our alternative suggestion as I see it (mixing ideas from different comments here):

Walking through a regular click to conversion event:

In its code, the ad click source specifies a link with a destination as specified by PCM.
The user clicks on the ad and is led to the destination.
In the background, from the ad click source, the browser queries a list of .well-known possible reporting endpoints and stores it alongside the ad click.
The browser now optionally adds an entry to its UI detailing conversion reports: example.news wants to measure whether your click leading to example.shop has resulted in a conversion. [example.news has designated the following services to receive that information: {list of endpoints}].
The user converts on the advertiser page.
The advertiser sends the conversion calls to its supported endpoints as specified by PCM, thereby explicitly naming them at conversion time.
If the browser has no click event stored for the advertiser origin, it discards the request and stops.
If none of the requested endpoints matches any of the endpoints stored for the click based on the .well-known info from the ad click source AND the requested endpoint isn't same-site with the ad click source, the browser discards the request and stops.
The browser now optionally updates the entry to its UI detailing conversion reports: A click you made on example.news has led to a conversion on example.shop. [The two websites have chosen adtech.example to handle this information for them.] Click here to see the data that will be sent to adtech.example. The request will be sent in 23 hours.
A random time later, the browser requests the .well-known list of supported reporting endpoints from the ad click destination.
If the designated reporting endpoint appears in this list, too, then finally the report is sent to that endpoint.

(the only request in this list that happens in a non-isolated fashion, sending cookies etc. is the top-level navigation to the destination)
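The browser-side checks in the walkthrough above could be sketched as follows. This is only an illustration: the type and function names are hypothetical, endpoints are modeled as plain site strings, and "same-site with the ad click source" is simplified to string equality.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Set


@dataclass(frozen=True)
class StoredClick:
    source_site: str                  # ad click source, e.g. "example.news"
    destination_site: str             # ad click destination, e.g. "example.shop"
    source_endpoints: FrozenSet[str]  # .well-known list fetched from the source at click time


def endpoints_kept_at_conversion(click: Optional[StoredClick], requested: Set[str]) -> Set[str]:
    """Conversion-time checks: which requested endpoints stay pending."""
    if click is None:
        return set()  # no click stored for this advertiser: discard the request
    return {
        endpoint
        for endpoint in requested
        if endpoint in click.source_endpoints or endpoint == click.source_site
    }


def may_send_report(endpoint: str, destination_endpoints: Set[str]) -> bool:
    """Report-time check against the destination's freshly fetched .well-known list."""
    return endpoint in destination_endpoints


click = StoredClick("example.news", "example.shop", frozenset({"adtech.example"}))
pending = endpoints_kept_at_conversion(click, {"adtech.example", "other.example"})
print(pending)                                                # {'adtech.example'}
print(may_send_report("adtech.example", {"adtech.example"}))  # True
```

An empty result from the conversion-time check corresponds to the browser discarding the request. The report-time check is applied literally as written in the last step, including for an endpoint that is same-site with the click source.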

Now a practical issue for websites and adtech is of course changing their infrastructure in the time before conversion reports are sent out. While we shouldn't completely ignore that issue, in practice I don't think it should drive our decision-making here.

The other thing that we haven't fully resolved from a privacy perspective is hard-coded "bucketizing" on the side of the ad-tech company to add (a low number of) additional bits. I think that this is controllable both by strongly limiting the number of allowed reporting endpoints as well as simply enforcing regulatory action against advertisers, ad sources and ad tech. All of them have a high stake in not getting their domain denylisted while browsers are at little risk in blocking (or normalizing) conversion requests from bad players.

So I think/hope this can cover all the (very valid) concerns from @johnwilander (except more complexity at the browser side which I think is a fair price to pay for us). Specifically I hope that the UI pieces make it clear how I would see this being presented/explained to users.

Please let me know if I got anything wrong. :)

johannhof commented 3 years ago

@dmarti What about 487f90aa469c6234.customTLD/privateclick/example.com in that scenario?

dmarti commented 3 years ago

@johannhof The browser would have to match some known TLDs based on their registration policies and costs. .com would be in because a registration costs more than a click is worth, and other domains could make it in based on a privacy policy that disallows registration of tracking-specific domains. (This list could be started based on TLDs represented in the existing Disconnect list, or IAB TCF vendorlist.json. If a company has a reporting endpoint on a different TLD that they want a browser to use, they could send a pull request to add it, with a link to the relevant policies.)

If I understand the bucketizing problem, it's tracking companies registering say 64 domains and encoding a unique ID based on which domains the click is reported to. Agree that the number of endpoints should be limited in the browser. Possibly shuffle the list of endpoints for a click, always report to the first n endpoints on the shuffled list, then start randomly dropping the extras.
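A rough sketch of the shuffle-and-cap idea, under stated assumptions: the function name, the value of n, and the per-extra drop probability are all hypothetical, and "randomly dropping the extras" is read here as an independent coin flip for each endpoint beyond the first n.

```python
import random
from typing import Optional


def endpoints_to_report(endpoints: list[str], n: int,
                        drop_probability: float = 0.5,
                        rng: Optional[random.Random] = None) -> list[str]:
    """Shuffle the endpoints, always keep the first n, randomly drop extras."""
    rng = rng or random.Random()
    shuffled = list(dict.fromkeys(endpoints))  # de-duplicate, keep order
    rng.shuffle(shuffled)
    kept = shuffled[:n]
    # Each endpoint beyond the first n survives only if the coin flip allows it.
    kept += [e for e in shuffled[n:] if rng.random() >= drop_probability]
    return kept


domains = [f"d{i}.example" for i in range(64)]
print(len(endpoints_to_report(domains, 4, drop_probability=1.0)))
```

Because the browser, not the site, decides which endpoints survive, the set of endpoints that actually receive a report no longer encodes an ID reliably.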

johannhof commented 3 years ago

Recapping the Jan 14 conversation on this issue:

We asked industry representatives for their opinion on:

the model proposed in this comment vs. reporting to the ad click source only, as in the original design
single-endpoint vs. multi-endpoint structures

There was generally a very positive sentiment towards the newly proposed model of flexible reporting endpoints via declaration in .well-known, with participants saying that it would be hard for smaller parties to adopt this standard otherwise. There was a sentiment that smaller publishers may find forwarding technically very challenging, though at least one larger vendor said that they could probably make things work in a forwarding model. Multiple parties also brought up the concern that forwarding could lead to lock-in effects where it's hard for publishers to switch providers, compared with just returning a different reporting endpoint.

A lot of folks also supported the idea of enabling a multi-endpoint model, again noting that it would otherwise be hard for smaller sites and marketers to pick this up. I guess it's up for discussion to what extent we can enable this while keeping potential "bucketizing" in check.

johnivdel commented 3 years ago

In the Jan 14th conversation I alluded to optionally using some of the publisher side bits to choose among multiple reporters.

This would address the issue of lock-in with a single reporter. If multiple reporters were allowed, this avoids having to send a single party's report to all of the other reporters.

The concern was that this would allow personalized endpoints due to having access to the 1P cookie when the anchor tag is configured.

I think if we combined this idea with the approach in this comment, there would be no risk of personalization. Concretely:

Take two (as an example) of the publisher side bits and encode them as a reporterindex attribute. At each of the steps where the .well-known is checked in the aforementioned comment, only use the reporting endpoint at the specified reporterindex.

The checks at click + report time prevent any form of personalization from being effective.
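A minimal sketch of the reporterindex idea, assuming the .well-known resource resolves to an ordered list of endpoints and that two bits (values 0 to 3) are used. The function names are hypothetical.

```python
from typing import Optional


def reporter_at(well_known: list[str], reporter_index: int) -> Optional[str]:
    """Return the endpoint at reporter_index, or None if the list is too short."""
    if not 0 <= reporter_index <= 3:
        raise ValueError("reporterindex must fit in two bits")
    if reporter_index >= len(well_known):
        return None
    return well_known[reporter_index]


def endpoint_for_report(list_at_click: list[str],
                        list_at_report: list[str],
                        reporter_index: int) -> Optional[str]:
    """Only report if the indexed endpoint is the same at click and report time."""
    at_click = reporter_at(list_at_click, reporter_index)
    at_report = reporter_at(list_at_report, reporter_index)
    if at_click is None or at_click != at_report:
        return None
    return at_click


print(endpoint_for_report(["a.example", "b.example"],
                          ["a.example", "b.example"], 1))
```

The index only selects among endpoints that are published for everyone, so it cannot name a per-user endpoint.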

gmeroz commented 3 years ago

Did you consider letting the ad click destination list all possible reporting domains in some .well-known location in advance instead? That feels like it would resolve the issue of custom eTLDs without sacrificing too much flexibility for developers.

I think it's a good idea. This will provide advertisers the opportunity to use unbiased solutions to measure their attribution without depending on the publisher to report the data directly to them. It will also reduce development from the publishers.

johnwilander commented 3 years ago

Did you consider letting the ad click destination list all possible reporting domains in some .well-known location in advance instead? That feels like it would resolve the issue of custom eTLDs without sacrificing too much flexibility for developers.

I think it's a good idea. This will provide advertisers the opportunity to use unbiased solutions to measure their attribution without depending on the publisher to report the data directly to them. It will also reduce development from the publishers.

We already intend to send the attribution report directly to the merchant/advertiser so that won’t be an issue: https://github.com/privacycg/private-click-measurement/issues/53

dialtone commented 3 years ago

It doesn't seem that this issue has resolved, or fully discussed, the topic of who should receive the report of a conversion. Take for example the discussion on May 20th at the virtual face-to-face on fraud, and how incentives between publishers and advertisers are misaligned in the identification of fraud.

Most of the time campaigns will be executed by a DSP, which is the intermediary that will need to pay the publisher and receive payment from the advertiser but will have no data of its own to determine what is true or matching.

At the same time, when it comes to fraud, publisher and advertiser can disagree about how something is defined as fraud without a third party capable of providing an impartial view of what happened.

Lastly, I think it's still unclear from the current discussions how a small business is going to be able to manage any of this, unless the platform or technology that they use allows them to work with all of the measurement parties, DSPs, fraud vendors or brand safety vendors. That limits the small advertiser's choice to those that are integrated with the platform they use for their site, rather than whoever guarantees the best service. It's clear that a big advertiser can build their own tech to manage this, but smaller businesses should be able to advertise on the web as well without needing to invest in a team for fraud detection, brand safety, campaign trafficking, measurement, data management and so on.

michael-oneill commented 3 years ago

Source IP addresses, especially when combined with other identifiers such as the FLoC cohort ID, can uniquely identify the user. If PCM implementations started sending reports directly to parties other than the click source or destination, then personal data would be processed, and regulators would point out that the data subject has to give informed consent for it.

This undermines the whole point of PCM.

This would also be true if third-parties arranged for reports to be sent via a client-side redirect of the /.well-known resource.

Reports could be automatically redirected server side, and CMSs and platforms could provide that capability.

dialtone commented 3 years ago

Well, Safari will now hide the original IP address from trackers, so the tracking you mention is less relevant, if relevant at all. I'm not really sure what tracking could happen here at this point, and not sending the report to third parties will just make the cost resolution problem worse, among the many other things already described.

michael-oneill commented 3 years ago

I read it as just for email, but is it web also? I agree widescale IP blindness could make a big difference for privacy protective tech.

dialtone commented 3 years ago

Incidentally, this is also what happens on SKAdNetwork: reports will now be sent to both the advertiser and the ad network, which is great.

vincentsaluzzo commented 3 years ago

As mentioned by @dialtone, on SKAdNetwork, advertisers and ad networks were able to receive attribution reports. That was mainly possible on SKAdNetwork because the ad displayed by a publisher was signed and guaranteed by an approved AdNetwork. The publisher indicates that this ad network is a valid ad network for them.

Why not do the same kind of solution for PCM?

The ad provided by a network carries the AdNetworkID of the network to inform in case of a successful attribution. When a click is made on the ad, the browser stores the campaign-unique ID, as the current PCM implementation does, plus the AdNetworkID. When the conversion event is fired on the advertiser website, the browser has to send an attribution report to the publisher domain, but also to the AdNetwork domain identified by its ID.

In this situation, a concern remains when there are multiple publisher websites reporting click events for the same campaign. But this concern also exists on SKAdNetwork. Using a timestamp to report only the last click event associated with the campaign could be a solution.

johnwilander commented 3 years ago

As mentioned by @dialtone, on SKAdNetwork, advertisers and ad networks were able to receive attribution reports. That was mainly possible on SKAdNetwork because the ad displayed by a publisher was signed and guaranteed by an approved AdNetwork. The publisher indicates that this ad network is a valid ad network for them.

Why not do the same kind of solution for PCM?

The ad provided by a network carries the AdNetworkID of the network to inform in case of a successful attribution. When a click is made on the ad, the browser stores the campaign-unique ID, as the current PCM implementation does, plus the AdNetworkID. When the conversion event is fired on the advertiser website, the browser has to send an attribution report to the publisher domain, but also to the AdNetwork domain identified by its ID.

In this situation, a concern remains when there are multiple publisher websites reporting click events for the same campaign. But this concern also exists on SKAdNetwork. Using a timestamp to report only the last click event associated with the campaign could be a solution.

Without discussing SKAdNetwork which is not a proposed standard and doesn’t exist on the web, the reason why PCM cannot send reports to third parties while protecting against cross-site tracking is explained in the description of this issue:

This issue becomes worse with tracking companies owning their own eTLDs under which it's virtually free for them to register new domains. They could simply put a unique event ID in the domain, such as 487f90aa469c6234.customTLD and be back to web scale event-level cross-site tracking.

vincentsaluzzo commented 3 years ago

Without discussing SKAdNetwork which is not a proposed standard and doesn’t exist on the web, the reason why PCM cannot send reports to third parties while protecting against cross-site tracking is explained in the description of this issue:

This issue becomes worse with tracking companies owning their own eTLDs under which it's virtually free for them to register new domains. They could simply put a unique event ID in the domain, such as 487f90aa469c6234.customTLD and be back to web scale event-level cross-site tracking.

Could we imagine a registry of allowed third-party domains? The owner of customTLD would have to register the TLD or TLD+1 they use for reporting, for example report-attribution.customTLD, and could only refer to this domain in ads.

In your example, suppose an ad network declares in the ad that reports should go to 487f90aa469c6234.customTLD. This domain will not exist in the registry, so the browser will not send a report to the third party. If they put report-attribution.customTLD, the browser knows it and sends the report without any user information, as expected.

That would prevent any use of domains for cross-site tracking.
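The registry lookup described above could look roughly like this. It is only a sketch: the registry contents, its name, and the exact-match rule are assumptions made for illustration, not an existing list or API.

```python
# Hypothetical registry of reporting domains approved in advance.
REGISTRY = frozenset({"report-attribution.customtld"})


def is_registered_reporting_domain(host: str,
                                   registry: frozenset = REGISTRY) -> bool:
    """Exact match only: unregistered hosts under the same TLD are rejected."""
    return host.strip().rstrip(".").lower() in registry


print(is_registered_reporting_domain("report-attribution.customTLD"))
print(is_registered_reporting_domain("487f90aa469c6234.customTLD"))
```

Matching the full host name rather than a suffix is what stops a unique ID from being placed in the domain.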

gmeroz commented 3 years ago

This issue becomes worse with tracking companies owning their own eTLDs under which it's virtually free for them to register new domains. They could simply put a unique event ID in the domain, such as 487f90aa469c6234.customTLD and be back to web scale event-level cross-site tracking.

As long as there's only one domain to report the attribution to, it prevents cross-site tracking. We can't expect every publisher and every advertiser to develop their own systems to deal with this process, and it's likely they will prefer to use a third party to do that. The current proposal prevents that.

dialtone commented 3 years ago

Without discussing SKAdNetwork which is not a proposed standard and doesn’t exist on the web, the reason why PCM cannot send reports to third parties while protecting against cross-site tracking is explained in the description of this issue:

This issue becomes worse with tracking companies owning their own eTLDs under which it's virtually free for them to register new domains. They could simply put a unique event ID in the domain, such as 487f90aa469c6234.customTLD and be back to web scale event-level cross-site tracking.

I sympathize with the issue, and think we should find a solution to that, but the proposed setup isn't free from problems. The current proposal where data doesn't flow to the third party directly basically incentivizes server side integrations like Facebook Conversions API in which significant amounts of PII data can be exchanged with minimal to no ability to enforce any sort of compliance from a regulator or a browser. If we're looking to make the web more privacy preserving we should consider also the consequences of certain decisions. I don't think in good faith anyone could consider this way of sharing data an improvement over the existing way.

jbpringuey commented 3 years ago

Maybe it would be easier to resolve this issue if it was separated in 2 issues?

Attribution reports cannot go to third-parties
Attribution reports cannot go to anything else than the registrable domain (eTLD+1)

For smaller publishers, using a third-party platform (and thus domain) to help them monetize their inventory is often more economically viable than having their own IT and sales force. They need to publish content, work on SEO, find advertising budgets and a lot more. If we allow only one third-party reporting domain per impression and enforce the registrable domain (eTLD+1) only, it could work, no? My understanding is that ITP will start to proxy tracker calls to hide the IP. After this it would be virtually impossible to identify someone, no?

michael-oneill commented 3 years ago

The iCloud IP relay service is subscription only. It's a great service and it would be nice if it was for everybody, but it's not, so it does not solve this privacy issue.

If it was available, and the browser could detect it, then maybe this could work in those instances.

dialtone commented 3 years ago

The iCloud IP relay service is subscription only. It's a great service and it would be nice if it was for everybody, but it's not, so it does not solve this privacy issue.

If it was available, and the browser could detect it, then maybe this could work in those instances.

Actually, it is available in the browser for free, I think, and I think I read that PCM will send the reports through it anyway.

michael-oneill commented 3 years ago

Can you post a link?

dialtone commented 3 years ago

Here you go. For both the PCM reports and hiding the IP from trackers, there's no note about Private Relay being needed: https://developer.apple.com/documentation/safari-release-notes/safari-15-beta-release-notes

michael-oneill commented 3 years ago

This is great, it should go into the spec, i.e. if browsers use IP hiding, reports can go to arbitrary destinations (as long as the URLs are entropy limited, e.g. no subdomains and a fixed /.well-known path).

johnwilander commented 3 years ago

If we allow only one third-party reporting domain per impression and enforce the registrable domain (eTLD+1) only, it could work, no?

One third-party reporting domain per impression would immediately defeat the privacy protections, just as I explained above: "They could simply put a unique event ID in the domain, such as 487f90aa469c6234.customTLD and be back to web scale event-level cross-site tracking."

For this to work, it would have to be one third-party per website and stable over time.

The way we'd have to do it is check the endpoint twice – once at click time, and once at attribution report time, to make sure it stays the same. That's the design for checking the public key of the optional click fraud token. I believe such a design was briefly discussed above or in a related issue.

IP protection is indeed what is needed here. It enables click fraud tokens too. We'd likely have to write the spec so that only browsers with IP address protection should allow click fraud tokens and third-party reporting domains.

johnwilander commented 3 years ago

Note that such a spec update would say that IP address protection is only needed for these specific parts of PCM, not in general. I.e. it wouldn't make PCM impossible to support without paying browser customers or a bunch of other funding.

jbpringuey commented 3 years ago

One third-party reporting domain per impression would immediately defeat the privacy protections, just as I explained above: "They could simply put a unique event ID in the domain, such as 487f90aa469c6234.customTLD and be back to web scale event-level cross-site tracking."

For this to work, it would have to be one third-party per website and stable over time.

Publishers and advertisers, especially small and medium businesses, often put multiple ad tech vendors in competition to get the most revenue. Would one (or a max of 10?) third-party FQDN per ad tech vendor work? For example, only https://www.tracking.adtech.com would be allowed at one time for adtech.com, but not 487f90aa469c6234.adtech.com etc. It would be a much better option in my opinion if it works on the security side.

michael-oneill commented 3 years ago

Maybe the IP relay ingress server could count the number of subdomains in third-party domains, and refuse to forward them after the list goes above 10. There would have to be a side channel so it can recognise PCM reports, see the target URL, etc.

dialtone commented 3 years ago

Ingress doesn't know the URL of the site being requested, but I had similar ideas.

johnwilander commented 3 years ago

Publishers and advertisers, especially small and medium businesses, often put multiple ad tech vendors in competition to get the most revenue. Would one (or a max of 10?) third-party FQDN per ad tech vendor work? For example, only https://www.tracking.adtech.com would be allowed at one time for adtech.com, but not 487f90aa469c6234.adtech.com etc. It would be a much better option in my opinion if it works on the security side.

Do you mean to send the attributions to the same set of third-parties every time? Because if sites were allowed to pick and choose per impression or conversions, they could do so to boost the number of bits that get through.

One thing to be aware of is that as soon as the configured third-party domain(s) change(s), all pending clicks and attributions will be deleted, since the browser will not be able to tell if the changed configuration is an attempt to boost the number of bits that get through. I would imagine that with a possibly larger set of third-party domains, the likelihood of desired changes will be larger too.

johnwilander commented 3 years ago

Think of it this way: Everything that is site configuration has to be static for reports to be sent. If anything changes, the browser will bail out and delete all matching data. All dynamism has to go into the source ID and the trigger data. Those values are specified to exactly control how much data can get through in a report. If anything else is allowed to be dynamic, that can be made to carry more bits of data which will 1) violate the intended privacy guarantees of the feature, and 2) lock those extra bits to that particular dynamism and it's always better to just boost source ID and trigger data if we believe we can protect privacy anyway. Better because source ID and trigger data are easier to reason about, easier to explain, and provide maximum flexibility to developers.

jbpringuey commented 3 years ago

Does that work if https://www.tracking.adtech.com is the only valid and static tracking FQDN to record conversions for adtech.com on the internet? If that FQDN changes, all impressions, clicks or conversions for adtech.com would be deleted. That FQDN could be specified after the attributiondestination HTML attribute as attributionreporting.

Let's summarize:

When an ad is displayed, a new attribute (attributionreporting for example) is specified alongside the existing ones attributionsourceid and attributiondestination
An attribution reporting domain can register only one FQDN for a given browser instance
To enforce this limitation, we can imagine a mechanism embedded in the browser: the first time a domain registers an attributionreporting, it is stored somewhere in the browser. As long as the same FQDN is declared for a given domain, everything works. The day a domain decides to change its attributionreporting, all pending attributions are invalidated and the new FQDN replaces the old one in the dedicated storage of the browser.
All HTTP tracking calls going to an attributionreporting are proxied using the upcoming mechanism in ITP
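The browser-side storage described in the summary could be sketched as follows. The class and method names are hypothetical, and the sketch assumes the attribution that triggers a change is kept under the new FQDN. It covers a single browser instance only.

```python
class ReportingPins:
    """Remembers one reporting FQDN per ad tech domain (hypothetical sketch)."""

    def __init__(self) -> None:
        self._pinned: dict[str, str] = {}          # domain -> pinned FQDN
        self._pending: dict[str, list[str]] = {}   # domain -> pending attributions

    def record(self, domain: str, fqdn: str, attribution: str) -> bool:
        """Store an attribution; return True if older pending data was invalidated."""
        invalidated = False
        pinned = self._pinned.get(domain)
        if pinned is not None and pinned != fqdn:
            # The declared FQDN changed: drop everything pending for this domain.
            self._pending[domain] = []
            invalidated = True
        self._pinned[domain] = fqdn
        self._pending.setdefault(domain, []).append(attribution)
        return invalidated

    def pending(self, domain: str) -> list[str]:
        return list(self._pending.get(domain, []))


pins = ReportingPins()
pins.record("adtech.com", "www.tracking.adtech.com", "click-1")
pins.record("adtech.com", "487f90aa469c6234.adtech.com", "click-2")
print(pins.pending("adtech.com"))
```

Usage: after the second call the first click is gone, because the declared FQDN no longer matches the one stored on first use.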

michael-oneill commented 3 years ago

How do you stop each browser instance getting a different value for attributionreporting, or how does one browser instance detect that?

privacycg / private-click-measurement

Why attribution reports cannot go to third-parties and to anything else than the registrable domain #57