patcg / private-measurement

A place to discuss Private Measurement

Interoperable Private Attribution (IPA) #9

Open eriktaubeneck opened 2 years ago

eriktaubeneck commented 2 years ago

@benjaminsavage, @martinthomson, and I have been working on a proposal, "Interoperable Private Attribution (IPA)" that addresses the aggregate attribution measurement use case, similar to those listed in patcg/private-measurement#8.

We'd love to have this considered and discussed at the January PATCG meeting, for consideration in maturing it further through collaboration among this community group.

tgeoghegan commented 2 years ago

Section 4.1 states that the "aim is for IPA to be compatible with the Privacy Preserving Measurement (PPM) specification". Does that mean you intend to express IPA as a VDAF?

eriktaubeneck commented 2 years ago

Does that mean you intend to express IPA as a VDAF?

Yes, our intention is to work towards that. There are a few major components, some of which likely require expression as a new VDAF, but we hope to leverage the existing work with prio3 and/or poplar1 where possible.

alextcone commented 2 years ago

Consider me a strong +1 in support of this getting on the agenda.

ekr commented 2 years ago

In case it helps, I spent a bunch of time working through the math of IPA in some detail at: https://educatedguesswork.org/posts/vaccine-tracking/. I was interested in another application, but if you found the ElGamal blinding and shuffling a bit hard, this might help.

ansuz commented 2 years ago

@eriktaubeneck It seems like permissions are required to view the draft on google docs. Is a publicly accessible version of the draft available anywhere else?

eriktaubeneck commented 2 years ago

@ansuz document is back up, though it is now read-only.

gsnedders commented 2 years ago

While the proposal mentions:

We would also like to call out the work happening in the WICG on the Attribution Reporting API and Privacy Preserving Ads, which was highly influential to this work.

It would be nice for the proposal to directly compare itself with such prior work (and also the Privacy CG's Private Click Measurement). As it is, it is not immediately clear as to what the motivations are behind this proposal rather than furthering work on developing the other proposals.

ShivanKaul commented 2 years ago

Is there a reason all the questions on the doc were removed + ability to comment revoked?

Lexicality commented 2 years ago

Presumably because the internet is currently extremely angry about this?

benjaminsavage commented 2 years ago

Is there a reason all the questions on the doc were removed + ability to comment revoked?

The document was completely defaced, fully deleted with "suggestions" and replaced with vulgarities. As such, the document is now "read-only" access.

ShivanKaul commented 2 years ago

Ah, I see, sorry to hear that. Is the plan to move it to GitHub? Last time I read it there were a few undefined terms and the flow was not entirely clear to me, would be good to get clarifying answers.

bmayd commented 2 years ago

The document was completely defaced, fully deleted with "suggestions" and replaced with vulgarities. As such, the document is now "read-only" access.

@benjaminsavage Does this lead you to consider docs unsuitable as a collaboration tool for this work or do you think it can be avoided going forward? I don't want to continue advocating for their use if the latter is not the case.

Med1cinal commented 2 years ago

Does this lead you to consider docs unsuitable as a collaboration tool for this work or do you think it can be avoided going forward? I don't want to continue advocating for their use if the latter is not the case.

this might be a solution

santirely commented 2 years ago

Having looked at only the non-technical presentation, I have a couple of comments / questions.

  1. Where would the match keys be stored in the device / browser / OS?
  2. Could we adapt this solution so that it isn't so dependent on companies with a "large footprint"? Although as proposed this does appear to be the most logical solution, it also poses a huge limitation in my view. What happens if suddenly Facebook or Google decide they'll only share their match keys with companies that play well with them?

eriktaubeneck commented 2 years ago

@medicinalcocaine3434 the document was using suggested edits and comments, it was with suggested edits that the document was defaced.

@santirely if you take a look at the technical proposal, you'll find details on your question. Briefly:

Where would the match keys be stored in the device / browser / OS?

We are proposing a new read-only API, which the browser/OS would expose.

Could we adapt this solution so that it isn't so dependent on companies with a "large footprint"?

Any website/app is able to write a match key, so it's not dependent on any set of companies. However, the more cross-device coverage a given company's match key has, the more accurate attribution that uses that match key will be.

What happens if suddenly Facebook or Google decide they'll only share their match keys with companies that play well with them?

We are proposing that any site/app be able to reference any match key. Match keys are not shared, and the ability to reference them is not controlled by the companies that set them.

santirely commented 2 years ago

Ok, that's interesting. A couple follow ups:

Any website/app is able to write a match key, so it's not dependent on any set of companies. However, the more cross-device coverage a given company's match key has, the more accurate attribution that uses that match key will be.

This is true but a significant portion of the value created by the proposal is cross-device tracking, and these companies adopting the solution would be important for that to actually work. I was aiming at something like: What if other smaller apps / publishers could pool their match keys in a way that benefits everyone? For example, a gaming studio like say Epic will have tons of mobile and CTV match keys but almost no web based ones. Opposite is true for someone like The New York Times. Could they create a sort of coop there?

What happens if suddenly Facebook or Google decide they'll only share their match keys with companies that play well with them?

This seems ideal, but isn't that a potential drawback for someone that can provide cross-device attribution by itself? What's Meta's or Google's incentive to be the world's match key providers there?

Lexicality commented 2 years ago

The proposal also seems to assume that the user is logged in to at least one match key provider. What happens if the user is not? Does the browser make up a device identifier? Does it encrypt null? (Presumably collisions could abound there) or does it return an error to the calling script? If a user doesn't want to sign in to Facebook (or uses Facebook Container) does that mean they will never be able to be attributed to a conversion?

Lexicality commented 2 years ago

To give a specific example, say I'm a new user. I install Firefox for the first time and see a full screen advert for Pocket and go "wow this looks great" and decide to sign up for the premium service. Since my browser is in a completely fresh state I won't have any match keys set up at all. How is Mozilla going to know if buying pocket was a good idea or not?

eriktaubeneck commented 2 years ago

This seems ideal, but isn't that a potential drawback for someone that can provide cross-device attribution by itself? What's Meta's or Google's incentive to be the world's match key providers there?

In the absence of 3rd party cookies, cross-site (including cross-device) attribution won't be possible "by itself". It will require a new purpose-constrained API. All companies' incentives to participate will be to power their own attribution (with the side effect of enabling all attribution.)

eriktaubeneck commented 2 years ago

The proposal also seems to assume that the user is logged in to at least one match key provider. What happens if the user is not? Does the browser make up a device identifier?

This is still yet to be determined. One idea is that the device could generate a random match key, which would at least default to "same device attribution".

If a user doesn't want to sign in to Facebook (or uses Facebook Container) does that mean they will never be able to be attributed to a conversion?

This would entirely depend on which match key providers the source sites and trigger sites are using. If those sites are using match key providers that the user is not logged into, then the attribution would likely be missed. 100% coverage has never been possible, but this API is designed to create as much coverage as possible, without enabling user level tracking.

santirely commented 2 years ago

In the absence of 3rd party cookies, cross-site (including cross-device) attribution won't be possible "by itself". It will require a new purpose-constrained API. All companies' incentives to participate will be to power their own attribution (with the side effect of enabling all attribution.)

So what you're saying is that Meta for example would be OK sharing their "match keys" with say Twitter (since Meta's reach is significantly higher, why would Twitter use its own?) because that way the advertiser can also use the match key (and Meta wouldn't be able to match its conversions with the advertiser otherwise). And since they can't decide to share only with the advertiser, they'd be fine with sharing with everyone else.

Seems pretty far-fetched to be honest. Also, isn't Meta's or Google's reach enough that they can match against advertiser's first party logged in data and get even better results?

Lexicality commented 2 years ago

So what you're saying is that Meta for example would be OK sharing their "match keys" with say Twitter (since Meta's reach is significantly higher, why would Twitter use its own?)

If Twitter wants to use Facebook's match keys, that means all the users they show adverts to also need to be logged in to Facebook. This means Twitter needs to incentivise its users to log in to Facebook which directly benefits Facebook.

That's the entire reason Facebook has come up with this proposal - in order for it to work the vast majority of the internet needs to have been identified in some fashion by Facebook, so everyone that uses it will pass their users via Facebook and let them slurp up their data.

You can say "well it works with any provider" but to work effectively it needs a major provider and Google are off doing their own thing so ...

bmilekic commented 2 years ago

If Twitter wants to use Facebook's match keys, that means all the users they show adverts to also need to be logged in to Facebook. This means Twitter needs to incentivise its users to log in to Facebook which directly benefits Facebook.

I don't see how that's true. If Twitter is the publisher, then it can ask their advertisers to register trigger events referencing only twitter's match keys. There is no need for the facebook match keys in that scenario, especially since the ads are running on twitter and so the lack of twitter user ID implies no possible match with advertiser target events.

For non-FB smaller publishers, the proposal provides an important theoretical benefit, in that a publisher can choose to register source events leveraging facebook, twitter, and other third-party keys. In an IPA implementation supporting multiple match keys, this ultimately benefits the publisher as it increases potential match rates with advertiser data.

It remains to be seen what the motivation could be for a match key provider to act as such, but presumably "having an advertising business" would be one motivating reason. I believe that making the match keys usable by other parties in that context is more fair than doing the opposite.

Lexicality commented 2 years ago

I don't see how that's true. If Twitter is the publisher, then it can ask their advertisers to register trigger events referencing only twitter's match keys. There is no need for the facebook match keys in that scenario, especially since the ads are running on twitter and so the lack of twitter user ID implies no possible match with advertiser target events.

I agree it doesn't make much sense, but in the hypothetical that they did want to only use someone else's match key (for whatever reason) then I think my point still stands

It remains to be seen what the motivation could be for a match key provider to act as such, but presumably "having an advertising business" would be one motivating reason. I believe that making the match keys usable by other parties in that context is more fair than doing the opposite.

I don't think fairness comes in to many business decisions. It costs Facebook nothing to allow its competitors to use its match keys, and if everyone relies on them they gain a position of power over the discourse, even if it's just an implicit one.

To be clear, just because I feel like this proposal further entrenches the big players in the ad business by relying on centralised identity services doesn't mean I think it's a bad proposal. As long as the cryptographic stuff works and the ad networks are somehow coerced into dropping their other tracking methods, this is a big step up. But on the other hand if the crypto stuff has a hidden weakness in it and Facebook run one of the "trusted" servers, this is a terrible idea.

benjaminsavage commented 2 years ago

A few comments in response to the thread so far:

  1. On the topic of "what if no match keys are set"

As @eriktaubeneck said - I like the idea of the device just generating a random matchkey. That way the API seamlessly defaults to "same-device-only" attribution, which is at least at par with other proposals.

  2. On the topic of "Who can use a match key once it is set?"

The reason we proposed allowing any company to benefit from match keys set by any other participant, was specifically to try to avoid any kind of system which could be abused by large established players. As @santirely mentions, such a system would give them a lot of leverage to, as he says, choose to only share access with businesses who "play well with them". We opted for an "open reference" proposal specifically to avoid this type of risk.

  3. On the topic of: "What incentive would a company with a large footprint have for setting an open-reference match key?"

As @eriktaubeneck points out, browsers and mobile operating systems are rapidly clamping down on "tracking". Various regulations are doing the same. This means that all businesses (even those with a large footprint) are steadily losing the ability to accurately count the number of conversions attributable to advertising. In a theoretical future world where cookies and device identifiers are all gone, and fingerprinting is impossible, having a "large footprint" will be useless from the perspective of counting conversions which occur off-network on other apps and websites. In such a world, if the only option available for counting conversions is a highly private one, like IPA, then I believe businesses who sell ads will use it (they won't have a choice). In that world, they'll have two options: (i) Do not set a match key. Use a match key set by some other entity (ii) Set a match key - accepting that anyone else who wants to can also use it.

Each entity will have to weigh these alternatives. For a business with a "large footprint" of users who sign in across multiple devices, here is how I think these choices will look: (i) Do not set a match key: If other match-keys are from businesses with a smaller network of users logged-in across devices, taking this approach will have the undesirable side-effect of undercounting the true number of conversions their ads actually drive. In summary: Less accurate measurement. (ii) Set a match key: This will result in more accurate ads measurement - with higher counts of attributed conversions, which more accurately measures the number of conversions their ads drive. As a side-effect however, all competitors will also benefit from more accurate measurement of their ads. In summary: More accurate, but more accurate for everyone.

I posit that there exist businesses for whom the calculus is in favor of option (ii), more accurate measurement being more beneficial than everyone having less accurate measurement.

  4. On the topic of "does this require users to be logged into Facebook?". In the proposal, we talk about the prospect of supporting multiple match keys. We think we can support this without needing to give up any privacy benefits. If that is true, then it would seem optimal for any consumer of this API to select a basket of match-keys which collectively provide good coverage. This has the additional benefit of minimizing the reliance on a single point of failure. I can envision a future where it is common to specify a handful of "large footprint" match key providers to get a good baseline, a few region specific ones to cover parts of the globe which would otherwise be poorly covered, potentially one's own match-key, and finally falling back on the random, per-device specified match key which essentially just provides "same-device only" attribution.

I think all parties (including "large footprint" entities) would all have similar incentives to push them in this fashion.

We've also put a lot of time and thought into trying to ensure there isn't coupling between entities. We think we can design the system in such a way that we do not require collaboration. That is, we want a system where any advertiser who runs ads across N platforms can independently specify which match-keys they want to use, without needing those platforms to all agree with them, or to all agree on something.

5.

As long as the cryptographic stuff works and the ad networks are somehow coerced into dropping their other tracking methods, this is a big step up. But on the other hand if the crypto stuff has a hidden weakness in it and Facebook run one of the "trusted" servers, this is a terrible idea

First of all, I assume that Facebook / Google / any ad-tech company will never be trusted to operate a helper server =). This will be enforced by browsers. They'll have to decide which public keys they are willing to use to encrypt reports. I cannot imagine a world in which Firefox would trust Facebook enough to encrypt these events using Facebook's public key =). I'm assuming we will see non-profits with strong privacy reputations operating the servers, or possibly the types of organizations which operate Apple's "Private Relay" service.

Secondly: Yes, exactly. This proposed system would be a big step up for privacy compared to the status quo mechanisms used to count conversions. I have no expectation that browsers and mobile operating systems will stop trying to clamp down on fingerprinting. Actually, if anything I expect them to accelerate those efforts. I also expect to see more and more regulation along these lines.

That the math works out, and we have a strong privacy guarantee is the key. This is why we are trying to work out in the open - we think that's the best way to find all the problems / issues, and to get help finding solutions to them. We've already benefitted tremendously from outside input. @betuldurak found a really clever attack that a malicious helper node could do. I'm really grateful to her for telling us about it! We're working on finding a solution as we speak.

I think the path towards standardization looks like a bunch of iterations out in the open, publishing papers, getting feedback, addressing problems, repeat. I hope that we can eventually converge on a design that is super solid. I wouldn't expect browser vendors to feel comfortable shipping an API like this unless a bunch of independent academics were all convinced that it met our design goals.

chris-wood commented 2 years ago

I think the path towards standardization looks like a bunch of iterations out in the open, publishing papers, getting feedback, addressing problems, repeat. I hope that we can eventually converge on a design that is super solid. I wouldn't expect browser vendors to feel comfortable shipping an API like this unless a bunch of independent academics were all convinced that it met our design goals.

Agreed on the approach =) What's the best way to follow along with the proposed solution(s) that you're working on to address @betuldurak's attack? Is the attack documented anywhere?

martinthomson commented 2 years ago

https://educatedguesswork.org/posts/ipa-overview/#appendix%3A-linear-relation-attacks perhaps.

We've initiated a few discussions with cryptographers; nothing public as yet.

sthaase commented 2 years ago

What role do regulatory requirements such as GDPR / ePrivacy in Europe play in the solution discovery & design from your perspective? That is one aspect I rarely read about in these proposals, yet I believe that this should be an integral part of the problem definition and solution design.

Looking at IPA specifically for example, I believe that data protection authorities might categorize the match key as personal data (https://gdpr.eu/eu-gdpr-personal-data/) and storing it on the user's device would therefore require user consent. Would you just accept that as a given, or could solutions be more tailored towards regulatory requirements (in a sense: try to discover solutions that do not require user consent, so as not to end up modeling 30% of conversions that are lost due to tracking opt-outs).

benjaminsavage commented 2 years ago

What role do regulatory requirements such as GDPR / ePrivacy in Europe play in the solution discovery & design from your perspective? That is one aspect I rarely read about in these proposals, yet I believe that this should be an integral part of the problem definition and solution design.

@alextcone and @darobin shared some of their thoughts about this topic in another thread:

@alextcone's comment: https://github.com/patcg/proposals/issues/5#issuecomment-1034987044

@darobin's comment: https://github.com/patcg/proposals/issues/5#issuecomment-1035273304

AramZS commented 2 years ago

Noting that the group has chosen to pick up private measurement and the associated proposals as part of its initial focus. This proposal will be considered as part of that process. Accordingly, I am moving this issue to the stand alone repo created to manage that conversation.

csharrison commented 2 years ago

I reread the privacy budgeting section of IPA and I have a small concern. It seems like the proposed privacy "grain" / unit is site x user but we implement that via a grain of site x match key. However, match keys and users are not the same and this seems like it is possibly abusable.

For a simple example say you are targeting a single sensitive user across 3 devices. The identity provider could intentionally use a separate match key for each device, and send separate queries for each match key that consume budgets independently for each query, with the intention of leaking as much about this user as possible. Note that by doing this we sort of eliminate the possibility of true cross-device attribution though.
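
A minimal sketch of why a site x match key grain can fall short of a site x user grain. The `BudgetLedger` class, the limit of 1.0 and the per-query cost are assumptions made for illustration; none of them come from the IPA document.

```python
from collections import defaultdict


class BudgetLedger:
    """Tracks spent privacy budget per (site, match_key) pair."""

    def __init__(self, limit):
        self.limit = limit
        self.spent = defaultdict(float)

    def try_spend(self, site, match_key, epsilon):
        """Deduct epsilon if this (site, match_key) pair can still afford it."""
        key = (site, match_key)
        if self.spent[key] + epsilon > self.limit + 1e-12:
            return False
        self.spent[key] += epsilon
        return True


def total_epsilon_for_user(ledger, site, match_keys, epsilon_per_query):
    """Total budget that can be spent on one user who holds several match keys."""
    total = 0.0
    for match_key in match_keys:
        # Keep querying this match key until its own budget runs out.
        while ledger.try_spend(site, match_key, epsilon_per_query):
            total += epsilon_per_query
    return total


one_key = total_epsilon_for_user(BudgetLedger(1.0), "shop.example", ["user123"], 0.25)
three_keys = total_epsilon_for_user(
    BudgetLedger(1.0), "shop.example", ["user123-A", "user123-B", "user123-C"], 0.25
)
```

Under these assumptions, one match key allows 1.0 of budget to be spent on the user, while one match key per device across three devices allows 3.0.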

Making match keys easy to swap out on a single device (possibly across apps) makes this attack worse, but that seems fixable with OS-level support. The best I can see this proposal achieving (in the worst case) is site x user x device privacy.

LMK if I am missing something here though.

csharrison commented 2 years ago

Making match keys easy to swap out on a single device (possibly across apps) makes this attack worse, but that seems fixable with OS-level support. The best I can see this proposal achieving (in the worst case) is site x user x device privacy.

Actually, I am not sure this is easy. I think we would need to make sure that two "different entities" (e.g. facebook.com and facebook2.com) don't try to measure the same events with two different match keys (user123-A and user123-B), to double up their counts. In other words I think match key consistency needs to be 1:1 with websites.

martinthomson commented 2 years ago

This is an attack we've discussed (though we might not have captured it in documentation). You need two controls:

  1. The user agent cannot act on a change to a match key until the start of a new epoch. (Other rate limiting is possible, but this is easiest.) This prevents the identifier from being swapped out on a per-request basis.
  2. Sites need to commit to a set of match key providers so that sites can't switch out providers in the same way (to your second point). It might be acceptable here for each user agent to independently enforce this, though this means that you really end up with (user+device/agent) x site as your grain. Ideally, some sort of consistency system would be used to get back a true user+site grain; it's possible the helpers can play some role in guaranteeing that.

csharrison commented 2 years ago

Yes (1) makes sense and I think it's specified in the doc. (2) doesn't fully cover the second comment though. I could ask all my partner sites to pre-commit to use fb1.com / fb2.com as identity providers without doing any kind of rotation / switching. Then I just double up all my source / trigger events (one per each match key provider) and get effectively double counts for the same privacy budget.

I think what you might have to do (and maybe this is what you were implying with (2)) is that if you pre-commit to N match key providers, you would effectively have more noise added to your aggregations scaled with N to keep similar privacy guarantees and get closer to the user+site grain.
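
A rough sketch of the noise scaling suggested above, assuming the aggregate is protected with the Laplace mechanism (an assumption; the thread does not name a noise mechanism). The function name `laplace_scale` and its parameters are hypothetical.

```python
def laplace_scale(per_key_cap, num_providers, epsilon):
    """Laplace noise scale b = sensitivity / epsilon.

    Assumes one user can contribute up to per_key_cap through each of
    num_providers independent match keys, so the user-level sensitivity
    is per_key_cap * num_providers.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if num_providers < 1:
        raise ValueError("need at least one match key provider")
    user_sensitivity = per_key_cap * num_providers
    return user_sensitivity / epsilon


single = laplace_scale(per_key_cap=1.0, num_providers=1, epsilon=0.5)
double = laplace_scale(per_key_cap=1.0, num_providers=2, epsilon=0.5)
```

Here committing to two providers doubles the noise scale relative to one provider for the same epsilon, which is the cost the comment above describes.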

eriktaubeneck commented 2 years ago

In the documentation, we originally proposed providing match_key_provider as an argument in the generateSourceEvent and generateTriggerEvent functions. However, this is vulnerable to the attack @csharrison describes.

@martinthomson's suggestion of committing to a set {match_key_provider_1, ...} would be for every site. Then, when calling generateSourceEvent or generateTriggerEvent, all of the match keys that site had committed to would be included. (We'd want to be able to cap this to some reasonable number N.)

Later, in the privacy budget management step, we'd assure that both:

  1. Every individual match key can only contribute up to L1 to the aggregation.
  2. Every individual match key provided has the consumed amount deducted from its privacy budget.

I believe this should allow for properly preventing the attack you describe, without needing to scale the noise by a factor of N.
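
The two checks above can be sketched in plaintext as follows. This is only an illustration: in IPA the work would happen inside the MPC, and the names `capped_sum` and `charge_all`, the event layout, and the initial budget of 1.0 are assumptions.

```python
from collections import defaultdict


def capped_sum(events, cap):
    """Sum event values so that no single match key contributes more than cap.

    events is a list of (match_keys, value) pairs, where match_keys is the
    full tuple of match keys carried by the event.
    """
    used = defaultdict(float)
    total = 0.0
    for match_keys, value in events:
        if not match_keys:
            continue
        # The event is limited by the tightest remaining allowance of its keys.
        allowance = min(cap - used[mk] for mk in match_keys)
        contribution = max(0.0, min(value, allowance))
        for mk in match_keys:
            used[mk] += contribution
        total += contribution
    return total


def charge_all(budget_left, match_keys, epsilon, initial=1.0):
    """Deduct epsilon from every match key's budget, or from none of them."""
    for mk in match_keys:
        budget_left.setdefault(mk, initial)
    if any(budget_left[mk] < epsilon for mk in match_keys):
        return False
    for mk in match_keys:
        budget_left[mk] -= epsilon
    return True


# Events that carry both match keys share one cap ...
joined = capped_sum([(("fb1:u", "fb2:u"), 3.0), (("fb1:u", "fb2:u"), 3.0)], cap=5.0)
# ... while events split per provider would each get a cap of their own.
split = capped_sum([(("fb1:u",), 3.0), (("fb2:u",), 3.0)], cap=5.0)
```

With both match keys attached to every event the user contributes 5.0 in total; with the events split per provider the same user would contribute 6.0, which is the doubling the commitment is meant to prevent.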

benjaminsavage commented 2 years ago

I could ask all my partner sites to pre-commit to use fb1.com / fb2.com as identity providers without doing any kind of rotation / switching. Then I just double up all my source / trigger events (one per each match key provider) and get effectively double counts for the same privacy budget.

Like @eriktaubeneck says, I think the idea is that if you want to use multiple match-key providers, that's fine, but each event will contain ALL of them. You don't get to pick and choose. I think that eliminates this attack.

If you ask all your partner sites to pre-commit to use fb1.com / fb2.com as identity providers, then each time they generate an IPA event, it'll use both match keys (if present, defaulting to a randomly generated but stable-per-device value if not present).
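
A minimal sketch of that behaviour from the user agent's side. The `UserAgent` class and its method names are hypothetical, the match keys are left unencrypted for readability, and using a single 64-bit fallback value for every missing provider is an assumption.

```python
import secrets


class UserAgent:
    """Toy model of a browser/OS that stores match keys and site commitments."""

    def __init__(self):
        self.match_keys = {}   # provider -> match key set by that provider
        self.commitments = {}  # site -> tuple of providers the site committed to
        # Random value generated once, so it is stable for this device.
        self._device_key = secrets.token_hex(8)

    def set_match_key(self, provider, match_key):
        self.match_keys[provider] = match_key

    def commit(self, site, providers):
        """Record a site's provider set; later changes are rejected."""
        if site in self.commitments:
            raise ValueError(f"{site} has already committed to a provider set")
        self.commitments[site] = tuple(providers)

    def generate_source_event(self, site):
        """Return match keys for ALL committed providers; the site cannot pick."""
        return {
            provider: self.match_keys.get(provider, self._device_key)
            for provider in self.commitments[site]
        }


ua = UserAgent()
ua.set_match_key("fb1.com", "user123-A")
ua.commit("shop.example", ["fb1.com", "fb2.com"])
event = ua.generate_source_event("shop.example")
```

In this sketch `event` always has an entry for both fb1.com and fb2.com, and the fb2.com entry falls back to the device value because no match key was set for it.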

csharrison commented 2 years ago

Thank you @eriktaubeneck and @benjaminsavage that indeed resolves the concern. I was assuming we could have separate calls like:

generateSourceEvent("fb1.com");
generateSourceEvent("fb2.com");

And having the match keys unioned during attribution would be optional, not required. If that's not possible then it effectively means that sites are forced into an "identity union" model when working with multiple identity providers / third parties, which is an interesting consequence. You could imagine ad-tech to ad-tech abuse where one ad-tech starts using a uniform match key across all users to mess up another ad-tech's measurement on the same site. Would need to think more if this is in any way a practical concern.

benjaminsavage commented 2 years ago

You're totally right @csharrison. My thinking is that given each site owner can independently choose which match-key providers they work with, any provider who does something nasty like choosing a uniform value of the match key, will instantly lose trust and develop a bad reputation - leading to nobody using them again.

eriktaubeneck commented 2 years ago

@csharrison another approach would be to structure the inclusion of matchkeys in the report as a key:value of provider:matchkey, instead of just a set of matchkeys, i.e.

{
    "provider1.com": matchkey_1,
    "provider2.com": matchkey_2,
    ...
}

Then, at query time, you could tell the aggregators: "only join on provider1.com". The aggregators could respect that choice in the joining, but still account for budgeting against the full set of matchkeys.

This still wouldn't fully solve the abuse scenario, however, because if someone were to simply set a uniform matchkey, that would likely still disrupt the budget accounting and contribution capping (in which case you'd still need reputational effects which @benjaminsavage proposes.)
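
A plaintext sketch of the provider:matchkey idea, where the join uses one provider but budgeting sees every matchkey in the reports. The function name `attribute` and the report layout are assumptions, and real IPA would perform the join on secret-shared data rather than on readable values.

```python
def attribute(source_events, trigger_events, join_provider):
    """Join on one provider's matchkey, but report every matchkey for budgeting.

    Each event is a dict mapping provider -> matchkey. Returns the number of
    trigger events matched to a source event, plus the set of
    (provider, matchkey) pairs whose budgets should be charged.
    """
    source_keys = {e[join_provider] for e in source_events if join_provider in e}
    matched = 0
    to_charge = set()
    for event in source_events:
        to_charge.update(event.items())
    for event in trigger_events:
        if join_provider in event and event[join_provider] in source_keys:
            matched += 1
        # Budget is accounted against the full set, not just the join key.
        to_charge.update(event.items())
    return matched, to_charge


sources = [{"provider1.com": "a1", "provider2.com": "a2"}]
triggers = [
    {"provider1.com": "a1", "provider2.com": "zz"},
    {"provider1.com": "b1", "provider2.com": "a2"},
]
matched, to_charge = attribute(sources, triggers, "provider1.com")
```

Only the first trigger joins on provider1.com, yet all four distinct (provider, matchkey) pairs end up in `to_charge`.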

csharrison commented 2 years ago

Another question for the IPA proposal. The document mentions it should be possible for third parties to make requests on behalf of other 1Ps. I agree this is a good feature. One attack I didn't see mentioned is malicious parties crafting fake data in the hope of stealing budget from the 1P, by pretending to query on behalf of the 1P.

There are many mitigations for this, but it would be good to spell them out. The most obvious one is that if the match key space is high entropy enough, this is just straight up difficult. However, I don't know if we want to design something more robust such that e.g. 1Ps need to attest to working with certain 3Ps up front.

eriktaubeneck commented 2 years ago

@csharrison agreed that this is underspecified, and this would be a great area to get more clarity on.

[Administrative side note, I opened a request to get a repo specifically for IPA so we can have issues dedicated to topics, and even put together pull requests for docs outlining more details in these areas as they emerge.]

A few thoughts specific to this question:

One attack I didn't see mentioned is malicious parties crafting fake data in the hope of stealing budget from the 1P, by pretending to query on behalf of the 1P.

In the case where the 3P has actual source_events and trigger_events, this could be possible without even generating fake data. We allow for individual events to be used in multiple queries, within the privacy budget, so this could be used to exhaust it. In this case, I don't think that making the match key space high entropy would actually work.

In the case where the 3P doesn't have actual events, but is just trying to disrupt some 1P's budget, the high entropy match key space would work.

I don't know if we want to design something more robust such that e.g. 1Ps need to attest to working with certain 3Ps up front.

In the first scenario, it should be possible for the 1P to prevent a 3P from getting actual events by not sending them or installing their "pixel" code.

In the second scenario, it seems like a high entropy match key is enough (say 64-bit) where it would be far too expensive to run a query that would actually have meaningful impact. Let's suppose (very conservatively) that it only takes 1ms to generate a fake event - to cover 0.4% (1/256) of the space it would take over 2M years of compute time to generate all those events. And that's not even starting to think about actually running that query...
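
The estimate above can be checked with a few lines of arithmetic. The constant names are illustrative, and a year is taken as 365.25 days.

```python
MATCH_KEY_BITS = 64          # size of the match key space, as in the example above
FRACTION_DENOMINATOR = 256   # cover 1/256 (about 0.4%) of the space
SECONDS_PER_EVENT = 0.001    # 1 ms to generate one fake event
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

events_needed = 2 ** MATCH_KEY_BITS // FRACTION_DENOMINATOR  # 2**56 events
years = events_needed * SECONDS_PER_EVENT / SECONDS_PER_YEAR

print(f"{events_needed:.3e} events, about {years / 1e6:.2f} million years")
```

This comes out at roughly 2.3 million years of sequential compute, consistent with the "over 2M years" figure.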

That said, if a 1P wants to work with more than one 3P, then we do probably need a way for that 1P to assign specific portions of its budget across those different 3Ps, which may necessitate the attestation design you mention.
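
One way such a split might look, as a sketch only: the `DelegatedBudget` class, the fractional shares and the example domains are all hypothetical, and how a 1P would actually attest to these shares is not specified in the thread.

```python
class DelegatedBudget:
    """A 1P's privacy budget, divided up front among the 3Ps it works with."""

    def __init__(self, total_epsilon, shares):
        # shares maps a third party to the fraction of the budget it may use.
        if sum(shares.values()) > 1 + 1e-9:
            raise ValueError("shares must not add up to more than 1")
        self.remaining = {tp: total_epsilon * f for tp, f in shares.items()}

    def spend(self, third_party, epsilon):
        """Allow a query only if this 3P was named and has enough budget left."""
        left = self.remaining.get(third_party)
        if left is None or epsilon > left + 1e-12:
            return False
        self.remaining[third_party] = left - epsilon
        return True


budget = DelegatedBudget(1.0, {"adtech-a.example": 0.75, "adtech-b.example": 0.25})
first = budget.spend("adtech-a.example", 0.5)      # within its 0.75 share
second = budget.spend("adtech-a.example", 0.5)     # would exceed its share
stranger = budget.spend("unknown.example", 0.1)    # never named by the 1P
```

Under this model a 3P that the 1P never named cannot consume any budget, and one 3P exhausting its share leaves the other's share untouched.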

csharrison commented 2 years ago

[Administrative side note: I opened a request (https://github.com/patcg-individual-drafts/admin/issues/1) to get a repo specifically for IPA so we can have issues dedicated to individual topics, and even put together pull requests for docs outlining more details in these areas as they emerge.]

Thanks, yeah. This one issue is getting very cumbersome haha.

In the first scenario, it should be possible for the 1P to prevent a 3P from getting actual events by not sending them or installing their "pixel" code.

It's worth thinking through this scenario to see if we could detect / tolerate this. As far as I understand things, it is notoriously difficult for 1Ps to make configuration changes on their sites, so if we are relying on that to deter cheaters it's not ideal.

In the first scenario, it should be possible for the 1P to prevent a 3P from getting actual events by not sending them or installing their "pixel" code.

Hm, this made me look back at the IPA doc to see how privacy budgeting is done. For a given report in the MPC system, how do we know the site it is associated with for purposes of budget? The issue I am hoping we can avoid is something like a re-randomization attack where a 3P gets real events for advertiser A but can somehow use the match key to steal budget from advertiser B while having the report look new. I think we need to make sure the budget keys are not tamperable basically.

I think I agree with you about the high entropy protecting us a great deal from the "guessing" attack. If we can show that's the worst an adversary can do I might be comfortable with it.

eriktaubeneck commented 2 years ago

It's worth thinking through this scenario to see if we could detect / tolerate this. As far as I understand things, it is notoriously difficult for 1Ps to make configuration changes on their sites, so if we are relying on that to deter cheaters it's not ideal.

I agree if we're talking about a cheater using information from site A to impact something about site B. However, if site A is willing to give a "cheater" the ability to execute JS on their site, how can we prevent anything beyond that?

For a given report in the MPC system, how do we know the site it is associated with for purposes of budget?

This is a good question. I have a few ideas here, but there are some tradeoffs. I'll open an issue for this specifically once the other repo is created.

csharrison commented 2 years ago

I agree if we're talking about a cheater using information from site A to impact something about site B. However, if site A is willing to give a "cheater" the ability to execute JS on their site, how can we prevent anything beyond that?

Great point. There might be some nuance here with iframes, but even still there is a detection problem that we should think through. This goes back to the general problem of supporting multiple separate reporting origins, though, which we should flesh out. It ends up being a complicated coordination problem (and a possible denial-of-service vector) if everyone has to share a single budget. Needing to involve the advertiser in it makes this even tougher.

One more question about IPA behavior. I want to confirm that attribution across multiple queries works correctly. Here's an example: imagine there's a single advertiser selling a single product, and is running 3 different campaigns for it. They want to use IPA to measure the relative performance of their 3 campaigns. This is my understanding of how this would work in IPA.

The advertiser will send 3 queries to the system:

  1. {Campaign1's source events, all trigger events}
  2. {Campaign2's source events, all trigger events}
  3. {Campaign3's source events, all trigger events}

If IPA treats these queries completely independently, then attribution does not take into account source events from separate queries. That is, a hypothetical user journey like {Campaign1 source event, Campaign2 source event, Campaign3 source event, trigger} will end up contributing a count to each of the three queries above, causing double counting.

One way to make this work would be to first run "global attribution" with the union of the events in all the queries, and then evaluate each query separately from the pool of globally attributed sources/triggers. I couldn't tell if this was how the protocol was intended to work though.
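The difference between independent queries and "global attribution" can be made concrete with a toy example. This sketch assumes last-touch attribution and a made-up event layout `(timestamp, kind, campaign)`. Neither is taken from the IPA doc, and nothing here models the MPC.

```python
from collections import Counter

# Hypothetical events for a single user: (timestamp, kind, campaign).
journey = [
    (1, "source", "Campaign1"),
    (2, "source", "Campaign2"),
    (3, "source", "Campaign3"),
    (4, "trigger", None),
]


def last_touch(events):
    """Attribute each trigger to the most recent preceding source event."""
    counts = Counter()
    last_source = None
    for _, kind, campaign in sorted(events, key=lambda e: e[0]):
        if kind == "source":
            last_source = campaign
        elif last_source is not None:
            counts[last_source] += 1
    return counts


campaigns = ["Campaign1", "Campaign2", "Campaign3"]

# Independent queries: each sees its own campaign's sources plus all triggers.
independent = Counter()
for c in campaigns:
    subset = [e for e in journey if e[1] == "trigger" or e[2] == c]
    independent += last_touch(subset)

# Global attribution: attribute once over the union, then break out by campaign.
global_counts = last_touch(journey)

print(dict(independent))    # every campaign is credited with the conversion
print(dict(global_counts))  # only the last-touch campaign is credited
```

With independent queries the single trigger is counted once per campaign (three conversions in total), while attributing over the union credits it exactly once.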

benjaminsavage commented 2 years ago

Another question for the IPA proposal. The document mentions it should be possible for third parties to make requests on behalf of other 1Ps. I agree this is a good feature. One attack I didn't see mentioned is malicious parties crafting fake data in the hope of stealing budget from the 1P, by pretending to query on behalf of the 1P.

Here's how I've been thinking about this:

When a report collector makes an IPA query, it will cost them some amount of money. You have to pay the MPC helper nodes for the compute you use. This implies the existence of some kind of registration process whereby a site / app signs up to run IPA queries, proves ownership of the app / site, and inputs an associated payment instrument.

So I am assuming all IPA queries will be authenticated server-to-server calls. Authentication parameters must be provided to run the query. As such, it should be impossible for anyone but the 1st party or their legitimate delegate to run queries. If a delegate abuses their permissions, the 1st party should be able to revoke their permission to run IPA queries on their behalf.
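A minimal sketch of that flow, under loud assumptions: the `QueryRegistry` class, the HMAC-over-shared-secret scheme, and the example domain names are all hypothetical stand-ins for whatever registration and authentication mechanism the helper nodes would actually use.

```python
import hashlib
import hmac
import secrets


class QueryRegistry:
    """Hypothetical helper-side record of who may run queries for which site."""

    def __init__(self):
        self._keys = {}       # caller -> secret key issued at registration
        self._delegates = {}  # site -> set of callers allowed to query for it

    def register(self, caller):
        """Sign up a caller and hand back its secret key."""
        key = secrets.token_bytes(32)
        self._keys[caller] = key
        return key

    def grant(self, site, delegate):
        self._delegates.setdefault(site, set()).add(delegate)

    def revoke(self, site, delegate):
        self._delegates.get(site, set()).discard(delegate)

    def authorize(self, caller, site, query, tag):
        """Accept a query only if it is authentic and the caller may act for site."""
        key = self._keys.get(caller)
        if key is None:
            return False
        expected = hmac.new(key, query, hashlib.sha256).digest()
        if not hmac.compare_digest(expected, tag):
            return False
        return caller == site or caller in self._delegates.get(site, set())


def sign(key, query):
    return hmac.new(key, query, hashlib.sha256).digest()


registry = QueryRegistry()
registry.register("advertiser.example")
delegate_key = registry.register("measurement.example")
query = b"attribution query for advertiser.example"
tag = sign(delegate_key, query)

registry.grant("advertiser.example", "measurement.example")
allowed_before = registry.authorize("measurement.example", "advertiser.example", query, tag)
registry.revoke("advertiser.example", "measurement.example")
allowed_after = registry.authorize("measurement.example", "advertiser.example", query, tag)
print(allowed_before, allowed_after)
```

The delegate's query is accepted while the grant is in place and rejected after revocation, and an unregistered party cannot produce a valid tag at all.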

benjaminsavage commented 2 years ago

One more question about IPA behavior. I want to confirm that attribution across multiple queries works correctly. Here's an example: imagine there's a single advertiser selling a single product, and is running 3 different campaigns for it. They want to use IPA to measure the relative performance of their 3 campaigns. This is my understanding of how this would work in IPA.

The advertiser will send 3 queries to the system:

  1. {Campaign1's source events, all trigger events}
  2. {Campaign2's source events, all trigger events}
  3. {Campaign3's source events, all trigger events}

If IPA treats these queries completely independently, then attribution does not take into account source events from separate queries. That is, a hypothetical user journey like {Campaign1 source event, Campaign2 source event, Campaign3 source event, trigger} will end up contributing a count to each of the three queries above, causing double counting.

One way to make this work would be to first run "global attribution" with the union of the events in all the queries, and then evaluate each query separately from the pool of globally attributed sources/triggers. I couldn't tell if this was how the protocol was intended to work though.

In the event an advertiser wants to evaluate the relative performance of 3 campaigns (which they might have purchased from different ad-sellers) I assume that they would NOT issue three separate queries as you’ve shown. This would wind up hitting their differential privacy budget three times for the same set of trigger events. They’d be far better off running a single query with all of the source events from all three campaigns, and all of the trigger events. This would make much better use of their budget, as well as enable “global attribution”, where we can avoid double counting.
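To illustrate the budget argument, here is a rough sketch using the standard Laplace mechanism, whose noise scale is sensitivity divided by epsilon. The total budget, the sensitivity of 1 (each user contributing at most one attributed conversion), and simple additive budget accounting are all assumptions for illustration. The IPA doc may account for budget differently.

```python
def laplace_scale(sensitivity: float, epsilon: float) -> float:
    """Noise scale b of the Laplace mechanism: b = sensitivity / epsilon."""
    return sensitivity / epsilon


TOTAL_EPSILON = 3.0  # hypothetical budget available for this set of trigger events
SENSITIVITY = 1.0    # assumes each user adds at most one attributed conversion

# Three per-campaign queries over the same trigger events share the budget,
# so each query can only spend a third of it.
split_scale = laplace_scale(SENSITIVITY, TOTAL_EPSILON / 3)

# One query over all source events: each conversion lands in exactly one
# campaign bucket, so the whole budget can be spent on a single release.
single_scale = laplace_scale(SENSITIVITY, TOTAL_EPSILON)

print(split_scale, single_scale)
```

Under these assumptions, each per-campaign count in the single query carries a third of the noise it would carry if the three campaigns were queried separately.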

To be clear, I understand this is a significant departure from how things work today. Today Facebook ads manager shows just an FB view of things. In an IPA world, it would not be possible to show them this. It wouldn’t be an efficient use of their privacy budget. It would be much more similar to the mobile app ecosystem where advertisers utilize 3rd party “mobile measurement partners” that give them a unified view across all their ad buying channels, preferring to view reporting there and eschewing platform-specific reporting channels.

csharrison commented 2 years ago

In the event an advertiser wants to evaluate the relative performance of 3 campaigns (which they might have purchased from different ad-sellers) I assume that they would NOT issue three separate queries as you’ve shown. This would wind up hitting their differential privacy budget three times for the same set of trigger events. They’d be far better off running a single query with all of the source events from all three campaigns, and all of the trigger events. This would make much better use of their budget, as well as enable “global attribution”, where we can avoid double counting.

I think I might be missing something. Is this use-case possible to achieve with IPA:

imagine there's a single advertiser selling a single product, and is running 3 different campaigns for it. They want to use IPA to measure the relative performance of their 3 campaigns.

i.e. I want a break-out that says:

Campaign1: led to 10 conversions
Campaign2: led to 15 conversions
Campaign3: led to 150 conversions

My thought from the doc was this is accomplished via carefully sending relevant source events, but it seems like there is some other way this should be done. Here is the relevant piece from the doc:

Note that source.example can use its own context and the context provided by trigger sites to group these queries into relevant sets. For example, if the source reports were a set of ad impressions, source.example could choose to run a query for a specific campaign, and only include trigger reports for items relevant to that campaign.

Now that is specific to a source query, but I assumed you'd do the same for trigger queries like the one I described.

eriktaubeneck commented 2 years ago

I think I might be missing something. Is this use-case possible to achieve with IPA:

imagine there's a single advertiser selling a single product, and is running 3 different campaigns for it. They want to use IPA to measure the relative performance of their 3 campaigns.

i.e. I want a break-out that says:

Campaign1: led to 10 conversions
Campaign2: led to 15 conversions
Campaign3: led to 150 conversions

My thought from the doc was this is accomplished via carefully sending relevant source events, but it seems like there is some other way this should be done. Here is the relevant piece from the doc:

Note that source.example can use its own context and the context provided by trigger sites to group these queries into relevant sets. For example, if the source reports were a set of ad impressions, source.example could choose to run a query for a specific campaign, and only include trigger reports for items relevant to that campaign.

Now that is specific to a source query, but I assumed you'd do the same for trigger queries like the one I described.

Our wording in the doc may not have been super clear - there are two different cases to consider here.

The first case is the one you mention: you would want to issue a single query with all the events. It would be something like the following SQL query:

select
    source_events.campaign_id
  , count(trigger_events.event_id)
  , sum(trigger_events.value)
from
    source_events
    join trigger_events
    on <matchkeys and attribution logic>
group by
    source_events.campaign_id

The second case is where there are multiple distinct products involved, such as:

{
    (campaign_1a, campaign_1b, ...) : product_1,
    (campaign_2a, campaign_2b, ...): product_2,
    ...
}

In this case, since these queries can be constructed entirely independently, the advertiser running the query should be able to bifurcate them appropriately and run the same query as above, without having an effect on the results. In that case, having less data should be more efficient, and also not exhaust unnecessary privacy budget. It would also prevent the need for more complicated attribution logic in the MPC (since you'd only want attribution within that appropriate mapping.)
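The claim that splitting by product does not change the results can be checked on toy data. This sketch assumes last-touch attribution, plaintext match keys, and made-up tuple layouts for events. It says nothing about how the MPC would actually compute this.

```python
from collections import defaultdict

# Hypothetical mapping and events, for illustration only.
CAMPAIGN_TO_PRODUCT = {
    "campaign_1a": "product_1",
    "campaign_1b": "product_1",
    "campaign_2a": "product_2",
}
sources = [  # (match_key, timestamp, campaign_id)
    ("u1", 1, "campaign_1a"),
    ("u1", 2, "campaign_2a"),
    ("u2", 1, "campaign_1b"),
]
triggers = [  # (match_key, timestamp, product, value)
    ("u1", 3, "product_1", 20),
    ("u1", 4, "product_2", 5),
    ("u2", 5, "product_1", 10),
]


def run_query(sources, triggers, same_product_only=False):
    """Last-touch attribution, then (count, sum of value) per campaign."""
    totals = defaultdict(lambda: [0, 0])
    for key, t_ts, product, value in triggers:
        candidates = [
            (s_ts, campaign)
            for s_key, s_ts, campaign in sources
            if s_key == key and s_ts < t_ts
            and (not same_product_only or CAMPAIGN_TO_PRODUCT[campaign] == product)
        ]
        if candidates:
            _, campaign = max(candidates)  # most recent matching source
            totals[campaign][0] += 1
            totals[campaign][1] += value
    return {c: tuple(v) for c, v in totals.items()}


# One combined query needs product-aware attribution logic.
combined = run_query(sources, triggers, same_product_only=True)

# Splitting the events by product up front lets each query use the simple logic.
merged = {}
for product in sorted(set(CAMPAIGN_TO_PRODUCT.values())):
    s = [e for e in sources if CAMPAIGN_TO_PRODUCT[e[2]] == product]
    t = [e for e in triggers if e[2] == product]
    merged.update(run_query(s, t))

print(combined == merged)
```

The per-product queries each touch fewer events and need no product check inside the attribution step, yet their merged output matches the combined query.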

csharrison commented 2 years ago

Thanks @eriktaubeneck, I think I missed the piece where we can annotate source events by their relevant campaign ID. I wasn't sure if that was supported.

csharrison commented 2 years ago

I guess I will follow up: how much extra information can we pack into the events? One of the benefits of creating queries as a "bag of relevant events" is that we can use arbitrarily complex information to structure the queries. Once the splitting has to happen within the protocol, though, it becomes harder, especially with MPC. Could you imagine us supporting many dimensions of features beyond campaign IDs in IPA?