Open csharrison opened 2 years ago
Thanks for filing this issue @csharrison!
We have been discussing this issue, and our "IPA end-to-end" document is trailing a bit behind our current thinking.
A few thoughts:
In terms of legitimate use-cases related to "lookback windows" - here's how I'm currently thinking about it:
- For "Trigger Fanout Queries", where an advertiser has some conversion events (trigger events) and wants to understand which ads (source events) drove them, I think we can potentially support long lookback windows without enabling this "worst case" style situation. For starters, let's imagine that the "report collector" specifies an "epoch" argument when making an IPA query. This argument would let the helpers know which epoch's privacy budget this query is meant to consume. Furthermore, assuming we have attestation for the "epoch", we could simply ask the helpers to validate that all of the trigger events come from the epoch that was specified in the query. It would be OK to allow "source events" from preceding epochs, within some range that would be limited by how often the helpers rotate their encryption keys (TBD). I know that in the "auto" vertical advertisers like 90-day lookback windows, so that would be the aim. Let's see if we can support that.
This sounds pretty good to me. This could allow limited replay of source events (associated with different triggers across different epochs), but if we bound it reasonably it might be OK (90 days might be pushing it, but I think we can hash that out later). I am a bit curious about what the query model will actually look like for campaigns with long lookback windows, though, since the system will never tell you explicitly when a source event has been "counted" and should not be considered for triggers in future epochs. If we're not careful, this design might actually encourage replay and double counting of sources, even from legit parties. While I would like to recommend that they wait 90 days and then issue one massive query with 90 days of data, that is probably unrealistic :)
- For "Source Fanout Queries", where an app or website that displays ads wants to compute some kind of "calibration" or "experimentation" use-case, counting attributed conversions across all advertisers, I have less clarity. Perhaps you have a suggestion! Perhaps we could just do the inverse: validate that all of the "source events" come from the epoch specified in the query. This would require the Report Collector to "reserve" some privacy budget for a few weeks after the end of an epoch if they wanted to do a long lookback window query, then run a query that uses up the reserved budget once they've received all of the potentially matching trigger events.
This seems reasonable. I think this use-case typically doesn't require super long lookback windows, although I'm not an expert. Part of me would like to see the mechanism unified across the different query types for simplicity, though.
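To make the "limited replay" bound above concrete, here is a rough sketch of how many trigger-epoch queries could legitimately include a single source event under a lookback limit. The function names and the 7-day epoch length are assumptions made for illustration only; nothing in this thread fixes the epoch length.

```python
import math


def lookback_in_epochs(lookback_days: int, epoch_days: int) -> int:
    """Number of preceding epochs needed to cover a lookback window.

    epoch_days is an illustrative assumption, not a value from the proposal.
    """
    return math.ceil(lookback_days / epoch_days)


def query_epochs_for_source(source_epoch: int, lookback_epochs: int) -> list[int]:
    """Epochs whose trigger fanout queries may include this source event.

    A query for epoch q accepts sources from q - lookback_epochs through q,
    so a source from epoch s can appear in queries for s .. s + lookback_epochs.
    """
    return list(range(source_epoch, source_epoch + lookback_epochs + 1))


# A 90-day lookback with hypothetical 7-day epochs spans 13 preceding epochs,
# so one source event could be an input to queries in up to 14 epochs.
lookback = lookback_in_epochs(90, 7)
eligible = query_epochs_for_source(100, lookback)
print(lookback, len(eligible))
```

This is the replay bound in question: the source event can be re-used once per eligible trigger epoch, and the system never reports in which of those epochs (if any) it was actually attributed.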
Speaking of lookback windows, I am actually not sure how lookback windows work in IPA besides for the selection of input events :) I'll file another issue for this.
I've just filed another issue on this topic: https://github.com/patcg-individual-drafts/ipa/issues/16
My thinking all along was that we would bind the encrypted match key to the epoch in which it was generated so that it can't be used outside of that. And then follow the process @benjaminsavage described. These each have their own limitations, but those seem manageable.
We should absolutely bind the epoch to the match key when it is provided by the user agent. I'll open a PR to update the end-to-end doc to that effect.
Exactly how that epoch is used in the query semantics still seems to be an open question. I generally agree with @csharrison that it's ideal if the API is unified and there aren't different semantics for the different query types.
In the current proposal, source and trigger fanout queries are already tied to the source and trigger sites/apps (respectively) for the purpose of budgeting. For any given query, let's call the match keys used for budgeting the "primary match keys". If I'm understanding the above correctly, it seems reasonable that we only need to limit the epoch of the "primary match keys" (while not ignoring some global expiry of all match keys).
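One way to read the "primary match key" idea as a single rule for both query types is sketched below. It is only an illustration under stated assumptions: events are reduced to a `(kind, epoch)` pair, the epoch is taken to be already attested, and `validate_query` and `max_epoch_gap` are hypothetical names rather than anything in the IPA proposal.

```python
def validate_query(events, query_epoch, primary_kind, max_epoch_gap):
    """Check a batch of (kind, epoch) events against one unified rule.

    Primary events must come from exactly query_epoch, since that is the
    epoch whose budget the query consumes. Secondary events may fall within
    max_epoch_gap epochs: earlier for trigger fanout (sources precede
    triggers), later for source fanout (triggers follow sources).
    """
    if primary_kind not in ("source", "trigger"):
        raise ValueError("primary_kind must be 'source' or 'trigger'")
    for kind, epoch in events:
        if kind not in ("source", "trigger"):
            return False
        if kind == primary_kind:
            if epoch != query_epoch:
                return False
        elif primary_kind == "trigger":
            # Secondary events are sources from the same or preceding epochs.
            if not (query_epoch - max_epoch_gap <= epoch <= query_epoch):
                return False
        else:
            # Secondary events are triggers from the same or following epochs.
            if not (query_epoch <= epoch <= query_epoch + max_epoch_gap):
                return False
    return True


# Trigger fanout: triggers pinned to epoch 10, sources may be older.
print(validate_query([("trigger", 10), ("source", 3)], 10, "trigger", 13))
# Source fanout: sources pinned to epoch 10, triggers may arrive later.
print(validate_query([("source", 10), ("trigger", 12)], 10, "source", 13))
```

The global expiry mentioned above is folded into `max_epoch_gap` here; in practice it would presumably be tied to how long helpers keep their decryption keys.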
This opens a few questions (which seem to be policy questions, not technical questions):
I think if the epoch is bound to the match key then my opinion is to err on the side of more flexibility for the primary match key, since we know we can do the correct budget enforcement.
I agree with Charlie. The only necessary constraint here is the number of epochs the helpers agree to track budget for. So maybe you don't get to go back 3 years, but a month or 3 or 6 might be fine.
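As a rough sketch of that constraint, the helpers could keep a per-site, per-epoch ledger and refuse any query whose epoch falls outside the window they still track. The class name, the `(site, epoch)` keying, and the numbers used are assumptions for illustration, not part of the current proposal.

```python
class EpochBudgetLedger:
    """Hypothetical helper-side ledger of privacy budget per (site, epoch)."""

    def __init__(self, budget_per_epoch: float, tracked_epochs: int):
        self.budget_per_epoch = budget_per_epoch
        self.tracked_epochs = tracked_epochs
        self.spent = {}  # (site, epoch) -> budget already consumed

    def try_spend(self, site: str, epoch: int, current_epoch: int, amount: float) -> bool:
        """Deduct `amount` from the epoch's budget, or refuse the query."""
        # Only the most recent `tracked_epochs` epochs are still tracked.
        oldest_tracked = current_epoch - self.tracked_epochs + 1
        if not (oldest_tracked <= epoch <= current_epoch):
            return False
        used = self.spent.get((site, epoch), 0.0)
        if used + amount > self.budget_per_epoch:
            return False
        self.spent[(site, epoch)] = used + amount
        return True


# With 26 tracked epochs and the current epoch at 100, epochs 75..100 are
# queryable; anything older is rejected regardless of remaining budget.
ledger = EpochBudgetLedger(budget_per_epoch=1.0, tracked_epochs=26)
print(ledger.try_spend("shop.example", 100, 100, 0.5))
print(ledger.try_spend("shop.example", 74, 100, 0.5))
```

Under this reading, the maximum lookback for the primary match key is simply `tracked_epochs`, which is a policy choice for the helpers rather than a protocol limit.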
In the current doc, timestamps for each event are provided by the report collector with no client-side attestation. This allows a report collector to replay old events by modifying the timestamp accordingly. This attack doesn't necessarily break the desired differential privacy protection, since we'd still bound the impact of any one event per epoch, but it allows some "worst case" style attacks where one sensitive event could get queried again and again, every epoch.
It might be the case that we accept this kind of leakage, but I think it's worth discussing whether we'd want to restrict, in some way, events from the client from participating in an unbounded number of epochs. Note that doing this might break legitimate use-cases ("lookback windows" for attribution that are longer than an epoch), so we should proceed carefully.
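A small sketch of the concern, assuming basic sequential composition (an event that participates in k epochs, each with budget ε, has worst-case cumulative loss kε). The `Submission` type, the `attested_epoch` field, and the numbers are hypothetical; the current doc has no client-side attestation.

```python
from typing import NamedTuple


class Submission(NamedTuple):
    event_id: str
    claimed_epoch: int   # derived from the report collector's own timestamp
    attested_epoch: int  # hypothetical: epoch attested by the client


def worst_case_epsilon(submissions, event_id, per_epoch_epsilon, trust_claimed=True):
    """Worst-case cumulative epsilon for one event under basic composition."""
    epochs = set()
    for s in submissions:
        if s.event_id != event_id:
            continue
        if trust_claimed:
            # No attestation: whatever epoch the report collector claims counts.
            epochs.add(s.claimed_epoch)
        elif s.claimed_epoch == s.attested_epoch:
            # With attestation: forged epochs are dropped.
            epochs.add(s.attested_epoch)
    return per_epoch_epsilon * len(epochs)


# One event, generated in epoch 5, replayed with a forged timestamp in each
# of epochs 5 through 16.
replayed = [Submission("e1", epoch, 5) for epoch in range(5, 17)]
print(worst_case_epsilon(replayed, "e1", 1.0, trust_claimed=True))
print(worst_case_epsilon(replayed, "e1", 1.0, trust_claimed=False))
```

The per-epoch bound holds in both cases; what changes is whether the number of epochs a single event can touch is bounded at all.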