mozilla / standards-positions

https://mozilla.github.io/standards-positions/
Mozilla Public License 2.0

Private Aggregation API #805

Open alexmturner opened 1 year ago

alexmturner commented 1 year ago

Request for Mozilla Position on an Emerging Web Specification

Other information

This proposal introduces a generic mechanism for measuring aggregate, cross-site data in a privacy preserving manner. This general-purpose API can be called from isolated contexts that have access to cross-site data, i.e. a Shared Storage worklet or Protected Audience (formerly FLEDGE) script runner. Within these contexts, potentially identifying data is encapsulated into "aggregatable reports". To prevent leakage, the cross-site data in these reports is encrypted to ensure it can only be processed by the aggregation service. During processing, this service adds noise and imposes limits on how many queries can be performed.
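As a rough illustration of the processing step described above, here is a minimal sketch of summing contributions per bucket and adding noise. It is not the aggregation service's actual code: the function names, the `(bucket, value)` report shape, the `max_contribution` sensitivity bound, and the choice of Laplace noise are all assumptions made for this example, and encryption and query limits are omitted.

```python
import random
from collections import defaultdict


def sum_contributions(reports):
    """Sum (bucket, value) contributions across all reports, per bucket."""
    totals = defaultdict(int)
    for report in reports:
        for bucket, value in report:
            totals[bucket] += value
    return dict(totals)


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def summarize(reports, epsilon, max_contribution, rng):
    """Return noisy per-bucket sums.

    Assumption: each report changes the totals by at most `max_contribution`
    in L1 norm, so the noise scale is max_contribution / epsilon.
    """
    scale = max_contribution / epsilon
    totals = sum_contributions(reports)
    return {bucket: total + laplace_noise(scale, rng)
            for bucket, total in totals.items()}


# Hypothetical usage: two reports contributing to buckets 1 and 2.
reports = [[(1, 3), (2, 5)], [(1, 4)]]
noisy = summarize(reports, epsilon=1.0, max_contribution=10,
                  rng=random.Random(0))
```

A real service would also need to decide how to treat queried buckets that received no contributions; this sketch only returns buckets that appear in at least one report.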

Note also the earlier request for a position on Shared Storage and request for a position on Protected Audience (then FLEDGE), with a negative position taken on Shared Storage.

martinthomson commented 5 months ago

Just to set aside the questions about negative positions on related work: the harms we perceive in those APIs do not necessarily translate to this work. As you might be aware, we are generally supportive of ways to give sites tools to learn about what people do in the aggregate, provided that they include privacy protections that meet our standards[^1]. This might not meet those standards, but if it is a standalone feature (see below), we should assess it on its own merits.

[^1]: I'll be the first to concede that these standards are not very well documented. But we're still learning ourselves. For instance, the amount of differential privacy noise (or $\epsilon$) that provides effective privacy without destroying the utility of an API remains an open question.

This is much-belated, but I think we probably need to start with a question or two.

For me, the big question is whether you see this API as providing anything significant relative to the attribution work that is ongoing. Obviously, the intent of this API is to enable reporting information from contexts where reporting would otherwise be proscribed (like Protected Audience), but there are some key differences between this and an API that is designed to only serve an attribution use case. To that end, it would be very helpful if the explainer did a compare-and-contrast. From my review of the work, it seems like one of the bigger differences is the lack of directionality: that is, in attribution, each impression points at a trigger site (or a small set of trigger sites).

I also have questions around the privacy design. It seems like you are using differential privacy, but there isn't a lot of definition around budget management. I mean, at all. Though we might be in the state where $\epsilon$ values are chosen by each implementation, we likely still need an understanding of how budgets are managed overall so that it doesn't become a non-standard free-for-all where sites have no idea what they are getting when they invoke APIs.

Similarly, having reports sent on a delay is a necessary consequence of the use of this within the isolated processing that Protected Audience does. However, this creates unavoidable and problematic information leaks. Did you consider alternative designs where every auction produced an encrypted report from the auction? This necessarily requires some means of coordination between auction participants, but I don't see that being an impossible challenge here, even with component auctions.

The above points to another big question that I have, which is whether this is truly a general-purpose mechanism for cross-site aggregation. If you look at IPA or ARA summary reports, those are not inherently limited to use in attribution. But the delays on this API seem entirely motivated by its use in Protected Audience and unnecessary in other contexts (where an aggregatable record could be returned immediately, just like in IPA).

alexmturner commented 3 months ago

While there is certainly some analysis we can do on this API separate from the related proposals, I do think any position on this API will be somewhat inherently linked to the positions on Shared Storage and Protected Audience. That being said, I think there are a range of concerns that could be disentangled, e.g. by evaluating Private Aggregation assuming a “steel man” version of Shared Storage or Protected Audience that addresses some or all of your concerns with those APIs.

Regarding the comparison to attribution reporting, I think the generality of this proposal (in particular when used with Shared Storage) is indeed one of its stronger points and helps it to address a variety of use cases. For example, the API should support reach measurement, which is a problem with a fairly different structure (e.g. no clear directionality) to attribution reporting. Even for attribution reporting, this API could allow for experimentation with attribution models without specific support from the browser, supporting future innovation and reducing ossification risks. I’ll make a note to try and add some additional description to the explainer on this point.

Regarding the privacy design, you’re right that the spec is fairly silent on what the right privacy unit and epsilon should be – although the explainer does describe our implementation. We originally structured the spec this way given the likely difficulty of aligning on these details across implementations. But I agree that it would probably be better to more precisely define this (e.g. to specify a time-based limit, split by reporting site), while still having some smaller set of parameters that are implementation-defined (e.g. the precise epsilon and perhaps the time window used in the privacy unit). I’ve filed an issue to specify this. Note that the lack of directionality does constrain what privacy units are possible/practical.
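To make the "time-based limit, split by reporting site" idea concrete, here is a minimal sketch of one way such a budget could be tracked. It is an illustration only, not the spec's or any browser's algorithm: the class name, the rolling-window approach, and the `limit` and `window_seconds` parameters are hypothetical stand-ins for whatever an implementation would define.

```python
from collections import defaultdict, deque


class ContributionBudget:
    """Per-site contribution budget over a rolling time window (hypothetical)."""

    def __init__(self, limit, window_seconds):
        self.limit = limit              # max total contribution per window
        self.window = window_seconds    # length of the rolling window
        self._spent = defaultdict(deque)  # site -> deque of (timestamp, value)

    def try_spend(self, site, value, now):
        """Record `value` for `site` if it fits in the budget; else refuse."""
        entries = self._spent[site]
        # Drop contributions that have aged out of the window.
        while entries and entries[0][0] <= now - self.window:
            entries.popleft()
        used = sum(v for _, v in entries)
        if used + value > self.limit:
            return False  # would exceed the budget: drop the contribution
        entries.append((now, value))
        return True


# Hypothetical usage: each reporting site has its own independent budget.
budget = ContributionBudget(limit=100, window_seconds=600)
budget.try_spend("a.example", 60, now=0)   # accepted
budget.try_spend("a.example", 50, now=10)  # refused: 60 + 50 > 100
budget.try_spend("b.example", 50, now=10)  # accepted: separate site
```

Keying the budget by reporting site alone reflects the lack of directionality noted above: there is no source/destination pair to scope the privacy unit to.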

Regarding delaying reports, we do have a ‘context ID’ parameter that can currently be set in Shared Storage contexts (and that we plan to allow for Protected Audience sellers). When used, we make the count of reports deterministic (i.e. sending a report even if Private Aggregation is not called in the isolated context) and remove the lengthy delays. Note that there is still a short, fixed delay from the Shared Storage call being invoked to prevent leaks via the Shared Storage running time.
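The "deterministic count" behavior can be sketched as follows. This is a simplified model, not the browser's implementation: the function name and the dictionary report shape are hypothetical, and the short fixed delay is not modeled.

```python
def reports_to_send(contributions, context_id=None):
    """Return the list of reports an isolated context would produce.

    With a context ID, exactly one report is always produced, even if there
    were no contributions (a "null" report), so the report count reveals
    nothing about what happened inside the isolated context. Without one,
    a report is produced only when there is something to report.
    """
    if context_id is not None:
        return [{"context_id": context_id,
                 "contributions": list(contributions)}]
    if contributions:
        return [{"context_id": None, "contributions": list(contributions)}]
    return []


# Hypothetical usage: the same empty operation with and without a context ID.
with_id = reports_to_send([], context_id="example-context")
without_id = reports_to_send([])
```

In the first case one null report is produced; in the second case none is, which is why the count itself can leak information unless reports are delayed.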

It is more difficult to support this for Protected Audience bidders, given that the set of bidders with interest groups in a particular auction is cross-site information. We did consider an approach where we send exactly one report to each bidder origin listed in the auction config; however, the scale of null/empty reports this would create poses a significant performance challenge. It might be more feasible to always use this “deterministic count” mode (with much lower delays) in the other contexts (Shared Storage and Protected Audience sellers). Still, similar null report scale concerns might apply for some use cases.