patcg-individual-drafts / private-aggregation-api

Explainer for proposed web platform API
https://patcg-individual-drafts.github.io/private-aggregation-api/

Questions on the private aggregation API and Aggregation Service scalability #115

Closed: nurien2 closed this issue 2 weeks ago

nurien2 commented 5 months ago

Request: We would like more information about the scaling capabilities of the Aggregation Service.

Background: The use case we have in mind is the one from Request for event-level ReportLoss API · Issue #930 · WICG/turtledove, where we would use the Private Aggregation API with the potential trigger described in that issue:

Aha! This sounds like a feature we ought to be able to add. Take a look at the section on Triggering reports — if we added a new trigger that was something like reserved.highest-losing-bid, would it address your needs?

Of course we need to figure out the exact semantics. But a solution that would let you get the information you're looking for out of the Private Aggregation API seems like an excellent goal.
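For concreteness, a minimal sketch of what such a trigger could look like from a bidding worklet, assuming the proposed reserved.highest-losing-bid name (not yet specified anywhere) and the explainer's existing contributeToHistogramOnEvent() entry point:

```ts
// Ambient declaration for the worklet global used in this sketch, per the
// Private Aggregation explainer's Protected Audience extension.
declare const privateAggregation: {
  contributeToHistogramOnEvent(event: string, contribution: object): void;
};

// 'reserved.highest-losing-bid' is the trigger *proposed* in this thread;
// it is not part of the current explainer, and the final name may differ.
privateAggregation.contributeToHistogramOnEvent(
    'reserved.highest-losing-bid',
    {
      bucket: 0x1000n,  // hypothetical bucket encoding a buyer feature slice
      // A plain count here; the explainer's {baseValue, scale, offset}
      // substitution is how the losing bid amount itself could be captured.
      value: 1,
    });
```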

Such a trigger would produce an aggregatable report for every component auction a buyer takes part in. That represents billions of aggregatable reports per hour, at least one order of magnitude above the guidance in https://github.com/privacysandbox/aggregation-service/blob/main/docs/sizing-guidance.md. For the use case described in this issue, we’re investigating performing the aggregation daily, so up to 10^14 reports to be processed, with a set of up to 10^12 pre-declared bucket keys. To represent our feature space sufficiently widely (hence the high number of pre-declared keys above) and reach an acceptable noise level for most buckets, we need to gather a lot of contributions, and ideally avoid applying any sampling strategy.
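As a rough back-of-envelope illustration of the noise constraint (our numbers; the inputs are the documented per-report L1 contribution budget of 2^16 and the service's Laplace noise of scale L1/epsilon, with epsilon currently capped at 64):

```ts
// Back-of-envelope noise estimate. Assumptions (simplified from the
// aggregation service docs): per-report L1 contribution budget of 2^16,
// maximum batch epsilon of 64, Laplace noise of scale L1/epsilon per bucket.
const L1_BUDGET = 65536;   // 2^16
const EPSILON = 64;        // current documented maximum

const laplaceScale = L1_BUDGET / EPSILON;       // = 1024
const noiseStddev = laplaceScale * Math.SQRT2;  // ≈ 1448

// Reports needed per bucket to reach a given signal-to-noise ratio,
// if each report contributes `valuePerReport` to that bucket.
function reportsNeeded(valuePerReport: number, targetSnr = 10): number {
  return Math.ceil((targetSnr * noiseStddev) / valuePerReport);
}

// With value 1 per report, roughly 14,500 reports per bucket are needed
// for SNR 10; spread across up to 10^12 buckets, this is what drives both
// the huge report volume and the reluctance to sample.
console.log(reportsNeeded(1));
```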

On a side note, we have several usages in mind for the Private Aggregation API (see Add new reporting signal script-errors · Issue #494 · WICG/turtledove), targeting different aggregation frequencies (hourly, daily, …). To properly leverage the Aggregation Service (and satisfy the underlying rules described in https://github.com/WICG/attribution-reporting-api/blob/main/AGGREGATION_SERVICE_TEE.md#privacy-considerations), we would need a solution such as the one described in https://github.com/patcg-individual-drafts/private-aggregation-api/blob/main/flexible_filtering.md to be implemented; a sketch of how that could look follows below.
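For illustration, a sketch of how contributions could be tagged per cadence under that proposal, assuming its filteringId field (names and shapes may differ from whatever ships):

```ts
// Ambient declaration for the worklet global used in this sketch.
declare const privateAggregation: {
  contributeToHistogram(contribution: object): void;
};

// Tag each contribution with a filtering ID (per the flexible filtering
// proposal) so hourly and daily aggregation runs each query a disjoint
// slice of reports, satisfying the no-reprocessing rule.
const FILTERING_ID_HOURLY = 1n;
const FILTERING_ID_DAILY = 2n;

privateAggregation.contributeToHistogram({
  bucket: 42n,  // hypothetical bucket for this signal
  value: 1,
  filteringId: FILTERING_ID_DAILY,  // consumed only by the daily batch
});
```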

Questions:

keke123 commented 4 months ago

Hello @nurien2 and thanks for the feedback!

Aggregation Service itself does not put an upper limit on the number of keys or reports in a batch, but a scale of 10^14 reports and 10^12 keys is currently unsupported due to the memory that would be required. Our sizing guidance indicates the ranges we have tested and recommend for optimal performance, given expected load and the supported cloud VM instance types.

We are working on phase 1 of the key discovery proposal, which will allow adtechs to query the aggregation service without pre-declaring keys. In phase 1, adtechs will optionally be able to specify keys to guarantee that they are included in the output; any keys not pre-declared will be thresholded before being included. Note that while this solution helps mitigate the challenge of pre-declaring a large number of keys, it does not fully address supporting a scale of up to 10^12 keys.
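To illustrate the phase 1 behaviour described above (a conceptual reading of the proposal, not its normative text), the output-side filtering would be roughly:

```ts
// Conceptual model only: the real logic runs server-side inside the TEE,
// and the threshold derivation is part of the proposal's privacy analysis.
interface NoisyBucket {
  key: bigint;
  noisyValue: number;  // true aggregate plus Laplace noise
}

function filterPhase1Output(
    buckets: NoisyBucket[],
    preDeclaredKeys: Set<bigint>,
    threshold: number): NoisyBucket[] {
  return buckets.filter(b =>
      // Pre-declared keys are always released (with noise); discovered
      // keys must clear the threshold to appear in the summary report.
      preDeclaredKeys.has(b.key) || b.noisyValue >= threshold);
}
```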

We would like to understand the use case a bit more to explore options for addressing this (e.g. considering batching strategies and flexible filtering, and understanding why sampling isn’t an option). We’re happy to discuss this topic in detail on a WICG call or on this thread. Please let us know how you would prefer to proceed.

Thank you.

alexmturner commented 2 weeks ago

Closing this for now, but please feel free to re-open if you have more feedback.