privacysandbox / aggregation-service

This repository contains instructions and scripts to set up and test the Privacy Sandbox Aggregation Service
Apache License 2.0

Support for Encrypted Intermediates in Aggregation Service: Feedback Requested #77

Open preethiraghavan1 opened 3 weeks ago

preethiraghavan1 commented 3 weeks ago

Current System

Currently, the Aggregation Service aggregates contributions from raw encrypted reports and produces aggregated, noised histograms. When an adtech needs to batch and send the same reports repeatedly for aggregation (e.g., when querying over extended time ranges such as daily, weekly, or monthly with re-querying), recomputing over the same underlying data is computationally expensive, slows down queries, and incurs avoidable costs.

Note that re-querying is in the design phase and is not yet an available feature in the Aggregation Service.

Proposed Enhancement

Introduce the ability to cache the aggregated contributions from decrypted aggregatable reports in un-noised Encrypted Intermediate reports that can be reused in multiple subsequent aggregation jobs. Encrypted Intermediates can reduce computation, and therefore latency and cloud cost, for adtechs. The reduction comes from decrypting and aggregating each report only once and caching the un-noised aggregation (the Encrypted Intermediates) for future use. Subsequent jobs that include the cached Encrypted Intermediates for the same data only need to decrypt and aggregate the new, and fewer, reports, resulting in lower overall latency.
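The property this proposal relies on is that un-noised per-bucket sums can be merged: aggregating a cached partial result with new reports gives the same totals as aggregating all raw reports again. The following is a minimal sketch of that idea in plain Python. It is not the Aggregation Service API; the `Report` shape and the `aggregate` and `merge` helpers are hypothetical, and encryption and noising are left out entirely.

```python
from collections import Counter

# Hypothetical stand-in for a decrypted report: a list of (bucket, value) pairs.
Report = list[tuple[int, int]]


def aggregate(reports: list[Report]) -> Counter:
    """Sum contribution values per bucket. No noise is added here."""
    totals = Counter()
    for report in reports:
        for bucket, value in report:
            totals[bucket] += value
    return totals


def merge(*histograms: Counter) -> Counter:
    """Combine already-aggregated, un-noised histograms bucket by bucket."""
    merged = Counter()
    for histogram in histograms:
        merged.update(histogram)  # Counter.update adds counts
    return merged


day1 = [[(1, 10), (2, 5)], [(1, 3)]]
day2 = [[(2, 7)], [(3, 1)]]

# Computed once and kept (in the proposal, as an Encrypted Intermediate).
cached_day1 = aggregate(day1)

# A later job only aggregates the new reports and merges in the cached result.
from_cache = merge(cached_day1, aggregate(day2))

# Same totals as re-processing every raw report.
from_raw = aggregate(day1 + day2)
assert from_cache == from_raw
```

Noise would only be applied when a final summary report is produced, which is why the cached values have to stay un-noised (and, in the proposal, encrypted).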

Motivating Use Cases

The graph below illustrates how an adtech could benefit from Encrypted Intermediates in their workflow. Daily intermediate reports, built incrementally, feed into further intermediate reports for various cadences and segments. We expect that summary report generation leveraging these encrypted intermediates will lead to significant speed improvements compared to using raw reports directly.

[Figure: Encrypted Intermediates timeline]

Note: The durations are for illustration only; real durations will vary with query size. In general, a query running on intermediates should run faster than one running on the full set of raw reports for the same data.

Design Considerations

Cost Considerations

Whether to use Encrypted Intermediates for a given use case depends on a cost-benefit analysis. Generating Encrypted Intermediates involves processing, encryption, latency, and storage costs. If the savings in processing outweigh these costs, using Encrypted Intermediates is recommended. The cost difference varies by use case: some queries may benefit from Encrypted Intermediates, while others may not. Guidance will be provided to help adtechs make this decision.
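As a rough illustration of that comparison, the sketch below checks whether the one-off cost of building and storing intermediates is recovered across the queries that reuse them. The function name, its parameters, and all of the numbers are hypothetical placeholders in arbitrary cost units; they are not measurements or guidance from the Aggregation Service.

```python
def intermediates_pay_off(
    raw_query_cost: float,
    build_cost: float,
    storage_cost: float,
    intermediate_query_cost: float,
    num_queries: int,
) -> bool:
    """Return True if using intermediates is cheaper over num_queries queries.

    All inputs are in the same arbitrary cost unit (assumed, not measured).
    """
    without_intermediates = raw_query_cost * num_queries
    with_intermediates = (
        build_cost + storage_cost + intermediate_query_cost * num_queries
    )
    return with_intermediates < without_intermediates


# Made-up numbers: a single query does not recover the build cost...
single = intermediates_pay_off(100, 120, 5, 10, num_queries=1)
# ...but three queries over the same data do.
repeated = intermediates_pay_off(100, 120, 5, 10, num_queries=3)
print(single, repeated)
```

With these placeholder values, one query costs 135 with intermediates versus 100 without, while three queries cost 155 versus 300.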

We believe this enhancement can help adtechs avoid repeated, costly computations while improving latency and reducing the overall cost of their jobs. We're interested in your feedback on this idea. In particular:

  1. What use cases do you think Encrypted Intermediates can be useful for?
  2. What kind of batching and report management assistance would be helpful when using this feature?

alois-bissuel commented 3 weeks ago

Thanks for the very interesting proposal.

I was wondering whether this proposal could be useful in the filtering_id case without requerying.

We want to use Shared Storage with Private Aggregation for two very different use cases: one with a low report count (say a maximum of 1 million per day) with a daily aggregation, and the other with a high number of reports (say 200 million reports per hour) with an hourly aggregation, each using a different filtering_id to partition the use cases. Could we leverage this proposal so that the hourly job pre-aggregates the result for the daily one?

preethiraghavan1 commented 3 weeks ago

Thanks @alois-bissuel for your question.

Suppose you have 2 filtering IDs: 1 (= hourly) and 2 (= daily). With Shared Storage, IIUC, you receive about (200 + 1/24) million reports every hour. The hourly job (filtering_id = 1) would process these ~200M reports, but the daily job (filtering_id = 2) would process (200 * 24 + 1) = 4801M reports to aggregate just 1M reports' contributions, filtering the rest out.
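The arithmetic above can be checked with a few lines of Python. This assumes, as the estimate does, that both kinds of reports arrive evenly over the day; the variable names are only illustrative.

```python
from fractions import Fraction

HOURS_PER_DAY = 24
hourly_use_case_per_hour_m = 200       # filtering_id = 1, millions per hour
daily_use_case_per_day_m = 1           # filtering_id = 2, millions per day

# Reports received each hour, in millions: 200 + 1/24.
received_per_hour_m = hourly_use_case_per_hour_m + Fraction(
    daily_use_case_per_day_m, HOURS_PER_DAY
)

# A daily job over raw reports has to read everything received that day.
daily_job_input_m = received_per_hour_m * HOURS_PER_DAY

# Share of that input that actually matches filtering_id = 2.
relevant_share = Fraction(daily_use_case_per_day_m) / daily_job_input_m

print(daily_job_input_m)                    # 4801
print(f"{float(relevant_share):.4%}")       # about 0.02%
```

In other words, roughly 1 in 4801 of the reports read by the daily job contributes to its result.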

You definitely could use Encrypted Intermediates for the daily jobs by generating hourly intermediates.

  1. Intermediate Queries: For each hour, run an Encrypted Intermediate (EI) query with filtering_id = 2 to aggregate data for that specific hour. Store these encrypted intermediate results.
  2. Final Aggregation: At the end of the 24th hour, combine the 24 intermediate reports using a regular query. This final query will process only 1 million reports' worth of aggregated data plus padding. Note that the padding in the final query is determined by the output domain used. This final daily query would operate on a significantly smaller dataset, resulting in lower latency.
  3. Hourly Aggregation: [Existing] Run the regular query with filtering_id = 1 for the hourly job. These are the jobs that are already being run to produce hourly summary reports. This is independent of the intermediate queries in the first step.
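The three steps above can be sketched as a small simulation. Everything here is a toy model under stated assumptions: reports are plain dicts, `aggregate_for_id` and `make_hour` are hypothetical helpers rather than Aggregation Service calls, and encryption, noise, and output-domain padding are omitted.

```python
from collections import Counter


def aggregate_for_id(reports, filtering_id):
    """Sum values per bucket for reports matching filtering_id (no noise)."""
    totals = Counter()
    for report in reports:
        if report["filtering_id"] == filtering_id:
            totals[report["bucket"]] += report["value"]
    return totals


def make_hour(hour):
    """Toy batch: two hourly-use-case reports and one daily-use-case report."""
    return [
        {"filtering_id": 1, "bucket": 100, "value": 2},
        {"filtering_id": 1, "bucket": 101, "value": hour + 1},
        {"filtering_id": 2, "bucket": 7, "value": 1},
    ]


hourly_batches = [make_hour(hour) for hour in range(24)]

# Step 1: one intermediate per hour for filtering_id = 2.
intermediates = [aggregate_for_id(batch, 2) for batch in hourly_batches]

# Step 2: the final daily query only combines the 24 small intermediates.
daily_result = Counter()
for intermediate in intermediates:
    daily_result.update(intermediate)

# Step 3: the existing hourly jobs for filtering_id = 1 run independently.
hourly_summaries = [aggregate_for_id(batch, 1) for batch in hourly_batches]

# Cross-check against a single daily job over all raw reports.
all_reports = [report for batch in hourly_batches for report in batch]
assert daily_result == aggregate_for_id(all_reports, 2)
```

In this toy model the final step touches 24 small histograms instead of every raw report, which is where the lower latency of the daily query would come from.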

You can adjust the frequency of intermediate queries (e.g., every 2 hours) or implement incremental EI queries for continuous aggregation.

Example with Incremental EI Queries:

Note: Without re-querying, the computation is not reduced, but it is distributed. The benefit would be the low latency of the last query, which gives you summary reports sooner.
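One possible reading of "incremental" is that each job folds the newest hour of reports into the previous running intermediate. The sketch below follows that assumption and also counts how many raw reports each job handles, to illustrate the point that the total work is unchanged but spread across jobs. The `fold_hour` helper and the toy data are hypothetical.

```python
from collections import Counter


def fold_hour(running, hour_reports):
    """Return a new running intermediate: previous totals plus this hour."""
    updated = Counter(running)
    for bucket, value in hour_reports:
        updated[bucket] += value
    return updated


# Toy data: 24 hours, each with two (bucket, value) contributions.
hours = [[(7, 1), (8, 2)] for _ in range(24)]

running = Counter()
raw_reports_per_job = []
for hour_reports in hours:
    # Each job reads one hour of raw reports plus the previous intermediate.
    running = fold_hour(running, hour_reports)
    raw_reports_per_job.append(len(hour_reports))

total_incremental = sum(raw_reports_per_job)
total_single_job = sum(len(hour_reports) for hour_reports in hours)

# Same total number of raw reports processed either way...
assert total_incremental == total_single_job
# ...but no single incremental job handles more than one hour's worth.
assert max(raw_reports_per_job) < total_single_job
```

After the last hour, `running` already holds the full day's un-noised totals, so the final summary query has very little left to do.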

To make sure we understand how it all fits together, could you provide more details on your use case? What metrics are you trying to compute for the daily vs. the hourly batches? Do you think Encrypted Intermediates might be useful for lower latency, or for addressing the challenge of processing the 4 billion reports in the TEE for the daily query?

alois-bissuel commented 1 week ago

Sorry for the late answer.

For the record, we want to compute statistics on all the bids made in Protected Audience (the hourly job with a lot of data) and lower-intensity metrics for the daily status of users -- for instance audience size -- for the daily job.

My main concern was processing time for the daily aggregation (for a low number of reports with the corresponding filtering_id). It seems unreasonable to have to process all 4 billion daily reports in one job to get a result for only a small fraction of them. We do not care much about lower latency, though it has added value in itself.

Note that this whole problem would not exist if batching_ids were implemented.

I was wondering whether the hourly aggregation could be used to produce two encrypted intermediates, one for each filtering ID. This would mean that reports would only need to be processed once; I don't know how job runtime (and the need for large executors) scales with the number of records to be processed.
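To make the question concrete, here is a minimal sketch of what "process each report once, emit one intermediate per filtering ID" could look like. Whether the Aggregation Service can or will support this is exactly what is being asked; the `aggregate_by_filtering_id` helper and the dict-shaped reports are hypothetical, and encryption and noise are left out.

```python
from collections import Counter, defaultdict


def aggregate_by_filtering_id(reports):
    """Single pass over the reports, one un-noised histogram per filtering_id."""
    per_id = defaultdict(Counter)
    for report in reports:
        per_id[report["filtering_id"]][report["bucket"]] += report["value"]
    return dict(per_id)


# Toy hourly batch mixing both use cases.
hour_reports = [
    {"filtering_id": 1, "bucket": 100, "value": 2},
    {"filtering_id": 1, "bucket": 100, "value": 3},
    {"filtering_id": 2, "bucket": 7, "value": 1},
]

intermediates = aggregate_by_filtering_id(hour_reports)

# intermediates[1] would feed the hourly summary,
# intermediates[2] would be stored for the daily job.
print(intermediates[1], intermediates[2])
```

In this model each report is read once per hour, instead of once for the hourly job and again for the daily or intermediate job.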

Jhp1983 commented 5 days ago

Close.preethiraghavan1