Support for Encrypted Intermediates in Aggregation Service: Feedback Requested

preethiraghavan1 commented 1 month ago

Current System

Currently, the Aggregation Service aggregates contributions from raw encrypted reports and produces aggregated, noised histograms. When an adtech needs to batch and send the same reports repeatedly for aggregation (e.g., when querying over extended time ranges such as daily, weekly, and monthly with re-querying), recomputing over the same underlying data multiple times is computationally expensive, slows down queries, and incurs avoidable costs.

Note that re-querying is in the design phase and is not yet an available feature in Aggregation Service.

Proposed Enhancement

Introduce the ability to cache the aggregated contributions from decrypted aggregatable reports in un-noised Encrypted Intermediate reports, to be used in multiple subsequent aggregation jobs. Encrypted intermediates can reduce computation, resulting in lower latency and cloud cost for adtechs. The reduction comes from decrypting and aggregating each report only once and caching the un-noised aggregation (the Encrypted Intermediates) for future use. Subsequent jobs that include the cached encrypted intermediates for the same data only need to decrypt and aggregate the new reports, which are fewer, resulting in overall lower latency.
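As a rough illustration of the idea (not the Aggregation Service implementation), the sketch below models reports as plain lists of (bucket, value) pairs and an intermediate as an un-noised histogram. Decryption, encryption of the intermediate, padding, and noising are all omitted, and the names `aggregate` and `merge` are hypothetical.

```python
from collections import Counter

def aggregate(reports):
    """Sum contributions per bucket; each report is a list of (bucket, value) pairs."""
    histogram = Counter()
    for report in reports:
        for bucket, value in report:
            histogram[bucket] += value
    return histogram

def merge(*histograms):
    """Combine un-noised histograms by summing values per bucket."""
    total = Counter()
    for histogram in histograms:
        total.update(histogram)  # Counter.update adds counts
    return total

# First job: aggregate once and keep the un-noised result as the intermediate.
day1_reports = [[(1, 10), (2, 5)], [(1, 3)]]
intermediate_day1 = aggregate(day1_reports)

# Later job: only the new reports are aggregated; day 1 is reused from the cache.
day2_reports = [[(2, 7)], [(3, 1)]]
two_day_total = merge(intermediate_day1, aggregate(day2_reports))
```

In this simplified model, merging the cached intermediate with the newly aggregated reports gives the same un-noised histogram as re-aggregating every raw report, which is why noise would only need to be applied when a final summary report is produced.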

Motivating Use Cases

Re-querying & Extended Time Range Queries: For frequent reporting or analysis across large timeframes, for example, daily, weekly, monthly queries using the same data, repeatedly processing the same raw data leads to unnecessary computational overhead. Using Encrypted Intermediates can reduce the overhead, ultimately reducing the job latency and cost. (#732)
Filtering Contributions by Filtering IDs, for example, processing by campaign IDs to do Reach Measurement as proposed here: Each query with filtering IDs processes all the input reports, even when only a fraction of the data is actually relevant. Re-querying amplifies this recomputation, because every query involving filtering IDs over the same data requires sending all reports again, compounding the avoidable reprocessing. Using Encrypted Intermediates for filtering IDs can help avoid these recomputations.
Example user journey using filtering IDs: Consider an adtech with 200M reports that segments its data into 5 campaigns using contribution filtering with filtering IDs [1-5], and 100M output domains per filtering ID. Below is an example of how they would use encrypted intermediates to generate summary reports for each individual campaign and then across all 5 campaigns:
1. Adtech runs an Encrypted Intermediate job to generate an encrypted intermediate 1 (let’s call this EI-1) for filtering ID 1 (let’s call this fid-1)
2. Adtech runs an aggregation service job for the same fid-1 to generate a summary report
3. Adtech repeats steps 1 and 2 for fid-2 through fid-5
4. To generate a summary report across all 5 campaigns, Adtech would run an aggregation job for all filtering IDs that uses EI-1 through EI-5 to generate the final summary report covering all campaigns. For this example, using EIs to generate the final summary report will give an 8X improvement in performance/latency compared to re-querying without encrypted intermediates.
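The steps above can be sketched as follows. This is a minimal model under stated assumptions: each contribution is a hypothetical (filtering_id, bucket, value) tuple, `build_intermediate` is an invented stand-in for an Encrypted Intermediate job, and encryption, padding, noising, and the output domain are left out.

```python
from collections import Counter

def build_intermediate(contributions, filtering_id):
    """Aggregate only the contributions tagged with the given filtering ID.

    Like a filtering-ID query over raw reports, this scans every contribution.
    """
    histogram = Counter()
    for fid, bucket, value in contributions:
        if fid == filtering_id:
            histogram[bucket] += value
    return histogram

# Toy data: (filtering_id, bucket, value)
contributions = [(1, "a", 2), (2, "a", 1), (1, "b", 4), (5, "c", 3), (3, "a", 6)]

# Steps 1-3: one intermediate per campaign (EI-1 through EI-5).
intermediates = {fid: build_intermediate(contributions, fid) for fid in range(1, 6)}

# Step 4: the cross-campaign summary merges the five intermediates
# instead of scanning the raw contributions again.
cross_campaign = Counter()
for intermediate in intermediates.values():
    cross_campaign.update(intermediate)
```

The final merge touches only the five small histograms, which is where the latency saving in step 4 would come from in this model. The toy data says nothing about the size of that saving in practice.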

The graph below illustrates how an adtech could benefit from Encrypted Intermediates in their workflow. Daily intermediate reports, built incrementally, feed into further intermediate reports for various cadences and segments. We expect that summary report generation leveraging these encrypted intermediates will lead to significant speed improvements compared to using raw reports directly.

[Figure: Encrypted Intermediates timeline]

Note: The durations are for illustration purposes only. Real durations will vary with query size. In general, a query running on intermediates should run faster than a query over the full set of raw reports for the same data.

Design Considerations

Encrypted intermediate reports will also be histograms, but they will be stored un-noised and encrypted.
Encrypted intermediate reports will be padded to a fixed size, similar to raw reports, to prevent revealing the size of the contributions.
They will be written to the cloud location specified in the request, similar to summary reports.
These intermediates can contribute to further queries, both intermediate and final.
Aggregation report accounting will be applied to Encrypted Intermediate reports as it is for raw reports.

Cost Considerations

Whether to use Encrypted Intermediates for a use case depends on a cost-benefit analysis. Generating Encrypted Intermediates involves processing, encryption, latency, and storage costs. If the savings in processing outweigh these costs, using Encrypted Intermediates is recommended. The cost difference varies by use case: some queries may benefit from encrypted intermediates, while others may not. Guidance will be provided to help adtechs make this decision.
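One way to think about the trade-off is a simple break-even comparison. The sketch below is purely illustrative: the cost model, the function name `intermediates_pay_off`, and all numbers are assumptions and do not come from the Aggregation Service or from any published guidance.

```python
def intermediates_pay_off(raw_query_cost, ei_query_cost,
                          ei_build_cost, ei_storage_cost, num_queries):
    """Return True if the assumed total cost with intermediates is lower.

    Assumed model: every query costs raw_query_cost without intermediates;
    with intermediates there is a one-off build and storage cost, and each
    query then costs ei_query_cost.
    """
    without_ei = raw_query_cost * num_queries
    with_ei = ei_build_cost + ei_storage_cost + ei_query_cost * num_queries
    return with_ei < without_ei

# Made-up cost units: a single query does not recover the build cost,
# but three queries over the same data do.
single_query = intermediates_pay_off(10, 2, 12, 1, num_queries=1)
three_queries = intermediates_pay_off(10, 2, 12, 1, num_queries=3)
```

Under this model the benefit grows with the number of queries that reuse the same data, which matches the re-querying motivation above.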

We believe this enhancement can help adtechs avoid repeated, costly computations while providing latency improvements and overall cost reduction for their jobs. We're interested in your feedback on this idea. In particular,

What use cases do you think Encrypted Intermediates can be useful for?
What kind of batching and report management assistance would be helpful when using this feature?

alois-bissuel commented 1 month ago

Thanks for the very interesting proposal.

I was wondering whether this proposal could be useful in the filtering_id case without requerying.

We want to use Shared Storage with Private Aggregation for two very different use cases: one with a low report count (say at most 1 million per day) and a daily aggregation, and the other with a high number of reports (say 200 million reports per hour) and an hourly aggregation. Each uses a different filtering_id to partition the use cases. Could we leverage this proposal so that the hourly job preaggregates the result for the daily one?

preethiraghavan1 commented 1 month ago

Thanks @alois-bissuel for your question.

Suppose you have 2 filtering IDs: 1 (= hourly) and 2 (= daily). With Shared Storage, IIUC, you receive about (200 + 1/24) million reports each hour. The hourly job (filtering IDs = 1) would process these ~200M reports, but the daily job (filtering IDs = 2) would process (200 * 24 + 1) = 4801M reports in order to aggregate the contributions of just 1M reports, filtering the rest out.
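For reference, the arithmetic above written out, using only the volumes quoted in the question (the variable names are illustrative):

```python
HOURLY_USE_CASE_M_PER_HOUR = 200  # high-volume use case, millions of reports per hour
DAILY_USE_CASE_M_PER_DAY = 1      # low-volume use case, millions of reports per day
HOURS_PER_DAY = 24

# Reports arriving each hour across both use cases, in millions.
reports_per_hour_m = HOURLY_USE_CASE_M_PER_HOUR + DAILY_USE_CASE_M_PER_DAY / HOURS_PER_DAY

# Reports a daily job over raw reports has to read, in millions.
daily_job_input_m = HOURLY_USE_CASE_M_PER_HOUR * HOURS_PER_DAY + DAILY_USE_CASE_M_PER_DAY

# Share of that input that is relevant to filtering ID 2.
relevant_fraction = DAILY_USE_CASE_M_PER_DAY / daily_job_input_m
```

Here `daily_job_input_m` is 4801, so roughly 0.02% of the daily job's input actually contributes to its result.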

You definitely could use Encrypted Intermediates for the daily jobs by generating hourly intermediates.

Intermediate Queries: For each hour, run an Encrypted Intermediate (EI) query with filtering_id = 2 to aggregate data for that specific hour. Store these encrypted intermediate results.
Final Aggregation: At the end of the 24th hour, combine the 24 intermediate reports using a regular query. This final query will process only 1 million reports' worth of aggregated data plus padding. Note that the padding in the final query is determined by the output domain used. This final daily query would operate on a significantly smaller dataset, resulting in lower latencies.
Hourly Aggregation: [Existing] Run the regular query with filtering_ids = 1 for the hourly job. These are the jobs that are already being run to produce hourly summary reports. This is independent of the intermediate queries in the first step.

You can adjust the frequency of intermediate queries (e.g., every 2 hours) or implement incremental EI queries for continuous aggregation. Example with Incremental EI Queries:

Hour 1: EI_query(filtering_ids=2, 1st hour reports) -> EI_1_REPORTS
Hour 2: EI_query(filtering_ids=2, [2nd hour, EI_1_REPORTS]) -> EI_2_REPORTS (incrementally aggregates with the 1st-hour results)
...
Hour 24: The output of the 24th-hour incremental query represents the aggregated results for the entire day.
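A minimal sketch of the incremental pattern above, assuming contributions are (filtering_id, bucket, value) tuples and an intermediate is an un-noised histogram. The `ei_query` function here is a hypothetical stand-in rather than a real Aggregation Service API, and encryption, padding, and noising are not modeled.

```python
from collections import Counter

def ei_query(hour_contributions, previous_intermediate=None, filtering_id=2):
    """Fold one hour of contributions into the running un-noised intermediate."""
    histogram = Counter(previous_intermediate or {})
    for fid, bucket, value in hour_contributions:
        if fid == filtering_id:  # contributions for other filtering IDs are dropped
            histogram[bucket] += value
    return histogram

# Toy data: every hour has one contribution for each filtering ID.
hourly_batches = [[(1, "bid", 5), (2, "audience", 1)] for _ in range(24)]

running = None
for batch in hourly_batches:
    # Hour N: EI_query(filtering_ids=2, [Nth hour, EI_(N-1)_REPORTS]) -> EI_N_REPORTS
    running = ei_query(batch, running)

daily_intermediate = running  # output of the 24th-hour incremental query
```

Each hour's raw contributions are read exactly once, and the work is spread across 24 small jobs rather than one large job at the end of the day.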

Note: Without re-querying, the computation is not reduced, but it is distributed. The benefit would be the low latency of the last query where you will get quicker summary reports.

To make sure we understand how it all fits together, could you provide more details on your use case? What metrics are you trying to compute for the daily vs. the hourly batches? Do you think Encrypted Intermediates might be useful for lower latency, or for addressing the challenge of processing the 4 billion reports in a TEE for the daily query?

alois-bissuel commented 1 month ago

Sorry for the late answer.

For the record, we want to compute statistics on all the bids made on Protected Audience (the hourly job with a lot of data) and lower-intensity metrics on the daily status of users, for instance audience size (the daily job).

My main concern was the processing time of the daily aggregation (for a low number of reports with the corresponding filtering_id). It looks unreasonable to have to process all 4 billion daily reports in one job to get a result on only a small fraction of them. We do not necessarily care much about the lower latency, though it has added value in itself.

Note that this whole problem would not exist if batching_ids were implemented.

I was wondering whether the hourly aggregation could be used to produce two encrypted intermediates for both filtering ids. This would mean that reports would need to be processed only once, as I don't know how job runtime (and need for large executors) behave with the number of records to be processed.

Jhp1983 commented 3 weeks ago

Close.preethiraghavan1

preethiraghavan1 commented 2 weeks ago

@alois-bissuel, I appreciate you sharing your use case. Processing 4B reports, especially when only a small portion is required for aggregation, does seem excessive. It is an interesting idea to generate multiple Encrypted Intermediates from the same job. We will look into it when developing this feature. Overall, we think Encrypted Intermediates can help in this use case and we are working on the feature details, which we will share once available.

alois-bissuel commented 2 weeks ago

Thanks for the feedback!
