Open preethiraghavan1 opened 1 month ago
Thanks for the very interesting proposal.
I was wondering whether this proposal could be useful in the filtering_id case without requerying.
We want to use Shared Storage with Private Aggregation for two very different use cases: one with a low report count (say, at most 1 million per day) and a daily aggregation, and the other with a high number of reports (say, 200 million reports per hour) and an hourly aggregation. Each use case has its own filtering_id to partition the reports. Could we leverage this proposal so that the hourly job pre-aggregates the result for the daily one?
Thanks @alois-bissuel for your question.
Suppose you have 2 filtering IDs: 1 (= hourly) and 2 (= daily). With Shared Storage, IIUC, you receive about (200 + 1/24) million reports every hour. The hourly job (filtering ID = 1) would process these ~200M reports, but the daily job (filtering ID = 2) would process (200 * 24 + 1) = 4801M reports to aggregate the contributions of just 1M reports, filtering the rest out.
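To make the arithmetic above concrete, here is a minimal sketch that recomputes the daily report count and the share of reports that actually matter to the daily job. The volumes come from this thread; the variable names are illustrative only.

```python
# Volumes quoted in the thread; names are illustrative, not part of any API.
HOURLY_USE_CASE_PER_HOUR = 200_000_000  # reports tagged with filtering ID 1
DAILY_USE_CASE_PER_DAY = 1_000_000      # reports tagged with filtering ID 2
HOURS_PER_DAY = 24

# Without intermediates, the daily job has to scan every report of the day.
reports_per_day = HOURLY_USE_CASE_PER_HOUR * HOURS_PER_DAY + DAILY_USE_CASE_PER_DAY

# Fraction of the scanned reports that carry filtering ID 2.
relevant_share = DAILY_USE_CASE_PER_DAY / reports_per_day

print(f"Reports scanned by the daily job: {reports_per_day:,}")
print(f"Share relevant to filtering ID 2: {relevant_share:.4%}")
```

Roughly one report in 4,800 is relevant to the daily query, which is the imbalance the question is about.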
You definitely could use Encrypted Intermediates for the daily jobs by generating hourly intermediates.
You can adjust the frequency of intermediate queries (e.g., every 2 hours) or implement incremental EI queries for continuous aggregation. Example with Incremental EI Queries:
EI_query(filtering_ids=2, 1st hour reports)
-> EI_1_REPORTS
EI_query(filtering_ids=2, [2nd hour, EI_1_REPORTS])
-> EI_2_REPORTS (incrementally aggregates with the 1st-hour results)
Note: Without re-querying, the computation is not reduced, but it is distributed. The benefit would be the low latency of the last query where you will get quicker summary reports.
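The incremental pattern can be sketched in plain Python as below. This is only a toy model under stated assumptions: `ei_query` is a hypothetical stand-in and not an Aggregation Service API, reports are modeled as already-decrypted `(filtering_id, bucket, value)` tuples, and encryption and noising are left out entirely.

```python
def ei_query(filtering_id, raw_reports, prior_intermediate=None):
    """Hypothetical stand-in for an Encrypted Intermediate query.

    Sums the contributions that match `filtering_id` on top of an optional
    earlier intermediate. Real intermediates would be encrypted; plain
    dictionaries are used here only to show the data flow.
    """
    totals = dict(prior_intermediate or {})
    for report_filtering_id, bucket, value in raw_reports:
        if report_filtering_id == filtering_id:
            totals[bucket] = totals.get(bucket, 0) + value
    return totals


# Toy reports: (filtering_id, bucket, value). IDs 1 and 2 follow the thread.
hour_1 = [(1, "bid", 5), (2, "audience", 3), (1, "bid", 7)]
hour_2 = [(2, "audience", 4), (1, "bid", 2), (2, "reach", 1)]

ei_1 = ei_query(2, hour_1)                           # -> EI_1_REPORTS
ei_2 = ei_query(2, hour_2, prior_intermediate=ei_1)  # -> EI_2_REPORTS

# Same totals as a single query over both hours of raw reports.
one_shot = ei_query(2, hour_1 + hour_2)
assert ei_2 == one_shot
```

Each hour's raw reports are scanned once, when they arrive; the last step only folds a small intermediate into the newest hour, which is where the latency benefit comes from.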
To make sure we understand how it all fits together, could you provide more details on your use case? What metrics are you trying to compute for the daily vs. the hourly batches? Do you think Encrypted Intermediates might be useful for lower latency, or for addressing the challenge of processing the 4 billion reports in the TEE for the daily query?
Sorry for the late answer.
For the record, we want to compute statistics on all the bids done on Protected Audience (the hourly job with a lot of data) and lower intensity metrics for the daily status of users -- for instance audience size -- for the daily job.
My main concern was processing time for the daily aggregation (for a low number of reports with the corresponding filtering_id). It looks unreasonable to have to process all 4 billion daily reports in one job to get a result on only a small fraction of them. We do not necessarily care much about lower latency, though it has added value in itself.
Note that this whole problem would not exist if batching_ids were implemented.
I was wondering whether the hourly aggregation could be used to produce two encrypted intermediates, one for each filtering ID. This would mean that reports would need to be processed only once. I ask because I don't know how job runtime (and the need for large executors) scales with the number of records to be processed.
@alois-bissuel, I appreciate you sharing your use case. Processing 4B reports, especially when only a small portion is required for aggregation, does seem excessive. Generating multiple Encrypted Intermediates from the same job is an interesting idea, and we will look into it when developing this feature. Overall, we think Encrypted Intermediates can help in this use case. We are working on the feature details and will share them once available.
Thanks for the feedback!
Current System
Currently, the Aggregation Service aggregates contributions from raw encrypted reports and produces aggregated and noised histograms. In scenarios where an adtech needs to batch and send the same reports repeatedly for aggregation (e.g., when querying over extended time ranges such as daily, weekly, or monthly with re-querying), recomputing the same underlying data multiple times can be computationally expensive, slow down queries, and incur avoidable costs.
Note that re-querying is in the design phase and is not yet an available feature in Aggregation Service.
Proposed Enhancement
Introduce the ability to cache the aggregated contributions from decrypted aggregatable reports in un-noised Encrypted Intermediate reports that can be reused across multiple subsequent aggregation jobs. Encrypted Intermediates can reduce computation, and with it latency and cloud cost for adtechs. The reduction comes from decrypting and aggregating each report only once and caching the un-noised aggregation (the Encrypted Intermediate) for future use. Subsequent jobs that include the cached Encrypted Intermediates for the same data only need to decrypt and aggregate the new, and fewer, reports, resulting in lower overall latency.
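A minimal sketch of the idea, assuming reports are already decrypted `(bucket, value)` pairs and leaving encryption out: each batch is aggregated once into an un-noised cache entry, later queries merge cache entries, and noise is applied only when a summary is produced. The function names and the Laplace-style noise used here are illustrative assumptions, not the service's actual interface or noising mechanism.

```python
import random


def aggregate(reports):
    """Sum values per bucket; stands in for decrypting and aggregating reports."""
    totals = {}
    for bucket, value in reports:
        totals[bucket] = totals.get(bucket, 0) + value
    return totals


def merge(partials):
    """Combine un-noised partial aggregates by summing matching buckets."""
    merged = {}
    for partial in partials:
        for bucket, value in partial.items():
            merged[bucket] = merged.get(bucket, 0) + value
    return merged


def summarize(totals, noise_scale, rng):
    """Add noise only at the final step (illustrative Laplace-style noise)."""
    if noise_scale == 0:
        return dict(totals)
    noisy = {}
    for bucket, value in totals.items():
        # The difference of two exponential draws is Laplace distributed.
        noise = rng.expovariate(1 / noise_scale) - rng.expovariate(1 / noise_scale)
        noisy[bucket] = value + noise
    return noisy


daily_reports = {
    "mon": [("a", 2), ("b", 1)],
    "tue": [("a", 4)],
    "wed": [("b", 5), ("c", 1)],
}

# Each raw report is aggregated exactly once into an un-noised cache entry.
cache = {day: aggregate(reports) for day, reports in daily_reports.items()}

# Later queries reuse the cache instead of touching raw reports again.
two_day = merge([cache["mon"], cache["tue"]])
three_day = merge(cache.values())
summary = summarize(three_day, noise_scale=1.0, rng=random.Random(0))
```

Keeping the cached values un-noised matters in this sketch: noise is added once per summary, rather than accumulating every time partial results are combined.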
Motivating Use Cases
Filtering contributions by filtering IDs, for example processing by campaign IDs to do Reach Measurement as proposed here: each query with filtering IDs processes all the input reports, even when only a fraction of the data is actually relevant. Re-querying amplifies this inherent recomputation, because every query involving filtering IDs over the same data requires sending all reports again, which compounds the avoidable reprocessing. Using Encrypted Intermediates for filtering IDs can help avoid these recomputations.
As an example of a user journey with filtering IDs, consider an adtech with 200M reports that segments its data into 5 campaigns using contribution filtering with filtering IDs [1-5], and 100M output domains per filtering ID. Below is an example of how they would use encrypted intermediates to generate summary reports for each individual campaign and then across all 5 campaigns:
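One way to picture this journey in code is the toy sketch below. It assumes one intermediate per filtering ID and models reports as decrypted `(filtering_id, bucket, value)` tuples; `build_intermediate` and `merge_intermediates` are hypothetical helpers, and encryption and noising are omitted.

```python
def build_intermediate(reports, filtering_id):
    """Hypothetical per-campaign query: keep one filtering ID, sum per bucket."""
    totals = {}
    for report_filtering_id, bucket, value in reports:
        if report_filtering_id == filtering_id:
            totals[bucket] = totals.get(bucket, 0) + value
    return totals


def merge_intermediates(intermediates):
    """Combine per-campaign intermediates without rereading raw reports."""
    merged = {}
    for intermediate in intermediates:
        for bucket, value in intermediate.items():
            merged[bucket] = merged.get(bucket, 0) + value
    return merged


# Toy data standing in for the 200M reports split across filtering IDs 1-5.
reports = [
    (1, "k1", 10), (2, "k1", 4), (3, "k2", 6),
    (1, "k2", 1), (5, "k3", 8), (4, "k1", 2),
]

# One intermediate per campaign; each can back that campaign's summary report.
per_campaign = {fid: build_intermediate(reports, fid) for fid in range(1, 6)}

# The cross-campaign result is derived from the five intermediates alone.
all_campaigns = merge_intermediates(per_campaign.values())
```

The cross-campaign step reads five small intermediates rather than the full set of raw reports, which is the saving this use case describes.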
The graph below illustrates how an adtech could benefit from Encrypted Intermediates in their workflow. Daily intermediate reports, built incrementally, feed into further intermediate reports for various cadences and segments. We expect that summary report generation leveraging these encrypted intermediates will lead to significant speed improvements compared to using raw reports directly.
Note: The durations are for illustration purposes only. The real durations will vary depending on the query sizes. In general, a query running on intermediates should be faster than a query over the full set of raw reports covering the same data.
Design Considerations
Cost Considerations
Whether to use Encrypted Intermediates for a use case depends on a cost-benefit analysis. Generating Encrypted Intermediates involves processing, encryption, latency, and storage costs. If the savings in processing outweigh these costs, using Encrypted Intermediates is recommended. The cost difference varies by use case: some queries may benefit from using Encrypted Intermediates, while others may not. Guidance will be provided to help adtechs make this decision.
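As a rough illustration of that trade-off, the sketch below compares two toy cost totals. Everything in it is an assumption made for illustration: the linear cost model, the parameter names, and the numbers, which are arbitrary cost units rather than real prices or measurements.

```python
def compare_costs(raw_reports, intermediate_rows, queries,
                  raw_unit_cost, intermediate_unit_cost, ei_overhead):
    """Toy linear cost model (hypothetical, not official guidance).

    Without intermediates, every query reprocesses all raw reports.
    With intermediates, raw reports are processed once, a one-off overhead
    covers encryption and storage, and each query reads only the
    intermediate rows.
    """
    without_ei = queries * raw_reports * raw_unit_cost
    with_ei = (raw_reports * raw_unit_cost
               + ei_overhead
               + queries * intermediate_rows * intermediate_unit_cost)
    return without_ei, with_ei


# A single query over the data: the one-off overhead is not recovered.
single = compare_costs(raw_reports=1000, intermediate_rows=50, queries=1,
                       raw_unit_cost=1, intermediate_unit_cost=1,
                       ei_overhead=200)

# Three queries over the same data: reuse outweighs the overhead.
repeated = compare_costs(raw_reports=1000, intermediate_rows=50, queries=3,
                         raw_unit_cost=1, intermediate_unit_cost=1,
                         ei_overhead=200)

print("1 query  (without, with):", single)
print("3 queries (without, with):", repeated)
```

In this toy model the benefit only appears once the same data is queried more than once, which matches the point that some queries gain from intermediates and others do not.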
We believe this enhancement can help adtechs avoid repeated, costly computations while providing latency improvements and overall cost reduction for their jobs. We're interested in your feedback on this idea. In particular,