private-attribution / ipa

A raw implementation of Interoperable Private Attribution
MIT License
41 stars 23 forks source link

[WIP] Breakdown Reveal Aggregation #1172

Closed cberkhoff closed 1 month ago

cberkhoff commented 3 months ago

This PR is a work in progress to enable Breakdown Reveal Aggregation (aka Improved Aggregation). Aggregation steps happen after attribution. The input to aggregation is a stream of tuples of (attributed breakdown key, attributed trigger value). The output of aggregation is a Histogram. Breakdown Keys and Trigger Values are assigned by the advertiser and sent in the input of IPA. Breakdown Keys values are expected to be dense. How breakdown keys and trigger values are defined is out-of-scope.

High level explanation of the protocol:

  1. Add fake attribution outputs.
  2. Shuffle.
  3. Reveal Breakdown Keys. By having shuffled and adding fake entries we protected the identities of individuals. Trigger values are not revealed.
  4. Aggregation of Trigger Value by Breakdown Key (Think of group by).

DP noise to histogram buckets will be added by the caller of this function. This because adding noise is really unrelated to how things are added up.

Following are the improvements to be done before merging in no particular order:

codecov[bot] commented 3 months ago

Codecov Report

Attention: Patch coverage is 99.56897% with 1 line in your changes missing coverage. Please review.

Project coverage is 92.84%. Comparing base (23202a2) to head (47569e4). Report is 7 commits behind head on main.

Files Patch % Lines
...c/protocol/ipa_prf/aggregation/breakdown_reveal.rs 99.23% 1 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #1172 +/- ## ========================================== + Coverage 92.79% 92.84% +0.04% ========================================== Files 197 197 Lines 30523 30735 +212 ========================================== + Hits 28325 28535 +210 - Misses 2198 2200 +2 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

cberkhoff commented 3 months ago

This change is still a WIP. Just want to get some baseline numbers from Draft.

cberkhoff commented 3 months ago

I ran a queries and got the following stats:

  Old New  
Runtime [m] 9.583333333 30.0 313.48%
Aggregation / Total Runtime 63.9% 93.59%  
Shuffle   0.41%  
Reveal   4.92%  
Additions   83.05%  

Having this in mind I focus on Additions and then Reveal

cberkhoff commented 3 months ago

With the seq_join in add_tvs_by_bk performance improved by ~6x and now the total runtime is the same between old and new:

  Old New
Runtime [m] 9.583333333 9.5
Aggregation time / Total Runtime 63.9% 79.16%
     
Attribution   23.02%
Shuffle   1.56%
Reveal   18.44%
Additions   56.98%
cberkhoff commented 2 months ago

Using seq_join in reveal_breakdowns made it 14x faster. The protocol as a whole is 18% faster than the old one.

cberkhoff commented 2 months ago

Added a new aggregation logic for Trigger Values into the Histogram. With it, aggregation as a whole is about 3x faster than the current one.

Results running IPA for 1M events      
  Old [duration] New [duration] Improvement
Total Runtime 09:35 03:57 242.62%
Aggregation time 06:08 02:06 292.06%
       
    Duration % of aggregation
Attribution - 01:30 71.43%
Shuffle - 00:07 5.56%
Reveal - 00:05 3.97%
Additions - 00:23 18.25%
cberkhoff commented 2 months ago

Digging trough the logs I think vectorizing can make aggregation take 20seconds off aggregation, which is taking around 126 seconds. There might be more gains from more efficient code.

Spreadsheet