[WIP] Breakdown Reveal Aggregation

cberkhoff commented 3 months ago

This PR is a work in progress to enable Breakdown Reveal Aggregation (aka Improved Aggregation). Aggregation steps happen after attribution. The input to aggregation is a stream of tuples of (attributed breakdown key, attributed trigger value). The output of aggregation is a Histogram. Breakdown Keys and Trigger Values are assigned by the advertiser and sent in the input of IPA. Breakdown Keys values are expected to be dense. How breakdown keys and trigger values are defined is out-of-scope.

High level explanation of the protocol:

Add fake attribution outputs.
Shuffle.
Reveal Breakdown Keys. By having shuffled and adding fake entries we protected the identities of individuals. Trigger values are not revealed.
Aggregation of Trigger Value by Breakdown Key (Think of group by).

DP noise to histogram buckets will be added by the caller of this function. This because adding noise is really unrelated to how things are added up.

Following are the improvements to be done before merging in no particular order:

TODO: Use sharded shuffle.
TODO: For aggregation, we pad TV with zeroes until size of HV and add them together. Expanding bits only as needed.
TODO: seq_join (see quicksort). First reveal, create futures for the remainder.
TODO: Add vectorization + chunks. Change TotalRecords::Indeterminate. "Merge-sort".
TODO: Add sharding.

codecov[bot] commented 3 months ago

Codecov Report

Attention: Patch coverage is 99.56897% with 1 line in your changes missing coverage. Please review.

Project coverage is 92.84%. Comparing base (23202a2) to head (47569e4). Report is 7 commits behind head on main.

Files	Patch %	Lines
...c/protocol/ipa_prf/aggregation/breakdown_reveal.rs	99.23%	1 Missing :warning:

Additional details and impacted files

```diff @@ Coverage Diff @@ ## main #1172 +/- ## ========================================== + Coverage 92.79% 92.84% +0.04% ========================================== Files 197 197 Lines 30523 30735 +212 ========================================== + Hits 28325 28535 +210 - Misses 2198 2200 +2 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

cberkhoff commented 3 months ago

This change is still a WIP. Just want to get some baseline numbers from Draft.

cberkhoff commented 3 months ago

I ran a queries and got the following stats:

	Old	New
Runtime [m]	9.583333333	30.0	313.48%
Aggregation / Total Runtime	63.9%	93.59%
Shuffle		0.41%
Reveal		4.92%
Additions		83.05%

Having this in mind I focus on Additions and then Reveal

cberkhoff commented 3 months ago

With the seq_join in add_tvs_by_bk performance improved by ~6x and now the total runtime is the same between old and new:

	Old	New
Runtime [m]	9.583333333	9.5
Aggregation time / Total Runtime	63.9%	79.16%

Attribution		23.02%
Shuffle		1.56%
Reveal		18.44%
Additions		56.98%

cberkhoff commented 2 months ago

Using seq_join in reveal_breakdowns made it 14x faster. The protocol as a whole is 18% faster than the old one.

cberkhoff commented 2 months ago

Added a new aggregation logic for Trigger Values into the Histogram. With it, aggregation as a whole is about 3x faster than the current one.

Results running IPA for 1M events
	Old [duration]	New [duration]	Improvement
Total Runtime	09:35	03:57	242.62%
Aggregation time	06:08	02:06	292.06%

		Duration	% of aggregation
Attribution	-	01:30	71.43%
Shuffle	-	00:07	5.56%
Reveal	-	00:05	3.97%
Additions	-	00:23	18.25%

cberkhoff commented 2 months ago

Digging trough the logs I think vectorizing can make aggregation take 20seconds off aggregation, which is taking around 126 seconds. There might be more gains from more efficient code.

Spreadsheet

private-attribution / ipa

[WIP] Breakdown Reveal Aggregation #1172

Codecov Report