Consider adding a custom 'label' to allow more flexible batching

alexmturner commented 1 year ago

Currently, the aggregation service only allows each 'shared ID' to be present in one query. A set of reports with the same shared ID cannot be split for separate queries, even if the resulting batches are disjoint.

One option to add more flexibility is to support an optional, custom field (a ‘label’) that is factored into the shared ID generation. We could consider a few different options:

1. Putting the field in the shared_info: The reporting origin would be able to easily split reports into separate batches based on the label. However, this approach would require the label to be set outside the isolated (Shared Storage or Protected Audience) context. It would also require reports to be sent deterministically, as with the context ID, i.e. a null report would be sent even if no contributions are made. This approach is therefore unlikely to work for Protected Audience bidders (see related discussion) and could increase the number of reports sent.
2. Putting the field in the payload: This avoids the deterministic report requirement and would allow the label to be based on cross-site data, i.e. set from inside the isolated contexts. However, it also prevents the reporting origin from directly determining the label embedded in the report. The reporting origin may therefore have to send a larger number of reports to the aggregation service and ask it to filter based on a given set of labels. For certain use cases, the reporting origin may be able to maintain a context ID to label mapping that would avoid this increased scale, albeit less ergonomically than with option 1.
3. Allowing bucket range filtering: Instead of using an explicit label, we could allow filtering based on a range of buckets, with budget only used for that range. This could be more flexible but would also increase the complexity of the Aggregation Service’s privacy budgeting implementation.
4. A combination of the above: We could implement several of the above options and allow them to be used together or in different situations.
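
To make the idea of factoring a label into the shared ID more concrete, here is a minimal sketch. It assumes the shared ID is derived from the `api`, `reporting_origin`, `scheduled_report_time` and `version` fields of the shared_info; the function name `sharedIdKey`, the field values and the serialization are all hypothetical, and a real implementation would presumably hash the result rather than use the string directly.

```javascript
// Hypothetical sketch: sharedIdKey is not an Aggregation Service function.
// A real shared ID would likely be a hash of these fields; a serialized
// array is used here only to keep the example dependency-free.
function sharedIdKey(sharedInfo, label) {
  const parts = [
    sharedInfo.api,
    sharedInfo.reporting_origin,
    sharedInfo.scheduled_report_time,
    sharedInfo.version,
  ];
  // The optional label is appended only when present, so unlabeled
  // reports keep the key they would have had without this feature.
  if (label !== undefined) {
    parts.push(label);
  }
  return JSON.stringify(parts);
}

// Illustrative values only.
const sharedInfo = {
  api: "shared-storage",
  reporting_origin: "https://reporter.example",
  scheduled_report_time: "1700000000",
  version: "1.0",
};

const unlabeledKey = sharedIdKey(sharedInfo);
const mlKey = sharedIdKey(sharedInfo, "ml");
const monitoringKey = sharedIdKey(sharedInfo, "monitoring");
```

Under these assumptions, reports that are otherwise identical in their shared_info get distinct keys per label, so they could be budgeted and queried separately.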

For all of the above approaches, we’ll also need a mechanism to limit the scale impact on the Privacy Budget Service. For example, we want to prevent developers from specifying a unique ‘label’ per report. There are a few options we could consider, including:

The Aggregation Service could limit the number of labels/bucket ranges or shared IDs per query.
We could limit the space of allowed labels/bucket ranges directly, e.g. only allowing integer labels up to a maximum value.
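
The two limits above could be combined into a single check on each query. The sketch below is only an illustration: the constants `MAX_LABEL` and `MAX_LABELS_PER_QUERY`, their values, and the function `validateQueryLabels` are assumptions, not part of any existing service.

```javascript
// Assumed limits, chosen arbitrarily for illustration.
const MAX_LABEL = 255;          // largest integer label allowed
const MAX_LABELS_PER_QUERY = 4; // most distinct labels one query may request

// Hypothetical validator: rejects labels outside the allowed integer
// space and queries that request too many distinct labels.
function validateQueryLabels(labels) {
  const unique = new Set(labels);
  for (const label of unique) {
    if (!Number.isInteger(label) || label < 0 || label > MAX_LABEL) {
      return { ok: false, reason: `label out of range: ${label}` };
    }
  }
  if (unique.size > MAX_LABELS_PER_QUERY) {
    return { ok: false, reason: "too many labels in one query" };
  }
  return { ok: true };
}

const acceptedQuery = validateQueryLabels([0, 1, 2]);
const rejectedQuery = validateQueryLabels([1, 2, 3, 4, 5]);
```

Bounding the label space caps the number of distinct shared IDs a developer can create, which is what prevents a unique label per report.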

This functionality would also be useful for the Attribution Reporting API, so we may want to align on an approach. (For example, bucket range filtering has been proposed earlier.) Note that Attribution Reporting does not currently support making deterministic reports.

csharrison commented 1 year ago

Thanks Alex, I want to note that the context ID / deterministic reports approach is compatible with this related proposal https://github.com/WICG/attribution-reporting-api/issues/974, although it isn't clear all deployments could use that option.

michal-kalisz commented 1 year ago

Thank you for proposing this solution. It seems to be very interesting.

I'm wondering what exactly assigning a label to PAA data would look like. Would it be possible to assign a label to each key, value pair separately, or only once per entire auction?

We have several use cases in which we would like to use PAA: machine learning, monitoring, and reporting. For example, we would like to report:

privateAggregation.contributeToHistogram({bucket: key1, value: val1, label: "ml"})
privateAggregation.contributeToHistogram({bucket: key2, value: val2, label: "ml"})
privateAggregation.contributeToHistogram({bucket: key3, value: val3, label: "monitoring"})
privateAggregation.contributeToHistogram({bucket: key4, value: val4, label: "monitoring"})
privateAggregation.contributeToHistogram({bucket: key5, value: val5, label: "reporting"})

This is related to the fact that each of these cases has different requirements:

ML expects a large amount of data with low noise - we would like to wait a few hours for this data and query the Aggregation Service for aggregated results.
Monitoring expects data as quickly as possible to diagnose problems rapidly.
Reporting is in between - it expects data broken down by hour but can wait a bit longer for it.

It seems that this can also be achieved using proposal 3 - "bucket range filtering". However, if a label can be attached to each individual histogram contribution, that solution seems more convenient.
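
A rough sketch of how per-contribution labels might be used at query time follows. It assumes the decrypted contributions carry a `label` field as in the calls above; the function `aggregateForLabels` and the sample data are hypothetical, and noise addition is left out entirely.

```javascript
// Hypothetical sketch: sums values per bucket, keeping only the
// contributions whose label is in the requested set. No noise is added.
function aggregateForLabels(contributions, labels) {
  const wanted = new Set(labels);
  const histogram = new Map();
  for (const { bucket, value, label } of contributions) {
    if (!wanted.has(label)) continue;
    histogram.set(bucket, (histogram.get(bucket) ?? 0) + value);
  }
  return histogram;
}

// Illustrative contributions; buckets are BigInts as in the API calls.
const contributions = [
  { bucket: 1n, value: 10, label: "ml" },
  { bucket: 2n, value: 20, label: "ml" },
  { bucket: 1n, value: 5, label: "ml" },
  { bucket: 3n, value: 1, label: "monitoring" },
  { bucket: 5n, value: 7, label: "reporting" },
];

// A slow, low-noise "ml" query and a fast "monitoring" query can then
// run over the same reports at different times.
const mlHistogram = aggregateForLabels(contributions, ["ml"]);
const monitoringHistogram = aggregateForLabels(contributions, ["monitoring"]);
```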

kwanmacher commented 1 year ago

This is a very interesting proposal, thank you!

The support that would be most useful to us is very similar to what @michal-kalisz described above, but applied to ARA summary reporting rather than PAA. We have several use cases with different latency requirements that operate on data aggregates with very different cardinality for the different aggregation keys. For example, a reporting use case has many different breakdowns and can wait longer, while a real-time monitoring use case might have far fewer breakdowns but require data to be batched with minimal latency.

Considering that these different use cases will have their values set under different aggregation keys ("reporting", "monitoring") and will collectively share the same total L1 budget for the report, it would be great if we could have the "label" attached to each of the aggregation keys (i.e. option 2 + per key label), and have the ability to include the same aggregatable report in multiple summary reports, as long as each query uses a disjoint set of labels.
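
The disjoint-label condition could be enforced with budget tracking per (shared ID, label) pair. The sketch below assumes that granularity; the class `LabelBudgetLedger` and its method are hypothetical and do not describe how the Privacy Budget Service actually works.

```javascript
// Hypothetical ledger: a query succeeds only if none of its
// (sharedId, label) pairs has been consumed by an earlier query.
class LabelBudgetLedger {
  constructor() {
    this.consumed = new Set();
  }

  tryConsume(sharedId, labels) {
    const keys = [...new Set(labels)].map((label) =>
      JSON.stringify([sharedId, label])
    );
    // Reject the whole query if any pair was already used, and
    // consume nothing in that case.
    if (keys.some((key) => this.consumed.has(key))) {
      return false;
    }
    keys.forEach((key) => this.consumed.add(key));
    return true;
  }
}

const ledger = new LabelBudgetLedger();
const monitoringQuery = ledger.tryConsume("id1", ["monitoring"]);    // first use of "monitoring"
const reportingQuery = ledger.tryConsume("id1", ["reporting"]);      // disjoint from the first
const overlappingQuery = ledger.tryConsume("id1", ["reporting", "ml"]); // reuses "reporting"
```

Here the same report (same shared ID) contributes to both the monitoring and the reporting query, while the third query is rejected because it overlaps with the second.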

A secondary optimization (can be built on top) is to go with option 1 and store the set of labels in the shared_info to allow for more efficient batching of reports, but this is more of a nice to have.

alexmturner commented 11 months ago

Thanks for all the feedback! We've put up a proposal that we hope satisfies your use cases: https://github.com/patcg-individual-drafts/private-aggregation-api/blob/main/flexible_filtering.md.

Note that we've used different terminology to this issue but the proposal aligns with Option 2 (with a possible extension of adding Option 1 later). This proposal allows a separate label for each contribution within a report. And, while the proposal focuses on Private Aggregation, we plan to explore extending it to Attribution Reporting in a separate GitHub issue.

Consider adding a custom 'label' to allow more flexible batching #92