ietf-wg-ppm / draft-ietf-ppm-dap

This document describes the Distributed Aggregation Protocol (DAP) being developed by the PPM working group at IETF.

Labeling of reports for drill-down queries #489

Closed tgeoghegan closed 3 months ago

tgeoghegan commented 1 year ago

Problem statement

Issue #183 discusses the tradeoff between the anti-replay requirements in the collect sub-protocol and flexibility. In particular, ekr articulates a desire to enable cross tabulation and drilling down.

At a high level, the idea is that if you're measuring some metric across a population -- let's say it's the number of milliseconds it takes to perform some operation -- you might also wish to narrow that query to a subset of members based on what region of the world they are in.

The best thing you can do in DAP today is to create one task for each region your users are in. So you'd have tasks for operation_latency_germany, operation_latency_turkey, and so on. This solution isn't a complete dead end, but it is unsatisfactory.

Issue #183 is about that problem statement. In this issue, I want to sketch out one solution to that problem so we can explore its tradeoffs.

Solution outline

DAP task parameters can optionally include a set of labels that can be applied to each report uploaded in the task. Each label is a key-value pair. The keys are integers[^1][^2]. The values can be either strings or integers[^3]. The type of each label's values is also part of the task parameters.

When uploading a report, Clients MAY include one or more of the labels[^4]. We extend struct ReportMetadata:

```
uint32 LabelKey;

enum {
  string(0),
  integer(1),
  /* possibly more types */
  (255)
} LabelType;

struct {
  LabelKey key;
  LabelType label_type;
  select (Label.label_type) {
      case string: opaque value<1..2^32-1>;
      case integer: int32 value;
  };
} Label;

struct {
  ReportID report_id;
  Time time;
  Label labels<0..2^8-1>;
} ReportMetadata;
```
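For illustration, here's a minimal Rust sketch (my own, not from the draft) of how a string-valued label might be encoded, assuming the standard TLS presentation-language encoding of the structs above:

```rust
/// Sketch of encoding a string-valued Label, assuming the usual TLS
/// wire encoding: a 4-byte key, a 1-byte type, and the value with a
/// 4-byte length prefix (for the opaque value<1..2^32-1> vector).
fn encode_string_label(key: u32, value: &str) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&key.to_be_bytes()); // LabelKey
    out.push(0); // LabelType.string
    out.extend_from_slice(&(value.len() as u32).to_be_bytes()); // length prefix
    out.extend_from_slice(value.as_bytes()); // value bytes
    out
}

fn main() {
    // Label { key: 7, label_type: string, value: "DE" }
    assert_eq!(
        encode_string_label(7, "DE"),
        [0, 0, 0, 7, 0, 0, 0, 0, 2, 0x44, 0x45]
    );
}
```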

Aggregators then store the labels with the report shares. When Collectors make queries, they can provide zero or more labels and values, instructing the Aggregators to aggregate only over the reports whose labels match every label enumerated in the query[^5]. We extend struct Query:

```
struct {
  QueryType query_type;
  select (Query.query_type) {
    case time_interval: Interval batch_interval;
    case fixed_size: FixedSizeQuery fixed_size_query;
  };
  Label labels<0..2^8-1>;
} Query;
```
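To make the filtering step concrete, here's a minimal Rust sketch with hypothetical types (none of this is in the draft or any implementation):

```rust
use std::collections::HashMap;

/// Hypothetical label value, mirroring the LabelType enum above.
#[derive(Debug, Clone, PartialEq, Eq)]
enum LabelValue {
    String(String),
    Integer(i32),
}

/// A stored report share, reduced to the fields relevant here.
struct StoredReportShare {
    report_id: [u8; 16],
    labels: HashMap<u32, LabelValue>, // label key -> value
}

/// Select the shares that match *every* label in the query. A report
/// with extra labels still matches; a report missing a queried label
/// (or carrying a different value) does not.
fn filter_by_labels<'a>(
    shares: &'a [StoredReportShare],
    query_labels: &HashMap<u32, LabelValue>,
) -> Vec<&'a StoredReportShare> {
    shares
        .iter()
        .filter(|share| {
            query_labels
                .iter()
                .all(|(key, value)| share.labels.get(key) == Some(value))
        })
        .collect()
}
```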

Batch validation

A batch is now defined by both the existing query (i.e., the time interval, or the batch ID in the fixed-size case) and the labels specified by the Collector. Otherwise, batch validation works as before.
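As a sketch of what this could mean for an Aggregator's bookkeeping (hypothetical types, my own naming): the key under which max_batch_query_count is tracked would now include the labels alongside the existing query:

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum LabelValue {
    String(String),
    Integer(i32),
}

/// The existing batch identity: a time interval or a batch ID.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum BatchSelector {
    TimeInterval { start: u64, duration: u64 },
    FixedSize { batch_id: [u8; 32] },
}

/// A batch is now identified by the query *and* the queried labels. A
/// BTreeMap keeps labels in canonical order, so the same label set
/// always yields the same key for max_batch_query_count accounting.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct BatchKey {
    selector: BatchSelector,
    labels: BTreeMap<u32, LabelValue>,
}
```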

Troubling implications

Label values are fingerprints

The whole point of DAP is to allow aggregation over sensitive values without revealing individual contributions, but the labels applied to reports may themselves be privacy sensitive. In particular, tasks may define so many labels that they allow high precision fingerprinting of clients by Aggregators, though at least Collectors wouldn't see the labels on individual reports.

Labels and their values are opaque to DAP, so we could punt the problem to deployments and write a security consideration that labels should be chosen carefully. But maybe there's a clever cryptographic solution here? Could the label evaluation be done in MPC, or would that add more rounds to the protocol?

Breaks eager aggregation

In VDAFs that don't have aggregation parameters, like the Prio3 family, Aggregators can eagerly aggregate reports ahead of any request from the Collector. This is no longer possible in the presence of labels because the Aggregators can't anticipate what labels the Collectors will query on (and if this were possible, then the Collector should have used distinct tasks in the first place and we wouldn't need labels at all).

I suspect we can live with this problem, especially since deployments can opt out of labels and thus re-enable eager aggregation.

Queries in wrong order can lock Collector out of data

We started with the problem statement of enabling drill-down, which is to say, aggregating over a whole population and then doing another query over a subset of that population. However, due to collection anti-replay requirements, this isn't possible: the query over the entire task would consume the max_batch_query_count for every report, and the subsequent query for some label value would fail.

Instead, Collectors would have to make queries over the most specific combinations of labels and then combine those query results outside of DAP. For example, take Prio3Count over each of country = [Kenya, Turkey, Germany], and then sum each of those together to get the count over the entire population.
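For example, the outside-of-DAP combination step is just a sum over the per-label aggregate results (sketch with made-up numbers):

```rust
fn main() {
    // Hypothetical Prio3Count aggregate results, one DAP query per country.
    let per_country = [("Kenya", 1_204u64), ("Turkey", 877), ("Germany", 2_311)];

    // The population-wide count is recovered by summing outside of DAP.
    let total: u64 = per_country.iter().map(|&(_, count)| count).sum();
    assert_eq!(total, 4_392);
}
```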

To do better here, we'd have to relax the batch validation requirements. @simon-friedberger suggested that drill-down could be OK so long as the more specific queries yield disjoint sets, each of which is large enough to meet the task's minimum batch size.

For example, suppose we have a task with minimum batch size 100 and a single label that has 10 possible values. Further, suppose we have a time interval containing 10,000 reports, and each value of the label appears on 1,000 of them. Is it acceptable, with respect to privacy, to allow one aggregation over the 10,000 reports, and then 10 more aggregations, one over each of the 10 label values?

Now suppose the task has a second label, also with 10 possible values, and that there are 100 reports with each combination of label values. Would 100 more aggregations, one over each of those combinations, be acceptable?
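Here's a sketch of the relaxed validation rule suggested above, assuming Aggregators can count the reports in each (disjoint) label partition; the function name is hypothetical:

```rust
/// Allow a set of drill-down queries iff they partition the batch into
/// disjoint groups, each at least min_batch_size. `partition_sizes`
/// holds the number of reports matching each (disjoint) label value.
fn drill_down_allowed(partition_sizes: &[usize], min_batch_size: usize) -> bool {
    partition_sizes.iter().all(|&n| n >= min_batch_size)
}

fn main() {
    // The example above: 10,000 reports, one label with 10 values,
    // 1,000 reports per value, minimum batch size 100.
    let one_label = vec![1_000; 10];
    assert!(drill_down_allowed(&one_label, 100));

    // With a second 10-valued label: 100 combinations with 100 reports
    // each still (just barely) meets the minimum batch size.
    let two_labels = vec![100; 100];
    assert!(drill_down_allowed(&two_labels, 100));
}
```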

[^1]: What integer width should we use for the keys?

[^2]: We could also make the keys strings, but nobody except the Collector needs to know what the labels semantically mean. A mapping from label key to some description can be maintained in the Collector, and it's more efficient to encode an integer into reports than arbitrary-length strings.

[^3]: Do we need label value types besides these? Floating-point numbers?

[^4]: Should all the labels defined in a task be mandatory?

[^5]: As written, this implies that queries can only test for value equality, but we could design a richer query language that lets you do things like test string prefixes, suffixes, and substrings, do integer comparisons, or do negations. I'm wary of taking on the challenge of designing a sufficiently rich query language, but there is lots of prior art, such as Apple's NSPredicate.

[^6]: This makes a strong case for including negations in the query language, so that the Collector's second query could express color = blue && country != Kenya.

simon-friedberger commented 1 year ago

> Troubling implications
>
> Label values are fingerprints
>
> The whole point of DAP is to allow aggregation over sensitive values without revealing individual contributions, but the labels applied to reports may themselves be privacy sensitive. In particular, tasks may define so many labels that they allow high precision fingerprinting of clients by Aggregators, though at least Collectors wouldn't see the labels on individual reports.

Encoding different labels by creating many tasks has the same problem, although a reasonable limit on the number of tasks does automatically limit the leakage. Still, given that we also don't specify what the minimum batch size should be, I think it would be acceptable to let task creators decide here.

> Breaks eager aggregation
>
> In VDAFs that don't have aggregation parameters, like the Prio3 family, Aggregators can eagerly aggregate reports ahead of any request from the Collector. This is no longer possible in the presence of labels because the Aggregators can't anticipate what labels the Collectors will query on (and if this were possible, then the Collector should have used distinct tasks in the first place and we wouldn't need labels at all).
>
> I suspect we can live with this problem, especially since deployments can opt out of labels and thus re-enable eager aggregation.

Agreed, there is no reason not to aggregate ahead of time if it can be determined that there are enough reports, for example because there are few enough labels to check all combinations.
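For instance (hypothetical sketch; a real implementation would accumulate VDAF output shares rather than plain counters), an Aggregator could keep one running accumulator per observed label combination:

```rust
use std::collections::{BTreeMap, HashMap};

// Canonical label combination: label key -> value, in sorted key order.
type LabelCombo = BTreeMap<u32, String>;

/// Eagerly aggregate by label combination. A plain u64 counter stands in
/// for a Prio3-style output share accumulator.
fn eager_aggregate(reports: &[(LabelCombo, u64)]) -> HashMap<LabelCombo, u64> {
    let mut accumulators: HashMap<LabelCombo, u64> = HashMap::new();
    for (combo, share) in reports {
        *accumulators.entry(combo.clone()).or_insert(0) += share;
    }
    accumulators
}
```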

> Queries in wrong order can lock Collector out of data
>
> We started with the problem statement of enabling drill-down, which is to say, aggregating over a whole population and then doing another query over a subset of that population. However, due to collection anti-replay requirements, this isn't possible: the query over the entire task would consume the max_batch_query_count for every report, and the subsequent query for some label value would fail.

True, although that's why we have a variable for max_batch_query_count instead of fixing it to 1. Some combination of assumptions about the sensitivity of data, the batch size and application of DP noise might simply make it acceptable to increase that limit.

> Instead, Collectors would have to make queries over the most specific combinations of labels and then combine those query results outside of DAP. For example, take Prio3Count over each of country = [Kenya, Turkey, Germany], and then sum each of those together to get the count over the entire population.
>
> To do better here, we'd have to relax the batch validation requirements. @simon-friedberger suggested that drill-down could be OK so long as the more specific queries yield disjoint sets, each of which is large enough to meet the task's minimum batch size.
>
> For example, suppose we have a task with minimum batch size 100 and a single label that has 10 possible values. Further, suppose we have a time interval containing 10,000 reports, and each value of the label appears on 1,000 of them. Is it acceptable, with respect to privacy, to allow one aggregation over the 10,000 reports, and then 10 more aggregations, one over each of the 10 label values?
>
> Now suppose the task has a second label, also with 10 possible values, and that there are 100 reports with each combination of label values. Would 100 more aggregations, one over each of those combinations, be acceptable?

Logically the answer is yes for this example, but I agree with you that there is a general problem: the UX would be very confusing, since the order of queries determines which queries can be answered.

max_batch_query_count

I think the goal here was to limit usage of a particular report, under the assumption that every time it gets used in an aggregation some information leaks. Is the fact that this limit is per batch instead of per report the result of an optimization, so that the Aggregators never have to track individual reports, or am I missing something?

cjpatton commented 1 year ago

Overall I'm supportive of addressing the problem and I think the proposal is going in the right direction.

One high-level question: how many variants of a label do we expect to see, and how often does the set of labels change? E.g., for "geo-location" I would expect a relatively large number of variants (continents? countries? ASes? ...) but that the set would not change frequently over time; on the other hand, for a "user agent" label I would expect a small number of active variants, but that the set would change with high frequency.

Re labels as fingerprints: I think this is a real concern, at least as concerning as the timestamp. I don't think this is necessarily a deal breaker but it does require privacy considerations. To your question about whether the labels can be moved to MPC: I suspect the answer is "yes", but we have some design work to do. (Details below.)


We want a VDAF for the following aggregation function $F$: for all reports with the same label set, compute some function $f$ of their inputs. I.e.,

$F(\ell, S) = f(\{ m' : (m', \ell') \in S, \ell' = \ell \})$, where $S = ((m_1, \ell_1), \ldots, (m_N, \ell_N))$.

Suppose that $f$ is just the sum and each $m_i$ is equal to $0$ or $1$ (a la Prio3Count). Then this aggregation function can be computed with Poplar1 out-of-the-box. It's also straightforward to extend it to accommodate multiple labels. With a tweaked version of Poplar1, we could also make it so that the same report can be queried multiple times, up to some bounded number.
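To pin down the functionality (but not the MPC), here's a plaintext Rust model of what this aggregation function computes; the VDAF's job would be to compute the same thing without revealing any individual label:

```rust
/// Plaintext model of the aggregation function above: f is the sum of
/// measurements m (each 0 or 1), restricted to reports whose label
/// equals a candidate label. Poplar1 provides this functionality in MPC
/// when each client's input encodes its label and the aggregation
/// parameter is the set of candidate labels.
fn label_counts(reports: &[(u64, &str)], candidates: &[&str]) -> Vec<(String, u64)> {
    candidates
        .iter()
        .map(|&cand| {
            let sum: u64 = reports
                .iter()
                .filter(|(_, label)| *label == cand)
                .map(|(m, _)| m)
                .sum();
            (cand.to_string(), sum)
        })
        .collect()
}
```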

It should also be possible to extend the same basic approach to more sophisticated functions $f$, e.g., sum or histogram, but there is some design work to do. We're working on an idea for this, but it hasn't been fully fleshed out yet and there is no guarantee it'll pan out. DM me if you're interested in the details. (cc/ @jimouris, @hannahdaviscrypto)

simon-friedberger commented 1 year ago

I think we should flesh out this fingerprinting concern.

VDAFs theoretically protect the measurements, not the metadata. They require hiding each measurement among other measurements, which fingerprinting potentially circumvents.

Suppose, for the sake of argument, that we add a lot of metadata to reports, say a "client_id". At that point the Leader can restrict aggregation to values from a single client, which is obviously very bad.

The Leader already sees client IPs when they submit reports, and those are pretty good for fingerprinting. It also knows the submission time, though given IP geolocation that probably doesn't add a lot. It also knows the time at which the client thinks it generated the report, which leaks as well.

Even without any fingerprinting, the Leader can restrict measurements to a single honest report plus min_batch_size - 1 reports that it made up.

I agree that there is reason for concern: tags would have to be applied carefully, which presumably reduces their utility. On the other hand, the Leader currently sees client IPs, which would allow batching reports by region, but the Collector cannot use this.

cjpatton commented 1 year ago

I plan to pursue the "move tags to MPC" thing, in case it ends up being helpful. But overall I don't see a reason to block this; we just need to be careful (as @simon-friedberger points out). Perhaps it would be fruitful to constrain the cardinality of labels somehow? E.g., instead of an arbitrary string, we might require each label to be a 16-bit (or even 8-bit?) integer. This would be less flexible, but perhaps make it easier to reason about fingerprints.

branlwyd commented 1 year ago

I'm not sure restricting the cardinality of labels helps against malicious use, in general: even if the label value is constrained to a single bit, the label could effectively be `is_this_the_user_I_am_targeting_for_deanonymization`.

simon-friedberger commented 1 year ago

While you're correct, the client would have to set that label, and if you can get the client to do that, you can probably get it to do other things to de-anonymize itself.

branlwyd commented 1 year ago

From the POV of a realistic deployment, many users will be unaware of what labels are set by the client software that they use, and especially how those labels are computed; and it would be easy for a malicious entity providing the client software & operating the collector to decide on a much more subtle labeling system to allow deanonymization.

I think labels inherently require trust in the entity providing the client/operating the collector.

cjpatton commented 11 months ago

Hi @simon-friedberger, please take a look at the Mastic VDAF, which attempts to solve part of this problem: https://datatracker.ietf.org/doc/draft-mouris-cfrg-mastic/

wangshan commented 10 months ago

I think we should avoid adding fingerprinting opportunities to the protocol.

> The Leader already sees client IPs when they submit reports, and those are pretty good for fingerprinting. It also knows the submission time, though given IP geolocation that probably doesn't add a lot. It also knows the time at which the client thinks it generated the report, which leaks as well.

These shortcomings can be effectively addressed by other technologies like OHTTP and Privacy Pass; for example, the reportauth extension proposes one solution: https://github.com/cpriebe/draft-priebe-ppm-dap-reportauth/blob/main/draft-priebe-ppm-dap-reportauth.md

If I understand correctly, the proposal is to include multiple labels in the same task so the Collector can filter by them? Alternatively, can we leave that out of MPC and treat it as business logic? In DAP the basic unit is a task, and all privacy guarantees are made on a task's aggregation result (min_batch_size, future DP guarantees, etc.). This requirement can be resolved by the task author designing the tasks to include all the labels. This essentially divides the user population into cohorts, each designated by a label. This way, labelling is more explicit, especially if we address issue https://github.com/ietf-wg-ppm/draft-ietf-ppm-dap/issues/500. If flexibility is a concern, we can use taskprov to make these labels part of a task config and let the client decide whether to opt in to the labelled task or not.
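In other words, something like the following sketch (hypothetical names; with taskprov, the cohort definitions would live in the task config):

```rust
use std::collections::HashMap;

type TaskId = [u8; 32];

/// One DAP task per cohort, fixed when the tasks are provisioned. The
/// client picks the task matching its cohort instead of attaching a
/// label to its report.
struct CohortTasks {
    by_cohort: HashMap<String, TaskId>, // e.g. "country=DE" -> task ID
}

impl CohortTasks {
    /// Returns the task the client should upload to, if its cohort was
    /// provisioned.
    fn task_for(&self, cohort: &str) -> Option<&TaskId> {
        self.by_cohort.get(cohort)
    }
}
```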

cjpatton commented 10 months ago

> > The Leader already sees client IPs when they submit reports, and those are pretty good for fingerprinting. It also knows the submission time, though given IP geolocation that probably doesn't add a lot. It also knows the time at which the client thinks it generated the report, which leaks as well.
>
> These shortcomings can be effectively addressed by other technologies like OHTTP and Privacy Pass; for example, the reportauth extension proposes one solution: https://github.com/cpriebe/draft-priebe-ppm-dap-reportauth/blob/main/draft-priebe-ppm-dap-reportauth.md

Agreed on OHTTP, but how does Privacy Pass help?

> If I understand correctly, the proposal is to include multiple labels in the same task so the Collector can filter by them? Alternatively, can we leave that out of MPC and treat it as business logic? In DAP the basic unit is a task, and all privacy guarantees are made on a task's aggregation result (min_batch_size, future DP guarantees, etc.). This requirement can be resolved by the task author designing the tasks to include all the labels. This essentially divides the user population into cohorts, each designated by a label. This way, labelling is more explicit, especially if we address issue #500. If flexibility is a concern, we can use taskprov to make these labels part of a task config and let the client decide whether to opt in to the labelled task or not.

What's the difference between this proposal and what @tgeoghegan describes at the top of this issue? Taskprov might make this easier (arguably), but it seems basically the same as "spinning up a new task per label".

bmcase commented 10 months ago

I would like to +1 the usefulness of this functionality for histogram aggregation. I'm new to PPM but have been following along to understand how we could use these VDAFs for protocols we're working on in the W3C PATCG. In particular, I think these labeled secret shares are a great fit for publisher reports in Apple's Private Ad Measurement (PAM) proposal.

A couple of thoughts on that: 1) the concern about labels being a fingerprinting vector doesn't apply in the same way in the PAM setting. There, an encrypted report would be sent back from an impression to a publisher, and it is fine if the publisher learns which impression that report corresponds to. The label(s) can be added server-side before submitting to the MPC aggregation service.

2) As for a report ending up in multiple queries because of different labels: this is something we'd like to intentionally support, with a DP budget layer that manages it. In particular, one way this could work is that a report comes with some amount of DP budget; it can then be submitted (only once, with replay protection) to the MPC aggregation and included in multiple histograms, but the budget requested across all those histograms must be less than what the report has available. If there are enough histograms in the output, advanced DP composition becomes useful.
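Here's a sketch of that budget accounting (hypothetical; this isn't part of DAP or any current draft): each report carries an epsilon budget, and each histogram query that includes the report deducts from it:

```rust
/// Hypothetical per-report DP budget tracker. Each report starts with a
/// total epsilon budget; every histogram query that includes the report
/// deducts its epsilon cost, and a query is refused if the remaining
/// budget is insufficient.
struct ReportBudget {
    remaining_epsilon: f64,
}

impl ReportBudget {
    fn try_spend(&mut self, epsilon: f64) -> bool {
        if epsilon <= self.remaining_epsilon {
            self.remaining_epsilon -= epsilon;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut budget = ReportBudget { remaining_epsilon: 1.0 };
    assert!(budget.try_spend(0.4)); // first histogram
    assert!(budget.try_spend(0.4)); // second histogram
    assert!(!budget.try_spend(0.4)); // third is refused: only ~0.2 left
}
```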

cjpatton commented 10 months ago

Hi @bmcase, great to have you.

> 2) As for a report ending up in multiple queries because of different labels: this is something we'd like to intentionally support, with a DP budget layer that manages it. In particular, one way this could work is that a report comes with some amount of DP budget; it can then be submitted (only once, with replay protection) to the MPC aggregation and included in multiple histograms, but the budget requested across all those histograms must be less than what the report has available. If there are enough histograms in the output, advanced DP composition becomes useful.

Actually there are two lines of work on DP right now.

In https://github.com/wangshan/draft-wang-ppm-differential-privacy we're working on fleshing out our considerations for DP and specifying some standard (non-MPC) DP mechanisms to use with DAP.

In the Slack, we've also been discussing generating the noise in MPC (each Aggregator computes a secret share of the noise), but there is no draft for this as of now. @martinthomson has lots of ideas for this :)

bmcase commented 10 months ago

Thanks @cjpatton. It's great to see this draft of DP in PPM! I had just found it and started reviewing; I'll leave some feedback over on that repo. On generating noise in MPC, I've been following the Slack conversation and have also discussed it some with @martinthomson; it's an interesting problem.

cjpatton commented 10 months ago

At IETF 118 we decided to close this issue without changes. Our conclusion was that it should be possible to implement @tgeoghegan's proposal as a report extension.

cjpatton commented 3 months ago

Hey everyone, this thread has been stewing for a while, and given that we're close to finishing DAP, I want to see if there's consensus to close it. A couple of observations:

First, I believe this functionality can be specified as a DAP extension. In particular, the report extension mechanism could be used to assign reports to attribute classes for grouping them during aggregation or collection. We may need an extension point for collection in order to express which attributes the Collector wants to collect by. Is there interest in specifying this?

Second, partitioning reports by client attributes in the clear invariably reduces the anonymity set. (In general there may be fewer clients with a particular attribute than the minimum batch size.) A new VDAF called Mastic allows grouping aggregates by client attributes without reducing the anonymity set.

The functionality you get from Mastic is somewhat restricted, in particular it doesn't support all of the "drill down" queries discussed on this thread. It also has higher communication cost than you'd have with Prio3 + plaintext attributes.

An implementation of Mastic and corresponding I-D are in progress:

  1. https://github.com/divviup/libprio-rs/issues/947
  2. https://github.com/jimouris/draft-mouris-cfrg-mastic

tgeoghegan commented 3 months ago

The fact that Mastic was able to do something useful with attributes without DAP-layer changes proves (or at least suggests) to me that we don't need to do anything here.

I think the remaining use case we're speculating about is Prio3 with labels, but what if, instead of making DAP-layer changes for that, we imagine a Prio3WithLabels VDAF in which (shooting from the hip here) the public share was the encoded labels and the aggregation parameter was the labels the Collector is interested in? Basically, my intuition is that if Mastic made this work, we should be able to imagine a Prio3 variant that can make it work, and once again not need DAP changes.

cjpatton commented 3 months ago

> I think the remaining use case we're speculating about is Prio3 with labels, but what if, instead of making DAP-layer changes for that, we imagine a Prio3WithLabels VDAF in which (shooting from the hip here) the public share was the encoded labels and the aggregation parameter was the labels the Collector is interested in?

Nice observation, I think that would work just fine.
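For what it's worth, here's a rough sketch of the shape @tgeoghegan describes (hypothetical; a real design would also need to handle labels during the VDAF's preparation and aggregation phases):

```rust
/// Hypothetical shape of a "Prio3WithLabels" VDAF, following the sketch
/// above. The public share carries the encoded labels; the aggregation
/// parameter names the labels the Collector wants. Both Aggregators see
/// the public share, so they can consistently decide which reports to
/// include.
struct PublicShare {
    labels: Vec<(u32, String)>, // encoded label key/value pairs
    // ... plus the usual Prio3 public share contents
}

struct AggregationParam {
    wanted: Vec<(u32, String)>, // labels the Collector is interested in
}

/// Returns true iff the report should contribute to this aggregation:
/// every wanted label must appear in the report's public share.
fn matches(public_share: &PublicShare, agg_param: &AggregationParam) -> bool {
    agg_param
        .wanted
        .iter()
        .all(|want| public_share.labels.contains(want))
}
```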

cjpatton commented 3 months ago

I brought this to the list for discussion, and it doesn't seem like there's appetite to modify DAP to support this feature. If you'd like to consider specifying this functionality as a DAP extension and think there needs to be a protocol change, please feel free to re-open.