ietf-wg-ppm / draft-ietf-ppm-dap

This document describes the Distributed Aggregation Protocol (DAP) being developed by the PPM working group at IETF.

Requirements for Collect (or, Revisiting the Pipeline Model) #51

Closed: cjpatton closed this issue 3 years ago

cjpatton commented 3 years ago

At the moment, the spec thinks of a PA task as a one-way data-processing pipeline with three phases: upload, in which clients send inputs to the leader; verify, in which the leader and helper verify each input; and collect, in which the leader and helper aggregate inputs and push the output to the collector. The upload and verify phases are currently specified; our next task is to specify the collect phase.

This pipeline model is a perfectly fine way of thinking about Prio. (Although there are ways in which one might use Prio metrics that don't quite fit into this model; see #27.) However, HH doesn't fit into this model at all (see #44). It's apparently necessary to re-think the communication model for PA protocols. @chris-wood and I were kicking this around and wanted to share some thoughts.

First off, let's try to enumerate the relevant requirements for the PA framework.

  1. The helper is stateless.
  2. Aggregators (i.e., the leader and helper) enforce the batch size (#15): as long as one of the aggregators is honest, the collector should be unable to see outputs aggregated over a small number of inputs.
  3. Aggregators should never learn more information about outputs than the collector. Moreover, it should be possible to design a PA protocol in which neither aggregator sees the output in the clear. (This is possible for Prio. In HH, the aggregators inevitably learn the set of candidate prefixes at each round, which leaks some information about the output.)
  4. Implementing HH requires O(n) space for the leader, where n is the number of inputs. However, it should be possible for the leader to implement Prio with O(1) space.
  5. The aggregators verify and aggregate inputs over multiple rounds. In each round, the collector specifies the parameters that will be used. (In HH, the parameters are the set of candidate prefixes.)
  6. [Needs discussion] Not every PA protocol will verify inputs (#45).

We can't satisfy all of these requirements in the upload->verify->collect model. Instead, we might think of a PA task as two concurrent processes running asynchronously: an upload process, in which clients send their reports to the leader, and a collect process, in which the collector pulls the aggregated output from the aggregators.

In this communication model, the collector may attempt to pull the output for a PA task whenever it likes. A pull will only be successful if the aggregators have verified and aggregated enough inputs.

If/when inputs are verified is up to the PA protocol, but verification must occur before the leader can respond to a collect request. (In Prio, reports can be verified as soon as they're uploaded; in HH, verification can only occur once all the reports have been uploaded and the collector has issued a collect request.)

The helper will keep a long-term symmetric key for encrypting the state that's managed by the collector. This ensures privacy while allowing the helper to be stateless.
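To make the stateless-helper idea concrete, here's a rough sketch of how the helper might seal its per-task state so another party can carry it between requests. The function names and state layout are made up for illustration, not something the spec prescribes:

```python
# Sketch only: a stateless helper seals its per-task state with a long-term
# symmetric key so that another party can carry the blob between requests.
import os
import json
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

LONG_TERM_KEY = AESGCM.generate_key(bit_length=128)  # helper's long-term key

def seal_state(state: dict, task_id: bytes) -> bytes:
    nonce = os.urandom(12)
    plaintext = json.dumps(state).encode()
    # Bind the ciphertext to the task so state can't be replayed across tasks.
    return nonce + AESGCM(LONG_TERM_KEY).encrypt(nonce, plaintext, task_id)

def open_state(blob: bytes, task_id: bytes) -> dict:
    nonce, ct = blob[:12], blob[12:]
    return json.loads(AESGCM(LONG_TERM_KEY).decrypt(nonce, ct, task_id))
```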

At the end of aggregation, each of the aggregators holds a share of the output. Before responding to a collect request, the leader requests the helper's output share. To ensure that the leader doesn't see the output in the clear, the helper encrypts its share under the collector's HPKE public key.
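Putting the last few paragraphs together, the leader-side handling of a collect request might look roughly like this. `enough_reports`, `fetch_helper_share`, and `hpke_seal` are hypothetical placeholders, not names from the spec:

```python
# Rough sketch of the collect flow described above; all helper functions are
# hypothetical placeholders.

def handle_collect_request(task, collector_hpke_pk):
    # Requirement 2: refuse to release an aggregate over too few inputs.
    if not enough_reports(task):
        return "error: batch too small"

    # The helper encrypts its output share under the collector's HPKE public
    # key, so the leader never sees the output in the clear (requirement 3).
    encrypted_helper_share = fetch_helper_share(task, collector_hpke_pk)

    # The leader does the same with its own share and returns both.
    encrypted_leader_share = hpke_seal(collector_hpke_pk, task.leader_output_share)
    return encrypted_leader_share, encrypted_helper_share
```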

Open questions

  1. Should collect requests be idempotent?
  2. A collect request may take a long time. (For HH, the latency may be several minutes!) Is HTTP the right application protocol?
  3. How does the helper learn the collector's HPKE public key? The collector needs to prove ownership of the corresponding secret key in some way.
acmiyaguchi commented 3 years ago

Perhaps a slight adjustment to upload-verify-collect would be to split collect into aggregate and publish/reconstruct, like the v1/v2 protocols. It would avoid overloading the term collect, e.g.:

Prio: upload -> verify -> aggregate -> publish
HH:   upload -> (verify -> aggregate -> verify -> ... -> aggregate) -> publish

I've been thinking about the protocol through the lens of a map-reduce pipeline with network storage (e.g. Spark and S3), where computation is represented as a DAG. It looks like the HH computation would fit neatly into a batched pipeline, given a fixed number of verify-aggregate rounds.
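(For what it's worth, here's a toy version of that batched pipeline with a fixed number of verify-aggregate rounds; `expand`, `verify_share`, and `aggregate_share` are stand-ins for the actual protocol messages.)

```python
# Toy illustration only: HH as a fixed number of verify-aggregate rounds,
# where each round maps over the whole batch of reports.
def heavy_hitters(reports, num_rounds, threshold):
    candidates = [""]                            # start from the empty prefix
    for _ in range(num_rounds):
        candidates = expand(candidates)          # append 0/1 to each prefix
        counts = {p: 0 for p in candidates}
        for report in reports:                   # the "map" over the batch
            if verify_share(report, candidates):
                aggregate_share(report, counts)
        candidates = [p for p in candidates if counts[p] >= threshold]
    return candidates                            # the heavy prefixes
```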

Should collect requests be idempotent?

What is the benefit of idempotency in this context? Is it so that the results of earlier rounds can be used as inputs to later rounds to improve performance?

A collect request may take a long time. (For HH, the latency may be several minutes!) Is HTTP the right application protocol?

One way to deal with the long latency is to use a task queue (e.g. RabbitMQ) to wait for responses. The collector would then poll the task queue for completion before moving on to the next step (pushing the results to the leader for another round?).
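Something like this, with a local `queue.Queue` standing in for a real broker like RabbitMQ (the message shape and `run_long_aggregation` are made up):

```python
# Stand-in sketch: a local queue.Queue in place of a broker such as RabbitMQ.
import queue
import threading

completion_queue = queue.Queue()

def leader_worker(collect_request):
    result = run_long_aggregation(collect_request)   # the slow part
    completion_queue.put({"request_id": collect_request["id"], "result": result})

def collector_wait(collect_request):
    threading.Thread(target=leader_worker, args=(collect_request,)).start()
    msg = completion_queue.get()     # blocks until the round finishes
    return msg["result"]             # feed into the next round
```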

How does the helper learn the collector's HPKE public key? The collector needs to prove ownership of the corresponding secret key in some way.

Is this not solved as part of the upload phase? Or does the upload phase only authenticate the leader with the helpers?

cjpatton commented 3 years ago

Perhaps a slight adjustment to upload-verify-collect would be to split collect into aggregate and publish/reconstruct, like the v1/v2 protocols. It would avoid overloading the term collect, e.g.:

Prio: upload -> verify -> aggregate -> publish
HH:   upload -> (verify -> aggregate -> verify -> ... -> aggregate) -> publish

I've been thinking about the protocol through the lens of a map-reduce pipeline with network storage (e.g. Spark and S3), where computation is represented as a DAG. It looks like the HH computation would fit neatly into a batched pipeline, given a fixed number of verify-aggregate rounds.

This seems reasonable to me. The one downside I see is that the number of verify-aggregate rounds may not be fixed. For example, the collector might want to start with a set of candidate prefixes. By doing so, it can reduce the number of rounds significantly in applications where the search space can be constrained in some way. (I'd love to hear @csharrison's thoughts on this.)

Can you be more specific on how this fits into a map-reduce pipeline? My guess is that you'd have a map-reduce for each round. It seems to me like this is still possible.

Should collect requests be idempotent?

What is the benefit of idempotency in this context? Is it so that the results of earlier rounds can be used as inputs to later rounds to improve performance?

This question is about usability, not performance. Imagine the collector is some person on their laptop, interacting with the system. Should the protocol ensure that they can re-run a query multiple times and get back the same result? Or is this up to the implementation?

How does the helper learn the collector's HPKE public key? The collector needs to prove ownership of the corresponding secret key in some way.

Is this not solved as part of the upload phase? Or does the upload phase only authenticate the leader with the helpers?

Sorry, this is a bit confusing. The upload phase involves the helper's public key (used by the clients); the collect phase involves the collector's public key (used by the helper).

acmiyaguchi commented 3 years ago

Can you be more specific on how this fits into a map-reduce pipeline? My guess is that you'd have a map-reduce for each round. It seems to me like this is still possible.

Each verification/aggregation step would just map over the data, joining appropriately with the outputs of the previous step. The reduce step is done at the very end (with the collect phase). Iterative algorithms work fine with map-reduce (e.g. power iteration with thresholds for eigenvector computation). It's trickier when the workflow involves multiple parties because you have to persist intermediate results to a shared location, but it's certainly doable.
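Roughly, each round would read the previous round's output from shared storage and write the next one; `get_object`/`put_object` are hypothetical wrappers around an S3-style store:

```python
# Illustrative only: rounds communicate via a shared object store. Round 0 is
# assumed to be the uploaded report batch.
def run_round(task_id, round_num, map_fn):
    prev = get_object(f"{task_id}/round-{round_num - 1}.json")
    result = map_fn(prev)                        # the per-round "map"
    put_object(f"{task_id}/round-{round_num}.json", result)
    return result

def run_pipeline(task_id, rounds, map_fn, reduce_fn):
    result = None
    for r in range(1, rounds + 1):
        result = run_round(task_id, r, map_fn)
    return reduce_fn(result)                     # the final reduce / collect
```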

Coordinating work is the challenge; a task queue or something similar in the collector/leader would abstract away some of those details.

Should the protocol ensure that they can re-run a query multiple times and get back the same result? Or is this up to the implementation?

It would be difficult to guarantee idempotency if differential privacy were applied server-side, something which has been mentioned before.

chris-wood commented 3 years ago

I've been thinking about the protocol through the lens of a map-reduce pipeline with network storage (e.g. Spark and S3), where computation is represented as a DAG. It looks like the HH computation would fit neatly into a batched pipeline, given a fixed number of verify-aggregate rounds.

@acmiyaguchi I think the difference being proposed in this issue is that, rather than having the servers drive aggregation (by running some map and reduce functions), the collector drives aggregation. This seems to have the benefit of turning aggregators into very dumb entities that hold reports and compute functions over them. (One could still build a map-reduce pipeline that sends aggregate data to some sink this way, by having the collector drive aggregation and write the result to a sink.)

acmiyaguchi commented 3 years ago

At a high level, there's still a comparable amount of coordination and delegation. Scheduling is nontrivial ("A collect request may take a long time") and will probably involve some task management in the collector.

For reference, here's the cluster management model of Spark. I don't think it's too far-fetched to imagine the collector as a driver program and stateless cloud functions as executors.

(Diagram: Spark cluster overview, https://spark.apache.org/docs/latest/cluster-overview.html)

This is (mostly?) orthogonal to the data flow e.g. upload -> verify -> aggregate -> publish.

chris-wood commented 3 years ago

At a high level, there's still a comparable amount of coordination and delegation. Scheduling is nontrivial ("A collect request may take a long time") and will probably involve some task management in the collector.

Yep, they seem to match pretty nicely!

This is (mostly?) orthogonal to the data flow e.g. upload -> verify -> aggregate -> publish.

I need to think about this more, but in my mental model the protocol looks very different if, say, the collector drives things instead of being a dumb consumer of data output from the aggregators, even though the flow of data is mostly the same.

ekr commented 3 years ago

A few thoughts here.

As a preliminary, I want to talk about the commercial relationships rather than the network relationships. First, let's call the "Customer" the entity which is paying for the service and presumably (1) is deploying the clients and (2) is interested in the data. So, in the case where we are doing Fx telemetry, Mozilla would be the Customer. The Customer contracts with 1 or more Operators who collectively run the service.

In the simplest conceptual topology, there are two Operators, call them A and B, each of whom runs an Aggregator. One could also have a situation in which the Customer contracts with one Operator and then runs the second Aggregator themselves, but that seems like a case that is likely to be less widely used. In the former scenario,

With that as background:

  1. I think this is going to work best if the PA system presents as a single service to the common Customer, at least technically. By this I mean that once it's set up, the Customer/Collector should only have to talk to one system (presumably operated by the same entity that operates the Leader). The Leader would in turn talk to the other aggregators. This does not preclude concealing the results from the Leader, but merely requires that the aggregated shares be encrypted so that the Leader cannot see them.

  2. We have been a bit vague about what the Leader/Collector interface is, but I think it's most reasonable to think of it as a standard Web service API. This of course means that if you just want a dashboard a la Amplitude, someone will have to implement that, but the most flexible way to do so is as a system that talks to the Leader over that API. It's been noted that there are times when things are long-running, but there are well-established mechanisms for Web service APIs to handle this kind of thing, such as the Collector having a notification endpoint or polling (see the sketch after this list).

  3. There are multiple kinds of state to be stored. My expectation is that the Leader operates effectively as a query oracle, but that the Collector is responsible for the direction and order of queries.

    • The Leader will be responsible for storing the reports and whatever memoized information is necessary between queries (e.g., suppose for query A you need to verify X, Y, Z and then for B you need to verify X, Y, W; the Leader could memoize X and Y). Some of this can be soft state that the Leader recomputes as needed.

    • The Collector is responsible for storing its own context and for knowing what's already been asked and what is to be asked next.

    • The Helpers are stateless.

    Note that this doesn't address the question of any limits on what queries can be made (e.g., for DP). Those have to be enforced at the Leader.
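Here's the kind of polling I have in mind for point 2, just as an illustration; the endpoints and the 202/Retry-After convention are one common Web API pattern, not something the draft specifies:

```python
# Hypothetical Collector-side polling loop for a long-running collect job.
import time
import requests

def collect(leader_url, query):
    resp = requests.post(f"{leader_url}/collect", json=query)
    resp.raise_for_status()
    job_url = resp.headers["Location"]            # job resource created by the Leader

    while True:
        poll = requests.get(job_url)
        if poll.status_code == 200:               # aggregate shares are ready
            return poll.json()
        if poll.status_code == 202:               # still running; back off and retry
            time.sleep(int(poll.headers.get("Retry-After", 5)))
        else:
            poll.raise_for_status()
```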

csharrison commented 3 years ago

Two quick comments:

This seems reasonable to me. The one downside I see is that the number of verify-aggregate rounds may not be fixed. For example, the collector might want to start with a set of candidate prefixes. By doing so, it can reduce the number of rounds significantly in applications where the search space can be constrained in some way. (I'd love to hear @csharrison's thoughts on this.)

Yes, I think we want the computational model to be robust enough to handle a dynamic number of rounds, although maybe it is sufficient to just have the "stopping point" of the protocol be dynamic, so we can fail early when there is no point in expanding the domain further.
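Concretely, something like the following, where the loop stops as soon as no candidate remains heavy rather than running a fixed number of rounds (names are placeholders):

```python
# Sketch of a dynamic stopping point for HH: run_round_fn() expands and
# aggregates one prefix length; we stop early once nothing is heavy.
def heavy_hitters_dynamic(reports, max_len, threshold, run_round_fn):
    candidates = [""]
    for _ in range(max_len):
        candidates = run_round_fn(reports, candidates, threshold)
        if not candidates:        # no prefix is heavy: no point expanding further
            break
    return candidates
```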

To ekr's point on helpers being stateless and

Note that this doesn't address the question of any limits on what queries can be made (e.g., for DP). Those have to be enforced at the Leader

I wanted to point out that this choice affects the security guarantees of DP: we wouldn't be ensuring DP for repeated queries / data-stuffing within the 2-PC model, but rather relying on a single actor (the leader). Within the MPC system we'd only be guaranteeing per-query DP.
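For illustration, "enforced at the Leader" could be as simple as a per-task epsilon budget like the sketch below; note it relies entirely on the Leader behaving honestly, which is exactly the limitation above:

```python
# Hypothetical per-task privacy budget tracked by the Leader alone. Nothing in
# the two-party computation enforces this; it only bounds queries the Leader
# chooses to count.
budgets = {}  # task_id -> remaining epsilon

def authorize_query(task_id, query_epsilon, total_epsilon=1.0):
    remaining = budgets.setdefault(task_id, total_epsilon)
    if query_epsilon > remaining:
        return False                      # budget exhausted: refuse the query
    budgets[task_id] = remaining - query_epsilon
    return True
```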

cc @hostirosti who is starting to look at these problems on our side as well.

cjpatton commented 3 years ago

Uh, not sure how that got closed! Re-opening.

cjpatton commented 3 years ago

Hi folks, PR #59 tries to specify data collection in a way that satisfies the requirements enumerated here. Also, I believe this closes #44.

cjpatton commented 3 years ago

As discussed on the 2021/7/14 call, all that's left for this issue is to enumerate the design constraints discussed here in the spec. (@chris-wood will file a PR.)

cjpatton commented 3 years ago

#93 enumerates the design constraints considered here. It's time to close out this sucker.