ietf-wg-ppm / draft-ietf-ppm-dap

This document describes the Distributed Aggregation Protocol (DAP) being developed by the PPM working group at IETF.

Determine aggregation skew recovery strategy #604

branlwyd closed 1 month ago

branlwyd commented 1 month ago

Due to transient network issues, inopportunely-timed process shutdown, or implementation bugs, it is possible for the Leader & Helper aggregators to have a "skewed" view of the aggregation process. Specifically, the Helper might consider the aggregation process to be one round further along than the Leader does, if the Leader "loses" the response from the Helper.

This is currently addressed by a skew-recovery scheme where, if the Helper receives a request corresponding to the previous round, it replays its previous response. This allows a Leader who has skewed to catch up to the Helper. This was specified in https://github.com/ietf-wg-ppm/draft-ietf-ppm-dap/pull/400, and has been implemented and deployed.
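To make the replay behavior concrete, here is a minimal sketch of the Helper side. It is not taken from the draft or from any deployed implementation: `HelperJobState`, `handle_request`, `StepMismatch`, and the use of a SHA-256 digest to compare requests are all hypothetical names and choices made for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Optional


class StepMismatch(Exception):
    """The request is neither the next step nor an exact replay of the last one."""


@dataclass
class HelperJobState:
    # Hypothetical per-job state; a real Helper would persist this.
    step: int = 0                              # last step the Helper completed
    last_request_hash: Optional[bytes] = None  # digest of the request for that step
    last_response: Optional[bytes] = None      # response sent for that step


def handle_request(
    state: HelperJobState,
    req_step: int,
    req_body: bytes,
    process: Callable[[bytes], bytes],
) -> bytes:
    digest = hashlib.sha256(req_body).digest()

    if req_step == state.step + 1:
        # Normal case: the Leader is asking for the next step.
        response = process(req_body)
        state.step = req_step
        state.last_request_hash = digest
        state.last_response = response
        return response

    if req_step == state.step and digest == state.last_request_hash:
        # Skew recovery: the Leader lost our previous response and is
        # retrying the same request, so replay the stored response
        # without processing anything again.
        return state.last_response

    # Anything else (older step, future step, or a modified retry) is rejected.
    raise StepMismatch(f"helper at step {state.step}, got request for step {req_step}")
```

In this sketch the Helper only ever keeps the most recent request digest and response, which is why it can recover from a skew of exactly one round and nothing more.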

But maybe this isn't the best recovery mechanism. We have also considered:

Restart-from-beginning may be better because it is simpler for the Helper to implement, and in practice all of the VDAFs we have considered so far are one-round. Restart-from-arbitrary-step is more flexible than any of the other considered options, but may require storage of too much information.

We should decide the best aggregation skew-recovery scheme, and specify it.

branlwyd commented 1 month ago

During an off-list discussion around https://github.com/ietf-wg-ppm/draft-ietf-ppm-dap/pull/616#discussion_r1807087612 today, we discovered that things might be more complicated than previously believed.

Specifically, when performing VDAF preparation, if either aggregator is able to "replay" the preparation interaction for a given report while changing some of the messages it sends to its peer, then observing the peer's differing responses may let it break the privacy properties of the VDAF. (cc @cjpatton / @divergentdave in case I have misstated things.)

This means that a replay-protection scheme that allows either aggregator to change any of the messages on replay, and observe differing behavior from their peer aggregator, has a problem.
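As an intuition pump only, here is a toy model of why rewinding a peer and feeding it a different message can leak. This is not VDAF preparation and does not correspond to any real VDAF's messages: `ToyPeer`, its masked linear response, and the modulus `P` are all invented for illustration. The peer's single response is perfectly masked, but two responses to different challenges under the same per-report randomness reveal its secret.

```python
import secrets

P = 2**61 - 1  # a prime modulus, chosen arbitrarily for this toy


class ToyPeer:
    """Hypothetical peer that answers a challenge using a secret and a mask.

    The mask is drawn once per report, so a rewound round reuses it.
    """

    def __init__(self, secret: int) -> None:
        self._secret = secret % P
        self._mask = secrets.randbelow(P)

    def respond(self, challenge: int) -> int:
        # One response alone is uniformly distributed and hides the secret.
        return (self._mask + challenge * self._secret) % P


def recover_secret(c1: int, y1: int, c2: int, y2: int) -> int:
    # y1 - y2 = (c1 - c2) * secret (mod P), so divide by (c1 - c2).
    inv = pow((c1 - c2) % P, -1, P)
    return ((y1 - y2) * inv) % P


peer = ToyPeer(secret=123456789)
y1 = peer.respond(5)   # honest run
y2 = peer.respond(9)   # "rewound" run with a different message
assert recover_secret(5, y1, 9, y2) == 123456789
```

The point of the toy is just that a property which holds for one execution per report may not survive an adversary who gets to run the same step twice with different inputs.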

Specifically:

In the specific case of 1-step VDAFs, we do not need to make any changes. All three approaches specify that the Helper must verify that it receives the same first message, so the Leader can't modify the messages it sends to the Helper. The Helper may be able to modify its response, but since the VDAF is 1-step, it will never receive another message from the Leader and thus cannot observe a difference in the Leader's behavior. That is, in practice this problem applies only to retries for multi-step (i.e. multi-round) VDAFs.

cjpatton commented 1 month ago

2024/10/24 interim: We're uncomfortable changing the skew recovery, as we may end up introducing rewind-then-fork attacks on VDAF preparation as described here. We'll give this a little more time to discuss, but if there are no new ideas by 2024/10/31, then we'll close this issue.