Determine aggregation skew recovery strategy

branlwyd commented 1 month ago

Due to transient network issues, inopportunely-timed process shutdown, or implementation bugs, it is possible for the Leader & Helper aggregators to have a "skewed" view of the aggregation process. Specifically, the Helper might consider the aggregation process to be one round further along than the Leader does, if the Leader "loses" the response from the Helper.

This is currently addressed by a skew-recovery scheme where, if the Helper receives a request corresponding to the previous round, it replays its previous response. This allows a Leader who has skewed to catch up to the Helper. This was specified in https://github.com/ietf-wg-ppm/draft-ietf-ppm-dap/pull/400, and has been implemented and deployed.
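
To make the replay behavior concrete, here is a minimal sketch of the Helper side. It is not taken from the draft or from any implementation: `HelperJob`, `process_step`, and `handle_continue` are hypothetical names, `process_step` is a stand-in for real VDAF preparation, and comparing a SHA-256 hash of the request before replaying is an assumption of this sketch.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class HelperJob:
    """Hypothetical per-aggregation-job state kept by the Helper."""
    step: int = 0                    # last continuation step processed
    last_request_hash: bytes = b""   # hash of the request for that step
    last_response: bytes = b""       # response sent for that step


def process_step(step: int, request: bytes) -> bytes:
    """Stand-in for VDAF preparation; returns a deterministic dummy response."""
    return b"resp:%d:" % step + hashlib.sha256(request).digest()[:4]


def handle_continue(job: HelperJob, step: int, request: bytes) -> bytes:
    digest = hashlib.sha256(request).digest()
    if step == job.step + 1:
        # Normal case: the Leader is asking for the next step.
        response = process_step(step, request)
        job.step = step
        job.last_request_hash = digest
        job.last_response = response
        return response
    if step == job.step and step > 0:
        # Skew: the Leader lost our previous response and is one step behind.
        # Replay only if the request is identical (assumption of this sketch).
        if digest != job.last_request_hash:
            raise ValueError("replayed request differs from the original")
        return job.last_response
    raise ValueError("step mismatch that cannot be recovered by replay")
```

In this sketch the Helper only ever keeps state for the most recent step, which is why recovery is limited to a skew of exactly one round.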

But maybe this isn't the best recovery mechanism. We have also considered:

Restart-from-beginning recovery: in this scheme, recovering from round skew requires the Leader to restart the aggregation process by re-sending the original request. The Helper stores the required information to allow the Leader to restart an aggregation job (i.e. the hash of the initial request). The Leader is expected to store the initial request until aggregation is complete.
Restart-from-arbitrary-step recovery: in this scheme, the Leader is allowed to re-send any of the aggregation requests, and the Helper will replay the aggregation process from that point. The Helper stores the required information to allow the Leader to replay any aggregation step (i.e. a hash of each received request, and the response to that request). The Leader can recover from failure by replaying any request it can reconstruct.
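
As a rough illustration of the storage that restart-from-arbitrary-step implies for the Helper, the sketch below keeps one (request hash, response) pair per step. `ReplayLog` and its methods are hypothetical names, and the use of SHA-256 is an assumption; neither is specified anywhere in this issue.

```python
import hashlib
from typing import Dict, Optional, Tuple


class ReplayLog:
    """Hypothetical Helper-side log: one (request hash, response) per step."""

    def __init__(self) -> None:
        self.entries: Dict[int, Tuple[bytes, bytes]] = {}

    def record(self, step: int, request: bytes, response: bytes) -> None:
        # Store the hash of the request rather than the request itself.
        self.entries[step] = (hashlib.sha256(request).digest(), response)

    def replay(self, step: int, request: bytes) -> Optional[bytes]:
        """Return the stored response for a re-sent request, if any."""
        entry = self.entries.get(step)
        if entry is None:
            return None  # step never processed; nothing to replay
        request_hash, response = entry
        if hashlib.sha256(request).digest() != request_hash:
            raise ValueError("re-sent request differs from the original")
        return response

    def stored_bytes(self) -> int:
        # Grows linearly with the number of steps, which is the cost at issue.
        return sum(len(h) + len(r) for h, r in self.entries.values())
```

Restart-from-beginning would, by comparison, need only the hash of the initial request, which is where the simplicity argument comes from.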

Restart-from-beginning may be better because it is simpler for the Helper to implement, and in practice all of the VDAFs we have considered so far are one-round. Restart-from-arbitrary-step is more flexible than any of the other considered options, but may require storage of too much information.

We should decide the best aggregation skew-recovery scheme, and specify it.

branlwyd commented 1 month ago

During an off-list discussion around https://github.com/ietf-wg-ppm/draft-ietf-ppm-dap/pull/616#discussion_r1807087612 today, we discovered that things might be more complicated than previously believed.

Specifically, when performing VDAF preparation, if either aggregator is able to "replay" the preparation interaction for a given report while changing some of the messages it sends to its peer, then observing the peer's differing responses may let it break the privacy properties of the VDAF. (cc @cjpatton / @divergentdave in case I have misstated things.)

This means that any recovery scheme that allows either aggregator to change any of its messages on replay, and then observe differing behavior from its peer aggregator, has a problem.

Specifically:

Restart-from-beginning (#616): a malicious Leader could modify any message after the initialization message, and thus >1 round VDAFs may not be safe. A malicious Helper could do the same, if the Leader decided to retry. To make this safe, I think we would need both the Leader and the Helper to store request/response hashes for the entirety of the interaction that they observed.
Restart-from-any-step (#569): a malicious Leader can't modify any re-sent message, as the Helper is specified as checking that they are identical. A malicious Helper could modify messages -- the Leader has no requirement to check that the responses are the same. I think we would need to specify that the Leader store hashes of the Helper's responses for all successfully-processed steps to ensure that there is no problem here.
Restart-from-previous-step (currently specified): I don't believe we need to make changes to this approach. The Leader might re-send its previous request to retry, and the Helper might choose to respond differently than it did the first time; but since the Leader did not act on the original response message, the Helper cannot observe a difference of behavior.
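
The Leader-side check suggested for restart-from-any-step could look roughly like the following. This is only a sketch of the idea: `LeaderResponseLog` and `accept` are hypothetical names, and hashing responses with SHA-256 is an assumption.

```python
import hashlib
from typing import Dict


class LeaderResponseLog:
    """Hypothetical Leader-side record of Helper responses it has acted on."""

    def __init__(self) -> None:
        # step -> SHA-256 of the Helper response processed for that step
        self.seen: Dict[int, bytes] = {}

    def accept(self, step: int, response: bytes) -> bool:
        """Return True if the response is new for this step or matches the
        one seen before; False if the Helper changed its answer on replay."""
        digest = hashlib.sha256(response).digest()
        previous = self.seen.setdefault(step, digest)
        return previous == digest
```

A Leader using this would abort the aggregation job when `accept` returns False, so a Helper that answers a replayed step differently never gets to observe how the Leader reacts to the changed message.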

In the specific case of 1-step VDAFs, we do not need to make any changes. All three approaches specify that the Helper must verify that it receives the same first message, so the Leader can't modify the messages it is sending to the Helper; the Helper may be able to modify its response, but since the VDAF is 1-step it will never receive another message from the Leader and thus will not be able to observe a difference in behavior from the Leader. That is, this problem in practice applies only to retry for multi-step (i.e. multi-round) VDAFs.

cjpatton commented 1 month ago

2024/10/24 interim: We're uncomfortable changing the skew recovery, as we may end up introducing rewind-then-fork attacks on VDAF preparation as described here. We'll give this a little more time to discuss, but if there are no new ideas by 2024/10/31, then we'll close this issue.

ietf-wg-ppm / draft-ietf-ppm-dap

Determine aggregation skew recovery strategy #604