For my part, I will concede that, if the attacker colludes with the collector, then there is no hope of defeating Sybil attacks without doing more than what we're doing. Further, I tend to agree with @csharrison and @hostirosti's assessment that client authentication is the best mitigation. However, I don't think we want to require client authentication, do we? There are cheaper (if less complete) ways to mitigate these attacks. In any case, they're out-of-scope for the current model. (We might also consider changing the model.)
> However, I don't think we want to require client authentication, do we?
Yeah, requiring client authentication seems like a non-starter for a lot of use cases.
I agree that we should not require client authentication because it sets a high bar for what kind of client can participate in a private data aggregation system. Additionally, I think that we would have a hard time usefully specifying how client authentication should work because the means of injecting identities into clients and establishing trust between clients and aggregation servers will vary widely from deployment to deployment.
However, it's not hard to imagine deployments where client auth is viable and valuable so we should make sure that it's possible for deployments to use an authenticating proxy that can defend against Sybil attacks for additional assurances beyond what this protocol offers. The idea is that the real clients (i.e., app installations or individual garage door openers) can authenticate to a batching client server using some bespoke authentication protocol, and then the batching client can relay a batch of reports to a PDA leader. Since we're already talking about allowing batching clients (#64), I think we're on our way.
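To make the batching-client shape concrete, here's a minimal sketch in Go of what such a proxy might look like. Everything in it is assumed for illustration: the `/report` path, the leader upload URL, the bearer-token `authenticate` check (standing in for whatever bespoke protocol a deployment uses), and the fixed batch size; none of this is specified by PDA.

```go
// Sketch of an authenticating batching proxy. The endpoint paths, the
// bearer-token check, and the batch size are all illustrative; a real
// deployment would substitute its own client-authentication mechanism
// and the leader's actual upload interface.
package main

import (
	"bytes"
	"io"
	"log"
	"net/http"
	"sync"
)

const (
	leaderUploadURL = "https://leader.example/upload" // hypothetical leader endpoint
	batchSize       = 100
)

var (
	mu      sync.Mutex
	pending [][]byte
)

// authenticate stands in for whatever bespoke client authentication the
// deployment uses (client certificates, signed device attestations, etc.).
func authenticate(r *http.Request) bool {
	return r.Header.Get("Authorization") == "Bearer example-client-token"
}

func handleReport(w http.ResponseWriter, r *http.Request) {
	if !authenticate(r) {
		http.Error(w, "unauthorized", http.StatusUnauthorized)
		return
	}
	report, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}

	mu.Lock()
	pending = append(pending, report)
	flush := len(pending) >= batchSize
	var batch [][]byte
	if flush {
		batch, pending = pending, nil
	}
	mu.Unlock()

	if flush {
		go relay(batch) // forward a full batch to the leader
	}
	w.WriteHeader(http.StatusOK)
}

// relay forwards each buffered report to the leader. The reports themselves
// are still protected end-to-end for the aggregators; the proxy only vouches
// that they came from authenticated clients.
func relay(batch [][]byte) {
	for _, report := range batch {
		resp, err := http.Post(leaderUploadURL, "application/octet-stream", bytes.NewReader(report))
		if err != nil {
			log.Printf("relay failed: %v", err)
			continue
		}
		resp.Body.Close()
	}
}

func main() {
	http.HandleFunc("/report", handleReport)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

The point of this shape is that the proxy learns nothing a network intermediary wouldn't: it only vouches for who sent each report before passing it along.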
To enable that, I think we need a way for a deployment of PDA to restrict who can submit reports. In a batching client setup, if an attacker can obtain the PDAParams (which we don't consider confidential; some deployments may wish to make them widely available for transparency, and reverse engineers will be able to extract them from clients anyway), then they can submit reports directly to the leader and defeat the batching client's Sybil attack protection.
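One way a deployment might enforce that restriction (again purely illustrative, not something the protocol specifies) is for the leader to accept uploads only from senders presenting a credential it has issued to the batching client, for example a shared token or a client certificate checked before any report processing:

```go
// Illustrative leader-side check: reject uploads that don't carry a
// credential issued to the batching client. A deployment might instead use
// mutual TLS or signed tokens; the point is only that possession of the
// PDAParams alone should not be enough to get a report accepted.
package main

import (
	"log"
	"net/http"
)

func authorizedUploader(r *http.Request) bool {
	return r.Header.Get("Authorization") == "Bearer batching-client-token" // placeholder credential
}

func handleUpload(w http.ResponseWriter, r *http.Request) {
	if !authorizedUploader(r) {
		http.Error(w, "unauthorized", http.StatusUnauthorized)
		return
	}
	// ... verify and store the report as usual ...
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/upload", handleUpload)
	log.Fatal(http.ListenAndServe(":8443", nil))
}
```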
So maybe I've come back around to convincing myself that we do want client authentication, or at least enough of it to enable deployments to implement batching clients?
I think we should also evaluate what the state of the art in the space of metrics collection is. What do existing, non-MPC telemetry systems do to protect against Sybil attacks?
> I think we should also evaluate what the state of the art in the space of metrics collection is. What do existing, non-MPC telemetry systems do to protect against Sybil attacks?
Such defenses aren't necessary, since normally the collector sees all users' data in the clear. They only matter in the context of anonymous communication.
As of the 2021/07/21 design call, I think this is where we are: Sybil attacks are out-of-scope for the core protocol, but we ought to account for deployments that can afford some sort of client attestation mechanism to mitigate them.
This is a follow up to a discussion we had during the 2021/7/14 design call. We were concerned about a handful of attacks that may require some fundamental design changes to address properly. To focus this discussion, it was decided that we would revisit the security model and consider how those attacks fit in. I've taken the liberty of recalling the model and filling in some gaps where necessary. Please feel free to correct anything that is incorrect or unclear --- we can edit the description as we go. The goal is to get a common understanding of which attacks are in scope and which aren't.
cc/ @ekr, @csharrison
Security model (for privacy)
Execution model. Our execution model has two phases: a trusted setup phase followed by an attack phase.
[UPDATE 2021/7/26] An important difference between our execution model and that of Prio [BC17] and Hits [BBC+21] is that their network is synchronous (the adversary doesn't control transmission of messages from honest parties) and they assume each connection is ideally authenticated. We're assuming neither, which means our attacker is significantly stronger.
Privacy goal. Currently the doc describes the following, informal security goal: As long as one of the aggregators is honest, the adversary learns nothing about the inputs of honest clients except what it can infer from the output of the protocol. This thinks of the attacker as colluding with the collector.
The details vary slightly, but the Prio [BC17] and Hits [BBC+21] papers formalize this security goal in roughly the following way (see [BC17, Appendix A]): The "view" of the adversary is defined to be the set of messages exchanged during the trusted setup and attack phases, as well as any assets belonging to corrupted parties. A PDA protocol is "private" if the view of every reasonably efficient (i.e., PPT) adversary can be efficiently simulated, given the output of the aggregation function computed over the honest inputs. More precisely, for all inputs `x_1, ..., x_N` and every PPT adversary that corrupts all but one aggregator and all but `N` clients, there exists a PPT simulator that, on input of `f(x_1, ..., x_N)`, outputs a string that is computationally indistinguishable from the adversary's view.
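Spelled out in symbols (my paraphrase of the [BC17, Appendix A]-style definition, not text from either paper), the condition is roughly:

```latex
% Simulation-based privacy, roughly following [BC17, Appendix A].
% A is the adversary, S the simulator, View_A is the messages A sees plus the
% corrupted parties' assets, and f is the aggregation function over honest inputs.
\forall x_1, \ldots, x_N \;\; \forall \text{PPT } \mathcal{A} \;\;
\exists \text{PPT } \mathcal{S} :\quad
\mathsf{View}_{\mathcal{A}}(x_1, \ldots, x_N)
\;\approx_c\;
\mathcal{S}\big(f(x_1, \ldots, x_N)\big)
```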
Current attacks

The attacks we discussed are:
- #82 A malicious leader can replay a report across two batches (this may be mitigated already).
- #81 A malicious leader can replay a report within a batch.
- #20 A network attacker can try a Sybil attack ("stuff" the batch with `n-1` bogus reports).

(As a reminder, these kinds of attacks were anticipated in the original Prio paper. I would suggest folks go back and read [BC17, Section 7]. I found it to be a helpful refresher.)
What's notable about the formalism above is that it concedes Sybil attacks: An attacker can learn a client's input in full, but this would not be deemed an attack by the model. (In other words, the expression "except what it can infer from the output of the protocol" is doing a lot of heavy lifting! See [BC17, Section 7].) On the other hand, a malicious aggregator attempting to replay an honestly generated report would be considered an attack in our model.
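To see why that caveat concedes Sybil attacks, consider a sum aggregate where the adversary controls clients 2 through `N` and has them submit zeros. The honest client's input is then fully determined by the protocol's output, so a simulator that sees only the output can reproduce the attacker's view:

```latex
% Sybil attack against a sum: N-1 adversarial clients submit zeros, so the
% aggregate equals the lone honest client's input, and the model treats this
% as "what the adversary can infer from the output of the protocol".
f(x_1, 0, \ldots, 0) \;=\; x_1 + 0 + \cdots + 0 \;=\; x_1
```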
[UPDATE 2021/7/26] I don't think the replay attacks are captured by existing formal definitions [BC17, BBC+21].
[BC17] Boneh and Corrigan-Gibbs. "Prio: Private, Robust, and Scalable Computation of Aggregate Statistics." https://crypto.stanford.edu/prio/paper.pdf
[BBC+21] Boneh et al. "Lightweight Techniques for Private Heavy Hitters." https://eprint.iacr.org/2021/017