Write sections on invalid commits and access control

Bren2010 commented 3 months ago

The reason for this PR was discussed on the mailing list: https://mailarchive.ietf.org/arch/msg/mls/iH61yiILUASWyqt-4cbOwZsUpUQ/

In almost every section, this document describes a.) problems that need to be solved to deploy MLS, and b.) the general space of solutions and trade-offs between them. The issue with this section is that is presents problems with no guidance on solutions

ekr commented 3 months ago

I concur with Richard. I don't disagree that this section doesn't provide clear guidance, but that's because the WG did not come down on specific solutions when it designed MLS. This PR does not change that situation, and I don't think we should land it.

Bren2010 commented 3 months ago

Is there different text we could add to this document that addresses my concerns?

bifurcation commented 3 months ago

Assuming by "addressing my concerns" you mean adding detailed guidance for how to handle invalid commits, I think the answer is "no". The details of how you solve this problem are so tied up with application details that I don't really think there's a way to map out the space of solutions in a succinct enough way to fit in this document. This document's job is to describe the overall design and deployment environment for MLS, not to provide solutions for every problem. If folks wanted to start up an "implementation considerations" document where we can go into some of these things in depth, that would be fine; in fact, it has been discussed at some past WG meetings. But we don't need to weigh the architecture document down with this.

Bren2010 commented 3 months ago

Repeating what I said on the list, I don't think it makes any sense at all to publish this document (which, per the intro, "provides guidance on how to deploy MLS along with discussions of privacy/security trade-offs") without discussing this. These problems are core to every MLS deployment, and many of the solutions implementors have chosen have subtle and/or unexpected effects on the security & privacy properties of their deployment

ekr commented 3 months ago

At this point, I think we just disagree, so I think we need the chairs to figure out how to resolve this.

@seanturner @grittygrease

rohanmahy commented 3 months ago

Repeating what I said on the list, I don't think it makes any sense at all to publish this document (which, per the intro, "provides guidance on how to deploy MLS along with discussions of privacy/security trade-offs") without discussing this. These problems are core to every MLS deployment, and many of the solutions implementors have chosen have subtle and/or unexpected effects on the security & privacy properties of their deployment

The architecture document publishes guidance on things we had consensus to provide guidance on. I proposed guidance in a number of areas that also ended up "on the cutting room floor".

Bren2010 commented 3 months ago

I don't know why your past proposals were rejected, but this text addresses substantial issues with the way MLS has been used in practice:

For all of the MLS implementations that chose the "commit-inspecting / external rejoin" approach, I don't think that negating the PCS properties of MLS was an intentional choice, and I don't think that implementors would've still chosen this approach if they'd read the text in this PR.
Similarly for MIMI, I doubt that MIMI would've been architected the way it has been if the text in this PR had existed. Nobody would stand up at an IETF mic and say "yes, between the options listed, I think we should use the approach that the mls-arch doc explains is less secure"

kkohbrok commented 3 months ago

Regarding the concern about PCS-negation: The consequences of "resync" operations to PCS are well known and documented at the end of Section 12.4.3.2 of the specification. If an application wants to mitigate the resulting risk it can follow the recommendations described in that section, or indeed regularly rotate the signature key. It might be worth referencing that text in the architecture document, though.

As far as I'm aware (and as explored in this paper by @cascremers), Applications that use Signal commonly have a similar problem, where they allow the re-establishment of sessions (to recover from loss of state). The result is the same as with a resync in MLS. The advantage of MLS is that it allows the rotation of signature keys to mitigate this problem.

Bren2010 commented 3 months ago

The text that you link to discusses signature key compromise; the "external rejoin" approach makes it impossible to recover from key schedule compromise. No signature key compromise is required

kkohbrok commented 3 months ago

Then I don't understand the compromise scenario you allude to when you say that rejoining negates PCS. I also re-read the text you propose in this PR and it doesn't make it any clearer. Could you elaborate?

Bren2010 commented 3 months ago

Let's say that Alice's view of the key schedule is leaked in epoch i.
Alice sends a PCS-achieving update which is processed by Bob, starting epoch i+1
- At this point, the compromise is healed through PCS
DS sends an unprocessable commit to Bob, triggering him to do an external rejoin
DS provides Bob with the GroupInfo corresponding to epoch i
Bob does the external join, encrypting a new secret to external_pub, encrypting new secrets to the ratchet tree UpdatePath
- The group is compromised again, allowing the DS to eavesdrop. Because Alice's key schedule at epoch i was leaked, the DS knows the private key to external_pub and enough elements of the ratchet tree to decrypt the UpdatePath

Note that while the GroupInfo is signed, it is not necessarily signed by the member that was compromised, so users might not have a way to revoke that key such that the GroupInfo signature would no longer verify.

kkohbrok commented 3 months ago

Why would Bob accept a GroupInfo from epoch i if he's already successfully processed a commit from epoch i+1? Is there a scenario where it would be necessary (from an operational standpoint) for Bob to accept epoch downgrades?

Bren2010 commented 3 months ago

I don't know of a reason that it would be necessary to accept epoch downgrades, but only accepting epoch upgrades doesn't really prevent this issue (and nobody has mentioned that they do this anyway). That's because every client is essentially a signing oracle for a GroupInfo with epoch n+1:

The DS discovers that Alice's key schedule in epoch i is compromised, but Alice already sent a commit which was processed by Bob.
Eve, who's been offline for a while, comes online and starts processing messages. Once she gets to epoch i, the DS provides all the messages except Alice's commit, and waits for Eve to send a commit to this epoch.
- Now the DS has a signed GroupInfo for a compromised version of epoch i+1.
Frank, who's been offline for a while, comes online and starts processing messages. Once he gets to epoch i, the DS provides an unprocessable commit instead of Alice's or Eve's. Frank tries to externally rejoin, gets the compromised GroupInfo for epoch i+1, and produces an external join.
- Now the DS has a signed GrouInfo for a compromised version of epoch i+2.
- The DS sends an unprocessable commit to Eve, triggering her to externally rejoin. The DS provides her with Frank's GroupInfo for epoch i+2.
- Now the DS has a signed GroupInfo for a compromised version of epoch i+3
- ...

rohanmahy commented 3 months ago

I don't know of a reason that it would be necessary to accept epoch downgrades, but only accepting epoch upgrades doesn't really prevent this issue (and nobody has mentioned that they do this anyway).

Uh, if they are accepting epoch downgrades they are not in compliance with RFC 9420, AFAIK. Here is the first bullet item of Section 12.4.2 on Processing a Commit: "A member of the group applies a Commit message by taking the following steps:

Verify that the epoch field of the enclosing FramedContent is equal to the epoch field of the current GroupContext object."

You seem to be changing your story about this attack between yesterday and today. If Alice has "healed" when she sends a Commit with UpdatePath, the external_pub will have changed and all the ratchet_tree secrets above Alice. If Bob processes the "healing commit", the attacks you describe in your example attack yesterday are no longer possible in the new epoch.

Regarding Eve and Frank not getting Alice's commit in today's attack, even if the DS has a GroupInfo and external_pub, the clients Eve and Frank would need to validate the credentials of all the LeafNodes in the GroupInfo (which only the AS could forge, not the DS). There is a separate whole section in the Security Considerations section about the compromise of the AS.

Also, the application still needs a reason to accept the external commit. There may be app-level authorization. In MIMI, for example, Alice would have needed to add the DS to the roster (an active attack of Alice and a malicious DS), otherwise the clients of Eve and Frank would never authorize the presence of the DS in the GroupInfo.

Finally, there is also the epoch_authenticator or gossip if you want to make sure that different nodes have the same view of the group.

raphaelrobert commented 3 months ago

I don't think Brendan was suggesting the DS literally becomes a group member other than by compromising a client, but I agree with the last part in particular.

I wonder if the whole attack scenario can be collapsed to the following:

The attacker has full control over the DS, which includes not delivering messages to specific members
The attacker compromised a client in epoch i

If that's the case, an attack where the DS isolates members and keeps them on a compromised epoch and subsequently a fork is a trivial one. That's not even specific to MLS, but specific to the fact that all communication is relayed over the DS. As Rohan points out, you need an out-of-band verification mechanism in that case.

I'd suggest writing down the exact attack, including all assumptions, in a document so that folks can give better feedback and iterations become possible.

Bren2010 commented 3 months ago

You are conflating two issues, @raphaelrobert. Yes, if an epoch is compromised while a member is offline, then when the member comes online, the DS can withhold PCS-achieving commits that would end the compromise. The only way to detect that is with out-of-band communication.

I only discussed the potential of "group is compromised while member is offline" to address Konrad's question of "why don't externally joining members just check that the epoch field of the GroupInfo is higher than before?" -- Frank and Eve were always going to be vulnerable to a malicious DS. The client that should NOT be vulnerable is Bob, because Bob already saw and processed the PCS-achieving commit.

However, Bob IS vulnerable, and did not actually achieve PCS by processing the commit (described why here). Even if Bob follows Konrad's suggestion of "check that the epoch counter in GroupInfo is higher than before", Bob still does not achieve PCS (described why here). Although, to be clear, a client performing an external join has no obligation under RFC 9420 to do Konrad's suggested check, so the existence of Frank and Eve is unecessary.

kkohbrok commented 3 months ago

Thanks for elaborating @Bren2010, I understand the attack now. Two observations:

This indeed a bit of a weakness in the resync operation. Instead of relying on a strictly higher epoch, Bob would want to check that the transcript of the new group is an extension of the transcript of his last known good state. An honest DS should still deliver all messages to Bob, so Bob should be able to calculate the relevant hashes, at least if the messages are plaintext handshake messages. Alternatively, the DS could provide hashes of the handshake messages. Then if Bob can't make the link between his last known good epoch and the epoch he resyncs into, he should not resync. I haven't thought through the details, but this can probably be done, at least with plaintext handshake messages. If people think this is worth pursuing, I'll try to write this up and maybe we can add it either to the arch doc or as an errata to RFC9420.
In your attack, the DS gets Bob to essentially roll back Alice's healing commit. But why wouldn't the DS just block Alice's commit to begin with as Raphael describes? Your attack seems likely only in scenarios where Alice's compromise isn't too far back, otherwise Eve and Frank would have processed Alice's commit, too. So rolling back to an epoch that is way in the past seems very unlikely.

Bren2010 commented 3 months ago

This indeed a bit of a weakness in the resync operation. Instead of relying on a strictly higher epoch, Bob would want to check that the transcript of the new group is an extension of the transcript of his last known good state.

To circle back to the top though, the question is whether we should include this text in the document. Given that these decisions impair security and that this consequence isn't immediately apparent to people, I think that's a very strong argument for including!

In your attack, the DS gets Bob to essentially roll back Alice's healing commit. But why wouldn't the DS just block Alice's commit to begin with as Raphael describes?

Like I said, in a pure RFC 9420 implementation the epoch checking doesn't happen, so Frank and Eve are unnecessary. The DS can rollback the group as far as it likes (bounded maybe only by credential expirations)

ekr commented 3 months ago

To circle back to the top though, the question is whether we should include this text in the document. Given that these decisions impair security and that this consequence isn't immediately apparent to people, I think that's a very strong argument for including!

It's an argument for including the analysis, but much of the proposed text is actually additional technical guidance. If you'd like to propose some text that only has additional analysis, I'd be glad to take a look

Bren2010 commented 3 months ago

This text has no guidance / recommendations / "should"s. It is only discussing potential solutions and their trade-offs. Not sure which text you're referring to.

ekr commented 3 months ago

Most of the section entitled "Invalid Commits" in which you discuss a bunch of high-level designs.

Bren2010 commented 3 months ago

I'm sorry, I don't understand what you're asking

rohanmahy commented 3 months ago

Hi @Bren2010 , After our phone call yesteday, I tried to summarize the attack (and related attacks), and any mitigations we discussed. Please let me know if I missed anything.

If a member of a group has a key schedule compromise, a malicious DS can prevent healing of the compromise by blocking Commits from the compromised member or preventing them from being fanned out, or by not forwarding a Remove proposal from the compromised member. (The DS maintains the group in an epoch containing a compromised member, possibly isolating the compromised member into a fork of the group).

This risk can be partially mitigated by having a policy requiring each member to periodically update its keying material or be removed, by revoking the compromised client's credentials, and/or by using a mechanism like the epoch_authenticator to make sure all members have the same view of the group. A partially compromised client might choose to reinitialize or create a new group instead of merely updating its key material.

In a variation of this attack, a malicious DS may present a GroupInfo to external joiners from a compromised epoch instead of the correct epoch. Clients are reminded to validate the LeafNodes of all members (including their credentials) of the ratchet tree in an external join. Members who are rejoining members with an external commit can also make sure that the epoch does not decrease (increases?) during a rejoin. (A malicious DS might even send an invalid commit to other members attempting to trigger a rejoin). Small very privacy sensitive groups may choose to to disable external joins.

Bren2010 commented 2 months ago

That's generally right, although the discussion above is about how the "check that the epoch counter increases on rejoin" solution isn't really effective. I'm not aware of any way we could modify the "DS chooses which commit is right" approach to prevent malicious members from bricking the group, while also not breaking MLS's security. We talked about a lot of solutions that would create various wrinkles for an attacker but nothing robust.

kkohbrok commented 2 months ago

I already proposed for Bob to instead check that the new transcript is an extension of his last known good one. @Bren2010 do you think that still doesn't solve the security downgrade caused by your attack?

Then the main question is still what we want in this document and I think here we agree that a description of the problem and an analysis of the risk is appropriate. What we don't seem to have agreement on is the inclusion of concrete, technical solutions. Or at least not which ones.

Bren2010 commented 2 months ago

I don't know if it addresses the security issue, but it definitely doesn't address the functionality issue of preventing a malicious member from bricking the group, which we talked about in mimi.
My core point in opening this PR is that people are choosing solutions without knowing the full upsides/downsides of their choice, and what other options might exist. In terms of which options we discuss, this PR discusses most every proposal I've seen. #251

seanturner commented 2 months ago

Out for a one-week consensus call; see email.

mlswg / mls-architecture

Write sections on invalid commits and access control #261