Should race condition be added as a reason for a signature counter not increasing?

zacknewman commented 1 month ago

Currently § 6.1.1. only states the following as reasons for why a signature counter does not increase:

> If either is non-zero, and the new signCount value is less than or equal to the stored value, a cloned authenticator may exist, or the authenticator may be malfunctioning.

However, it's possible for an older response (older from the perspective of the authenticator) to be processed after a newer one, since there is no guarantee that data a client sends first will be received, let alone processed, before data the same client sends later. This primarily affects passkey flows and not second-factor ones; since for the latter, RPs can either force at most one active ceremony per credential or use the signCount at the time the ceremony began to compare to.

Is this deemed too unlikely to warrant mention?

As an explicit example:

User starts a passkey authentication ceremony and sends the updated signature counter, C1. The same user starts another passkey authentication ceremony and sends a newer signature counter, C2. Before the server receives the response containing C1, it receives and processes the response containing C2. Finally, the server receives and processes the response containing C1, which is less than the current counter, C2.
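As a rough illustration of the sequence above, here is a minimal Python sketch of the counter comparison from § 6.1.1 applied to two responses that arrive out of order. The `StoredCredential` class, the `check_sign_count` function, and the counter values are hypothetical names and numbers chosen for this example; they do not come from the spec or from any library.

```python
from dataclasses import dataclass


@dataclass
class StoredCredential:
    """Hypothetical server-side record for one credential."""
    sign_count: int  # last signature counter the RP accepted


def check_sign_count(cred: StoredCredential, received: int) -> bool:
    """Return True if the counter is acceptable, False on a mismatch."""
    # Both zero: the authenticator does not implement a counter.
    if cred.sign_count == 0 and received == 0:
        return True
    # Strictly greater: accept and store the new value.
    if received > cred.sign_count:
        cred.sign_count = received
        return True
    # Less than or equal: cloned, malfunctioning, or merely processed late.
    return False


cred = StoredCredential(sign_count=10)
C1, C2 = 11, 12  # C1 was signed first, C2 second

result_c2 = check_sign_count(cred, C2)  # newer response is processed first
result_c1 = check_sign_count(cred, C1)  # older response is processed last
print(result_c2, result_c1)
```

With these assumed values, the late C1 response is rejected even though a single, well-behaved authenticator produced both responses.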

There are many reasons why such a thing could happen: messed-up BGP routes, a wonky load balancer, a black hole causing time dilation[^1]. The point is that there are technically legitimate reasons for a counter not increasing.

[^1]: I'm obviously being facetious about this one

Firehed commented 1 month ago

I see no immediate harm in pointing it out, but I'm not sure how actionable it would be for any of the involved parties. Trying to differentiate it from a cloned or malfunctioning authenticator could well be impossible. It might be doable by associating counter data with challenges (I think this is what you are getting at) but such an implementation may be highly error-prone and someone who has cloned an authenticator may be able to exploit this. And even still, I think the appropriate thing to do is fail the ceremony, as you would have done previously.

In your example, the majority-case user experience (outside of application bugs) would likely be "my sign-in request is hanging so I'll try again", which would tend to result in the C1 request/response getting ignored or aborted by the client - though I suppose it could set off some inappropriate alarm bells on the RP side.

> This primarily affects passkey flows and not second-factor ones; since for the latter, RPs can either force at most one active ceremony per credential or use the signCount at the time the ceremony began to compare to.

Can you help me understand how this would differ in practice? For conditional flows, the request could have started minutes or even hours before the response is processed, but the counter shouldn't be incremented until the user actually approves the request (and if that's not the case, I'd argue the authenticator is malfunctioning). I think this only creates problems if you're doing counter/challenge associations - effectively, trying to allow a counter rollback to go through under certain scenarios makes a common flow more likely to run into this problem in the first place.

So after thinking it through a bit, my feeling is "this is unlikely enough that it's safe to omit", but "call it out but still recommend failing the ceremony" also seems fine to me. It's also completely possible I'm missing something obvious!

zacknewman commented 1 month ago

> I see no immediate harm in pointing it out, but I'm not sure how actionable it would be for any of the involved parties. Trying to differentiate it from a cloned or malfunctioning authenticator could well be impossible.

Indeed. I was not trying to imply this was actionable; merely stating non-malicious reasons for this scenario to occur. That same section states:

> Detecting a signature counter mismatch does not indicate whether the current operation was performed by a cloned authenticator or the original authenticator. Relying Parties should address this situation appropriately relative to their individual situations, i.e., their risk tolerance.

so an RP may want to account for these legitimate reasons in their risk tolerance based on whatever probabilities they ascribe.

> It might be doable by associating counter data with challenges (I think this is what you are getting at)

That is what I am getting at, and why I stated such a thing would only be possible for "second-factor" flows (i.e., more accurately, non-discoverable requests).

> but such an implementation may be highly error-prone and someone who has cloned an authenticator may be able to exploit this.

A careful RP could make this relatively error-free. Depending on how the RP achieves this, a cloned authenticator could exploit this; however, a "short" timeout makes this less of an issue.

> And even still, I think the appropriate thing to do is fail the ceremony, as you would have done previously.

Agreed. Again, I was not implying anything with this issue. I was merely pointing out "legitimate" reasons for a signature counter to not increase. As mentioned earlier, this is likely not actionable; therefore I would indeed fail the ceremony. All a user would have to do is re-try.

> In your example, the majority-case user experience (outside of application bugs) would likely be "my sign-in request is hanging so I'll try again", which would tend to result in the C1 request/response getting ignored or aborted by the client - though I suppose it could set off some inappropriate alarm bells on the RP side.

Yep.

> Can you help me understand how this would differ in practice? For conditional flows, the request could have started minutes or even hours before the response is processed, but the counter shouldn't be incremented until the user actually approves the request (and if that's not the case, I'd argue the authenticator is malfunctioning). I think this only creates problems if you're doing counter/challenge associations - effectively, trying to allow a counter rollback to go through under certain scenarios makes a common flow more likely to run into this problem in the first place.

I'm guessing I shouldn't have used the adverb "primarily". It indeed may be the case that most RPs that use non-discoverable requests (i.e., relying on a non-empty PublicKeyCredentialRequestOptions.allowCredentials) are equally susceptible to this. What I was trying to say was that it's at least possible for an RP that uses non-discoverable requests to combat this. A couple of ways are the following:

- Allow at most one active ceremony per Credential ID. This can be achieved several ways (e.g., a bit flag saved on the database which is using serializable transactions and perhaps additional exclusive locks to ensure a read does not occur while an update does). The RP only populates allowCredentials with the PublicKeyCredentialDescriptors that aren't associated with an active ceremony. This is the most foolproof but comes at the cost of UX since users should be able to start concurrent ceremonies with the same Credential ID.
- As stated, the RP could associate the signature counter with the challenge/ceremony. The RP would allow a user to authenticate so long as the counter is larger than the counter at the time the ceremony started. Additionally the RP would not update the counter unless the saved counter is strictly less. Like you said, timeout duration is correlated with cloned authenticator risk.
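To make the second approach concrete, here is a minimal in-memory Python sketch, assuming non-discoverable requests so the RP knows which credentials are in play when the ceremony begins. The `Rp` class, its `begin`/`finish` methods, and the credential ID `cred-a` are hypothetical; a real RP would also verify the signature, enforce a ceremony timeout, and use transactional storage.

```python
import secrets
from dataclasses import dataclass, field


@dataclass
class Rp:
    """Hypothetical in-memory relying party state."""
    stored: dict[str, int] = field(default_factory=dict)  # credential ID -> saved counter
    ceremonies: dict[str, dict[str, int]] = field(default_factory=dict)  # challenge -> snapshot

    def begin(self, allow_credentials: list[str]) -> str:
        """Start a ceremony and snapshot the counters of the allowed credentials."""
        challenge = secrets.token_hex(16)
        self.ceremonies[challenge] = {cid: self.stored[cid] for cid in allow_credentials}
        return challenge

    def finish(self, challenge: str, cred_id: str, received: int) -> bool:
        """Compare the received counter to the snapshot, not the live value."""
        snapshot = self.ceremonies.pop(challenge, None)  # each challenge is single use
        if snapshot is None or cred_id not in snapshot:
            return False
        baseline = snapshot[cred_id]
        if baseline == 0 and received == 0:
            return True  # authenticator does not implement a counter
        if received <= baseline:
            return False  # no progress since the ceremony began
        # Only ever move the saved counter forward.
        if self.stored[cred_id] < received:
            self.stored[cred_id] = received
        return True


rp = Rp(stored={"cred-a": 10})
ch1 = rp.begin(["cred-a"])
ch2 = rp.begin(["cred-a"])
ok2 = rp.finish(ch2, "cred-a", 12)  # newer response is processed first
ok1 = rp.finish(ch1, "cred-a", 11)  # older response is still accepted
print(ok2, ok1, rp.stored["cred-a"])
```

In this sketch both ceremonies succeed and the saved counter ends at the larger value. The trade-off is the one already noted: while a ceremony is open, a clone whose counter exceeds the snapshot would also pass, so the window should be kept short.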

Firehed commented 1 month ago

Gotcha, thanks for all of the clarification! Under the context of "be aware this is a non-malicious scenario where it can occur, but probably still let it fail" this seems like a fine addition.

I do fear that if an RP attempts to detect such requests and permit them anyway, a meddling party (though not necessarily one that could MITM things - once that's in play, basically all bets are off) might be able to create some sort of side-channel attack. E.g., a bad actor on the same network could cause enough traffic to get request C1 to hang, then attempt some sort of replay attack.

To be clear, this fear is entirely based on a gut reaction, not any sort of actual cryptographic assessment. If challenges have a proper timeout, it seems entirely infeasible that the bad actor could do anything in the necessary time window (without nation-state resources, at least).
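For reference, a challenge timeout of the kind mentioned here could be checked roughly as in the following Python sketch. The 60-second value, the `PendingCeremony` record, and the `is_expired` helper are assumptions made for illustration, not values or names taken from the spec.

```python
import time
from dataclasses import dataclass

CEREMONY_TIMEOUT_SECONDS = 60.0  # hypothetical value; choose per risk tolerance


@dataclass(frozen=True)
class PendingCeremony:
    """Hypothetical record kept by the RP for an in-flight ceremony."""
    challenge: str
    started_at: float  # reading from a monotonic clock, in seconds


def is_expired(ceremony: PendingCeremony, now: float,
               timeout: float = CEREMONY_TIMEOUT_SECONDS) -> bool:
    """Return True once more than `timeout` seconds have elapsed."""
    return now - ceremony.started_at > timeout


# A monotonic clock avoids surprises from wall-clock adjustments.
ceremony = PendingCeremony(challenge="example-challenge", started_at=time.monotonic())
print(is_expired(ceremony, now=time.monotonic()))
```

An RP would reject any response whose ceremony has expired before looking at the counter at all, which bounds how long a delayed C1 response stays usable.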

zacknewman commented 1 month ago

> In your example, the majority-case user experience (outside of application bugs) would likely be "my sign-in request is hanging so I'll try again", which would tend to result in the C1 request/response getting ignored or aborted by the client - though I suppose it could set off some inappropriate alarm bells on the RP side.

I think the most likely "real" scenario (don't misconstrue this as me stating this is "likely" in absolute terms) is using a roaming authenticator (e.g., a USB security key). I plug the key into my mobile device and authenticate. Without waiting for the process to complete, I unplug it and plug it into my laptop, where I authenticate again. For the reasons already mentioned, plus weaker resources on the phone, a congested mobile network, etc., the authentication succeeds on the laptop first. Shortly after, authentication finishes on the mobile device. Most users would probably wait for the process to complete before removing the authenticator, mind you, but it's an example. Perhaps I am self-hosting a password manager and my laptop is on the same LAN, whereas my mobile device is on mobile data, which slows the connection: it will likely pass through multiple firewalls that my laptop bypasses, and it communicates through a VPN tunnel, which slows traffic further.

sbweeden commented 1 month ago

Submitted PR as discussed in WG call on 2024-10-02. @zacknewman given you opened this issue, hope it covers what you were thinking.

nsatragno commented 4 weeks ago

> This primarily affects passkey flows and not second-factor ones; since for the latter, RPs can either force at most one active ceremony per credential or use the signCount at the time the ceremony began to compare to.

The RP could snapshot the signCount of every authenticator associated with the user at the time the ceremony began to make that fix work for empty allow lists.

zacknewman commented 4 weeks ago

> The RP could snapshot the signCount of every authenticator associated with the user at the time the ceremony began to make that fix work for empty allow lists.

That requires the RP server to know the user handle at the beginning of the ceremony which is not always the case. For non-discoverable requests this is always true since the server needs to know the user handle in order to fetch the registered credentials; however for discoverable requests, the server may not know the user handle until the authentication response is sent (i.e., AuthenticatorAssertionResponseJSON.userHandle is received by the server).
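The limitation can be seen in a small Python sketch of such a snapshot helper. The `CREDENTIALS` store, the `snapshot_counters` function, and the sample user handle and credential IDs are all hypothetical.

```python
from typing import Optional

# Hypothetical store: user handle -> {credential ID -> saved signCount}
CREDENTIALS: dict[str, dict[str, int]] = {
    "user-1": {"cred-a": 10, "cred-b": 3},
}


def snapshot_counters(user_handle: Optional[str]) -> Optional[dict[str, int]]:
    """Snapshot every counter for a user when a ceremony begins.

    Returns None when the user handle is not yet known, as can happen
    with discoverable requests (empty allowCredentials).
    """
    if user_handle is None:
        return None  # nothing to snapshot until the response arrives
    # Copy so later counter updates do not alter the snapshot.
    return dict(CREDENTIALS.get(user_handle, {}))


known = snapshot_counters("user-1")
unknown = snapshot_counters(None)
print(known, unknown)
```

When the handle is unknown at the start of the ceremony, the RP in this sketch has no baseline to compare against and can only fall back to the live stored counter.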

w3c / webauthn

Should race condition be added as a reason for a signature counter not increasing? #2172