matrix-org / matrix-rust-sdk

Matrix Client-Server SDK for Rust
Apache License 2.0

Lost OTK, leading to "OneTime key already exists" error and later UTDs #1415

Closed: BillCarsonFr closed this 2 months ago

BillCarsonFr commented 1 year ago

From time to time we are getting this error on various clients.

Element R iOS shows it as a modal (screenshot in the original report).

Most other clients (including Element Web R and Element X) just log the error without showing it to the user.

This happens (among other reasons) when a client creates a one-time keypair, uploads the public part to the server, and then forgets that it has done so (eg due to unsafe concurrent access to the crypto store). The client is then unable to upload further OTKs until the forgotten one is claimed by another device; and once it is claimed by another device, the resultant messages will be undecryptable.

Anderas commented 1 year ago

Root cause on iOS: receive_sync_changes is not multiprocess-safe, and one process may be writing new values of the account to the store while another retains a stale cache that may eventually overwrite the store with outdated values. When this happens, next_key_id for one-time keys drops back to some lower value, and this then triggers duplicate OTK IDs.

Steps to reproduce:

  1. create two OlmMachines, foreground and background
  2. simulate a few sync loops on the foreground by
    • calling receive_sync_changes with key_counts lower than 50
    • calling mark_request_as_sent for the keysUpload request, setting the count back to 50
    • => as a result, the foreground keeps creating new OTKs and incrementing next_key_id
  3. call a single receive_sync_changes on the background => result: the background has just overwritten the account in the store with a next_key_id that is much lower than the foreground's

Anderas commented 1 year ago

Some rageshakes

Anderas commented 1 year ago

Lastly, we also need a recovery method, given that an OTK upload failure will wedge the machine and never mark OTKs as sent. Possible solution:

  1. Pretend that the upload succeeded, setting a safe default value for the uploaded one-time key count on the Account
  2. Claim all of your own keys on the server; you don't know which one-time key has lost its private part
  3. Re-upload a fresh set of one-time keys
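A minimal simulation of that recovery flow, under stated assumptions: Server is a toy in-memory stand-in for the homeserver's OTK table, and none of the names here are real SDK or API identifiers:

```rust
use std::collections::BTreeMap;

// Toy in-memory stand-in for the homeserver's one-time-key table.
struct Server {
    otks: BTreeMap<u64, String>,
}

impl Server {
    // /keys/claim equivalent: hand out (and delete) one key, if any.
    fn claim_one_time_key(&mut self) -> Option<(u64, String)> {
        let id = *self.otks.keys().next()?;
        let key = self.otks.remove(&id)?;
        Some((id, key))
    }

    // /keys/upload equivalent: reject duplicate key IDs like a real server.
    fn upload(&mut self, id: u64, key: String) -> Result<(), String> {
        if self.otks.contains_key(&id) {
            return Err(format!("One time key {id} already exists"));
        }
        self.otks.insert(id, key);
        Ok(())
    }
}

struct Account {
    next_key_id: u64,
    uploaded_count: u64,
}

// Recovery: burn every key the server still holds (step 2), upload a fresh
// set starting from an ID above anything seen (step 3), then adopt a safe
// default for the uploaded count (step 1).
fn recover(account: &mut Account, server: &mut Server) {
    let mut max_seen = account.next_key_id;
    while let Some((id, _key)) = server.claim_one_time_key() {
        max_seen = max_seen.max(id + 1);
    }
    account.next_key_id = max_seen;
    for _ in 0..50 {
        let id = account.next_key_id;
        account.next_key_id += 1;
        server
            .upload(id, format!("fresh-key-{id}"))
            .expect("no collisions possible after claiming everything");
    }
    account.uploaded_count = 50;
}

fn main() {
    let mut server = Server { otks: BTreeMap::new() };
    // Wedged state: the server holds keys our account has forgotten about.
    for id in 0..10 {
        server.upload(id, format!("lost-key-{id}")).unwrap();
    }
    let mut account = Account { next_key_id: 3, uploaded_count: 0 };
    recover(&mut account, &mut server);
    println!("server now holds {} fresh OTKs", server.otks.len());
}
```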
richvdh commented 1 year ago

@uhoreg wrote

Should we use the public key as the key ID rather than a counter-based key ID? That would ensure that key IDs don't collide.

It would, but it doesn't fix the underlying problem here, which is that a client has uploaded OTKs and forgotten about them, which is pretty much a guaranteed recipe for UISIs.

uhoreg commented 1 year ago

Agreed about needing to fix the root cause of forgetting about keys. But once the root cause is fixed, using the public key as ID would be a way of at least moving forward. I think the other alternatives are to tell the user that the session is FUBAR and they need to re-log-in, or to have the client claim all of its OTKs and then re-upload new ones.

richvdh commented 1 year ago

If the root cause is fixed, then it doesn't matter what we do?

I think the other alternatives are to tell the user that the session is FUBAR and they need to re-log-in,

That would be preferable, IMHO. I'm just concerned that Element Android and Element iOS are unlikely to be the last applications that make the mistake of running two copies of the rust SDK at once - or indeed that a similar bug might arise in any of a million other ways. And when it does, the fact we've implemented a workaround for it is going to make it that much harder to track down.

IMHO we've been too ready to implement workarounds for crypto bugs in the past (in particular: spamming out key requests left right and centre), and doing so has made it harder to actually fix those bugs.

marcusmueller commented 1 year ago

@richvdh following the discussion on https://github.com/vector-im/element-web/issues/26231#issuecomment-1741816275 , hm, this seems to be a hard problem, even completely ignoring the implementation aspect. (Keeping crypto state between a server and a split-brain client consistent – yay.)

Conventional wisdom would be that anything that wants to update the next key ID, whichever shape it takes, must read-modify-write atomically. That seems to be a hard constraint to achieve across separate processes, which, as far as I understand it, is why element-web considers it easier to aim for a single crypto worker instead of hoping to make matrix-rust-sdk-crypto safe for multi-process use. That's of course a bit sad for me as a user who'd really like to have multiple tabs of Element open.

As pointed out to me in the issue above, synchronization through (browser) local storage is going to be prohibitive, performance-wise. Now I'm left a bit dumbfounded: if we can't have a central lock on creation of OTKs, would the root cause ever be fixable?
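One shape the atomic read-modify-write could take is a version-checked (optimistic) commit, sketched below as a toy model rather than the real crypto-store API: a writer holding a stale snapshot fails loudly instead of silently rolling next_key_id back.

```rust
// Toy model of a version-checked store write; not the real crypto store.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Account {
    next_key_id: u64,
}

struct Store {
    account: Account,
    version: u64,
}

impl Store {
    // Take a snapshot together with the version it was read at.
    fn read(&self) -> (Account, u64) {
        (self.account, self.version)
    }

    // Commit only if nobody has written since the snapshot was taken.
    fn try_commit(&mut self, read_version: u64, new: Account) -> Result<(), &'static str> {
        if self.version != read_version {
            return Err("stale snapshot: reload before writing");
        }
        self.account = new;
        self.version += 1;
        Ok(())
    }
}

fn main() {
    let mut store = Store { account: Account { next_key_id: 50 }, version: 0 };

    let (mut fg, fg_v) = store.read(); // foreground snapshot
    let (mut bg, bg_v) = store.read(); // background snapshot

    fg.next_key_id += 20;
    store.try_commit(fg_v, fg).unwrap(); // foreground wins

    bg.next_key_id += 1;
    // The stale background write is rejected instead of rolling back.
    assert!(store.try_commit(bg_v, bg).is_err());
    println!("store next_key_id: {}", store.account.next_key_id);
}
```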

richvdh commented 1 year ago

@marcusmueller as others have already mentioned, the long-term solution in element-web is to move crypto operations into a web worker. In general, it is up to the platform code (web, android and iOS) to ensure that only one copy of the rust-sdk is used on a datastore at a time.

I'm not actually sure what work remains to be done here. AIUI the issue has been solved for now in all platforms by ensuring we only have one rust-sdk process active at a time. @BillCarsonFr @uhoreg wdyt?

marcusmueller commented 1 year ago

@richvdh thanks for the reply! I'd personally say "solved" is a bit of a mischaracterization: it's a rust-sdk shortcoming that has been worked around, at the cost of significantly regressed usability for end users. Anyway, if the sensible way forward, and the one actually attainable in a finite window of time, is a web-worker architecture: agreed, there's nothing more to be done here.

richvdh commented 12 months ago

The issue as reported no longer occurs. If anyone feels that there is more work to be done, please open a new issue.

BillCarsonFr commented 12 months ago

https://github.com/matrix-org/element-android-rageshakes/issues/59348

Reported today on android 1.6.8-dev

BillCarsonFr commented 11 months ago

New occurrence reported on EIX https://github.com/element-hq/element-x-ios/issues/2238

kegsay commented 10 months ago

I got this today and dug out some more information. This bug is more pernicious than originally thought, because it blocks all keys from being uploaded rather than just the affected key.

In my case, I saw in EAX logs (v0.4.0):

This happens constantly now, so I cannot upload more OTKs, meaning I am slowly running out of OTKs which will cause UTDs eventually. It also means I can't upload my fallback key.

richvdh commented 10 months ago

I am slowly running out of OTKs which will cause UTDs eventually

Well, eventually someone will claim AAAAAAAAAEQ, which will mean you can start uploading OTKs again. The real problem is that they will claim the AAAAAAAAAEQ that is on the server, rather than the AAAAAAAAAEQ that your client knows about... causing UTDs.

BillCarsonFr commented 10 months ago

There is another rageshake showing this behaviour: https://github.com/element-hq/element-x-android-rageshakes/issues/1134

But in this one we can see that there were two OlmMachines created, with the same deviceId but different keys:

2023-11-30T18:18:11.343073Z DEBUG matrix_sdk_crypto::machine: Created a new Olm account | crates/matrix-sdk-crypto/src/machine.rs:288 | spans: login{method="identifier and password"} > with_store{user_id="@xuanjun:matrix.org" device_id="KSBNNMTVDE" ed25519_key=yzQQLODIhkZ5v9Tku9fcvodHddLF7+os8SkyFDL1E/8 curve25519_key=y1/IrbHdpFDhdcF5fHI0QVUubNmp5EcGU9WMNL4/pAU}
2023-11-30T18:18:15.945739Z DEBUG matrix_sdk_crypto::machine: Created a new Olm account | crates/matrix-sdk-crypto/src/machine.rs:288 | spans: restore_session > restore_session > with_store{user_id="@xuanjun:matrix.org" device_id="KSBNNMTVDE" ed25519_key=VvjbqDo1DCfzvv/H8nHiiJ2P00uH7DscBWEaw8NZA+8 curve25519_key=/ecKBMJ1HSdIXfaTFduL/Y8pnO+oIxRjHMH83QUXlE8}

later on we can see the error:

2023-11-30T18:20:33.322647Z DEBUG matrix_sdk::http_client: Error while sending request: Api(Server(ClientApi(Error { status_code: 400, authenticate: None, body: Standard { kind: Unknown, message: "One time key signed_curve25519:AAAAAAAAAA0 already exists. Old key: {\"key\":\"NomqZsEfsQgkKXRYhTlqJtVXrPrtyUmwyAp1rkNz5j4\",\"signatures\":{\"@xuanjun:matrix.org\":{\"ed25519:KSBNNMTVDE\":\"xU2pC19yLgr3Cd1oz/gbadpqH3LjlqJShBROCZ5y3UFGLNReHbI1pYfTUppmwdmNMyvmkMk5585BnYvPNArODQ\"}}}; new key: {'key': 'M0oBPMrurD54JggEqCDH1uR4vH4UtoEzohnUWkwl5GA', 'signatures': {'@xuanjun:matrix.org': {'ed25519:KSBNNMTVDE': 'nUWT9sKlU8KuwsfoRBjTJhbVqvCMPRFfQrtdDm21eIhwuFnDDT9OlL9AaK6uNzUtRjUmgduddoM3cE+SKzeEDw'}}}" } })))

The second machine was probably trying to upload keys with the same IDs (IDs are simple increments starting from 0).


Update 2024/01/10: Now tracked separately at https://github.com/matrix-org/matrix-rust-sdk/issues/2998

kegsay commented 10 months ago

It would be nice if /keys/upload didn't reject all the keys when just one is duplicated fwiw.

richvdh commented 10 months ago

It would be nice if /keys/upload didn't reject all the keys when just one is duplicated fwiw.

Hard disagree, for the reason already explained at https://github.com/matrix-org/matrix-rust-sdk/issues/1415#issuecomment-1688155066. TL;DR: if we're not correctly persisting DB transactions, it's better we know about it than paper over it and wait for some even more subtle error to happen.

richvdh commented 10 months ago

There is another rageshake showing this behavior https://rageshakes.element.io/api/listing/2023-11-30/183919-DZMA5M6A/logs-0001.log.gz

could you link to the issue for this rageshake?

kegsay commented 10 months ago

So AAAAAAAAAEQ equates to key 68 looking at https://github.com/matrix-org/vodozemac/blob/main/src/types/mod.rs#L39
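For reference, vodozemac's key ID appears to be the u64 counter serialized as big-endian bytes and encoded as unpadded base64; assuming that encoding, a small hand-rolled decoder recovers the counter:

```rust
// Map one base64 character to its 6-bit value (accepting both the
// standard and url-safe alphabets).
fn b64_index(c: u8) -> u64 {
    match c {
        b'A'..=b'Z' => (c - b'A') as u64,
        b'a'..=b'z' => (c - b'a' + 26) as u64,
        b'0'..=b'9' => (c - b'0' + 52) as u64,
        b'+' | b'-' => 62,
        b'/' | b'_' => 63,
        _ => panic!("invalid base64 character"),
    }
}

// Decode an 11-character unpadded-base64 key ID (8 big-endian bytes) back
// to the u64 counter. 11 chars carry 66 bits; the low 2 bits are padding.
fn key_id_to_u64(id: &str) -> u64 {
    let mut acc: u128 = 0;
    for &c in id.as_bytes() {
        acc = (acc << 6) | b64_index(c) as u128;
    }
    (acc >> 2) as u64
}

fn main() {
    println!("{}", key_id_to_u64("AAAAAAAAAEQ")); // the key from the error: 68
}
```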

The error message makes it clear that the key ID was reused with a different key:

message: "One time key signed_curve25519:AAAAAAAAAEQ already exists. 
Old key: {
    \"key\":\"UOPTwvyF3/Kcsh3ToGBCCa/PXIO/42U3VN0fvpfceGI\",
    \"signatures\":{
        \"@kegan:matrix.org\":{
            \"ed25519:GJDDXTSNAD\":\"LG0n4WLaNoEI6zX6pm5mV2OzjxFJEaJEBzQxRskvZG9BU7rFs8rShfYEhuEeYEpMtH2QZT/tMKZehrmib3DLBA\"
        }
    }
};
new key: {
    'key': 'OvKerkVcp3+p8qeaxjxM6N1L+ZyeqzvGOWTBXQeGUxs', 
    'signatures': {
        '@kegan:matrix.org': {
            'ed25519:GJDDXTSNAD': 'YHAneQMnV5vA3BZNMnetBGRP8t3wWH8CCKOdTSz8HCqZ1golF/xj4zXSQhEIeYjuu9s9Az8WCuIabD92pxaEAw'
        }
    }
}

Vodozemac seems to store next_key_id u64 along with unpublished_public_keys: BTreeMap<KeyId, Curve25519PublicKey> in-memory. Both fields get pickled into the same struct https://github.com/matrix-org/vodozemac/blob/main/src/olm/account/one_time_keys.rs#L28
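A paraphrased sketch of that state (the field names match the vodozemac source; the key types and generation logic are simplified stand-ins):

```rust
use std::collections::BTreeMap;

// Simplified model of vodozemac's one-time-key bookkeeping: real keys are
// Curve25519 key pairs, represented here as strings.
struct OneTimeKeys {
    next_key_id: u64,
    unpublished_public_keys: BTreeMap<u64, String>,
}

impl OneTimeKeys {
    fn generate(&mut self, count: u64) {
        for _ in 0..count {
            let id = self.next_key_id;
            self.unpublished_public_keys.insert(id, format!("pubkey-{id}"));
            // Monotonic counter; only a stale pickle can roll it back.
            self.next_key_id += 1;
        }
    }

    // Like the real mark_as_published: it takes no list of key IDs and
    // flushes *every* unpublished key in one go.
    fn mark_as_published(&mut self) {
        self.unpublished_public_keys.clear();
    }
}

fn main() {
    let mut otks = OneTimeKeys {
        next_key_id: 0,
        unpublished_public_keys: BTreeMap::new(),
    };
    otks.generate(3);
    otks.mark_as_published();
    println!(
        "next_key_id = {}, unpublished = {}",
        otks.next_key_id,
        otks.unpublished_public_keys.len()
    );
}
```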

In order for a key ID to be reused, afaict the following has to happen:

I would be surprised if this flow can happen, because I'd expect the pickled struct to be persisted prior to the upload request. If so, this couldn't get out of sync because upon reload, the next key ID would be correct along with the unpublished public keys.

One thing that stood out to me looking at the code is https://github.com/matrix-org/vodozemac/blob/main/src/olm/account/one_time_keys.rs#L56C1-L58C6 - mark_as_published does not take a list of key IDs, instead marking everything in unpublished_public_keys as published. Consider:

I don't think that can cause the dupe key ID issues in this bug though.

richvdh commented 10 months ago

I don't think this has been fixed.

BillCarsonFr commented 8 months ago

Need to check recent rageshakes to see how often it is still occurring

BillCarsonFr commented 7 months ago

New RS https://github.com/element-hq/element-x-ios-rageshakes/issues/1656

gnunicorn commented 7 months ago

We are also seeing this in several reports of failed message decryption. We have reports of this problem on iOS (multi-instance) as well as macOS (single instance): an endless loop of trying to submit the same key, which is rejected by the server, eventually leaving the user unable to decrypt messages. I can provide further logs upon request (but not in public).

(Slightly modified for privacy) we are seeing this repeatedly (hundreds of times):

[2024-04-08][10:04:51.289793][matrix_sdk::encryption][WARN] Error when sending out an outgoing E2EE request error=Http(Api(Server(ClientApi(Error { status_code: 400, authenticate: None, body: Standard { kind: Unknown, message: "One time key signed_curve25519:AAAAAAAAAA8 already exists. Old key: {\"key\":\"cuglvp20g6Eg1LmVHXJoprTcgfd61onhOzTjPVWCiiQ\",\"signatures\":{\"@XXX\":{\"ed25519:DUPNNDODMN\":\"sM5EK3lkwuyzP/aEZGVich4b1v9LD6VQ41QZuFVBBbgtKTlxFK9b2/Yepu2WQnlxsaUVHaKe0ubQ50irtn3ACg\"}}}; new key: {'key': 'yDoIRQ3PK3VKbysZV14UvcG4TsVxeFxqbwOAy3foeEA', 'signatures': {'@XXX': {'ed25519:DUPNNDODMN': 'nJOR/n9bxG5VLlVKcyEdZ1cGTcDIV1QpjdKYnhmoqlrk0T4rxguIfHbauMbB+uHgltMSFHNMzt4YFAIp0fuaAA'}}}" } }))))

and then it switches to a different key at some (random) point, but still has the same error:


[2024-04-08][10:06:48.933601][matrix_sdk::encryption][WARN] Error when sending out an outgoing E2EE request error=Http(Api(Server(ClientApi(Error { status_code: 400, authenticate: None, body: Standard { kind: Unknown, message: "One time key signed_curve25519:AAAAAAAAAA8 already exists. Old key: {\"key\":\"cuglvp20g6Eg1LmVHXJoprTcgfd61onhOzTjPVWCiiQ\",\"signatures\":{\"XXX\":{\"ed25519:DUPNNDODMN\":\"sM5EK3lkwuyzP/aEZGVich4b1v9LD6VQ41QZuFVBBbgtKTlxFK9b2/Yepu2WQnlxsaUVHaKe0ubQ50irtn3ACg\"}}}; new key: {'key': 'yDoIRQ3PK3VKbysZV14UvcG4TsVxeFxqbwOAy3foeEA', 'signatures': {'@XX': {'ed25519:DUPNNDODMN': 'nJOR/n9bxG5VLlVKcyEdZ1cGTcDIV1QpjdKYnhmoqlrk0T4rxguIfHbauMbB+uHgltMSFHNMzt4YFAIp0fuaAA'}}}" } }))))

And sometimes we are seeing this (but only once so far):

[2024-04-07][12:37:39.935698][matrix_sdk_crypto::session_manager::sessions][WARN] Error creating Olm session user_id="@XXXX" device_id="BAGEPAEHCQ" error=InvalidSignature { one_time_key: SignedKey { key: "curve25519:8ZgL1Y8o2dLQRmMC0EXqOLR+ZUu8MPEzo8CKSDIFYV8", signatures: Signatures({"@XXXXX": {"ed25519:BAGEPAEHCQ": Ok(Ed25519("ed25519:z26+zO9wFUGouKXQ/rI75SzzqvG3dfkcBblypH/kUP7mYYXASr645AG7TgCmo7hJXgv34qpdVJkt0CNGV+yTBQ"))}}), fallback: Some(Some(true)), other: {} }, signing_key: Some("ed25519:E15iY8LBAymJj3O2AUVX/qWf2dBdM8leu39slyH7REA"), error: VerificationError(Signature(signature::Error { source: Some(Verification equation was not satisfied) })) }
[2024-04-07][12:37:39.955065][matrix_sdk_crypto::session_manager::sessions][WARN] Error creating Olm session user_id="@YYYY" device_id="UBAWJIQUWK" error=InvalidSignature { one_time_key: SignedKey { key: "curve25519:C8n9lk96+Yvk6mZoR08yaCp08OHj4NO71aW1qJ7i8gY", signatures: Signatures({"@YYYY": {"ed25519:UBAWJIQUWK": Ok(Ed25519("ed25519:vbcYScbWIL0HARMUZPRA3uUfD37GuPIBiwrhbMKeo8EdZl+2fi+JR8KJKf46JrOUW3Fnwbas5tjWZUlphKoJBw"))}}), fallback: None, other: {} }, signing_key: Some("ed25519:K39DjnYRNtx3AqmJWrQ2PEDpMM+YgNgirz+ESLj7PGU"), error: VerificationError(Signature(signature::Error { source: Some(Verification equation was not satisfied) })) }
gnunicorn commented 7 months ago

As I was asked to provide any information I have about this: while we had occasionally seen "unable to decrypt" errors and this key-upload error before, it seems to have had a huge uptick recently, with several of our users reporting this issue (I am not sure whether it is an iOS-only issue, but it seems to happen there predominantly, so it may be NSE-related).

So I looked at the SDK upgrades we did, and from the timeline I think this problem was introduced some time after 1d6c01fb342507b917697d02392686c861cbd0a7. On a quick review of the changes since then (especially of the Olm machine) I couldn't really see anything directly myself, but maybe this is useful information for someone else with more in-depth knowledge of that piece of code...

kegsay commented 5 months ago

I still see this in recent rageshakes. It's hard to know if this is due to an older version, as every OTK upload request will fail until the duplicate key ID has been /claim'd by someone.

There is definitely a multiprocess bug which can cause this: see this test, which can fail due to insufficient cross-process protection when uploading OTKs. It's currently skipped because it doesn't always reproduce the bug, hence it is flaky.

However, I've also seen this on Element X Android, so that cannot be the only scenario where dupe OTKs can be uploaded.

kegsay commented 5 months ago

In EXA, I see this:

2024-05-14T23:32:01.479257Z DEBUG matrix_sdk::http_client: Error while sending request: Api(Server(ClientApi(Error { status_code: 400, body: Standard { kind: Unknown, message: "One time key signed_curve25519:AAAAAAAAAT0 already exists. Old key: {\"key\":\"/2oabX2cZa7TUjh6ZDsQsOQI3CKW9o0rJ4WJQU0K024\...

followed hours later by:

2024-05-15T13:00:28.678130Z WARN matrix_sdk_crypto::olm::account: Failed to create a new Olm session from a pre-key message: InboundCreation(MissingOneTimeKey("curve25519:/2oabX2cZa7TUjh6ZDsQsOQI3CKW9o0rJ4WJQU0K024"))

confirming that "OTK already exists" errors are causing failed Olm sessions and hence UTDs. This particular bug report was sent on Element X 0.4.12 (2024051500; SDK e709cca2f). At the time, the keys were uploaded:

2024-05-14T23:32:00.935934Z DEBUG elementx:  Application version: 0.4.12 (40004120) - release / gplay | 
2024-05-14T23:32:00.936164Z DEBUG elementx:  Git SHA: 4f3a66f2 | 
2024-05-14T23:32:00.936370Z DEBUG elementx:  SDK version: SDK VERSION (TODO) | 
2024-05-14T23:32:00.936646Z DEBUG elementx:  Local time: 05-15 01:32:00.928+0200 | 

The OTKs were uploaded as the app started, possibly hinting at the cause?

poljar commented 5 months ago

(quoting kegsay's comment above in full)

One possible cause might be that EXA was also creating multiple client instances: https://github.com/element-hq/element-x-android/pull/3050.

poljar commented 4 months ago

And another case where EX creates multiple Client objects in the same process: https://github.com/element-hq/element-x-ios/pull/2944

kegsay commented 4 months ago

This might be why I've struggled to reproduce this in complement-crypto, if they are app-level issues.

richvdh commented 4 months ago

Can we close this then, if it's due to applications misusing the API (now fixed?) rather than a bug in matrix-rust-sdk?

bnjbvr commented 4 months ago

I would say we're not quite sure it is the root cause; what would increase confidence a lot is resolving this issue. But we can close this one in the meanwhile, and reopen when/if it happens again (with the same doubt that the linked issue could be the panacea fix for it).

kegsay commented 4 months ago

Collating stats on failed /keys/upload:

Meanwhile, for failed decryptions due to Failed to create a new Olm session from a pre-key message: InboundCreation(MissingOneTimeKey:

A failed /keys/upload will eventually result in InboundCreation(MissingOneTimeKey. What's notable is that EW-R does not appear to have problems with its /keys/upload loop, hinting that the problem is something to do with the mobile apps. The main difference I can think of is that the mobile apps all use the sqlite crypto store, whereas EW uses the indexeddb store. Could there be transactional failures here which could cause this?

bnjbvr commented 4 months ago

Could be one of many failure modes: until recently the apps would create several Client instances, which is super bad (I'd expect Web to create only a single OlmMachine); EXI is the only one to use the crypto cross-process lock; etc.

poljar commented 2 months ago

Another case of this, at least on Android, is due to the following:

  1. Android does not use the cross-process lock, so the regeneration of the OlmMachine does not happen.
  2. It uses the NotificationClient.

The NotificationClient seems to have a bug where it unconditionally creates a copy of the Client object with an in-memory state store and, more importantly, a completely new OlmMachine; this OlmMachine will over time fork away from the original OlmMachine held by the parent Client object.

The constructor for the NotificationClient calls Client::notification_client():

https://github.com/matrix-org/matrix-rust-sdk/blob/5827bb7ab33b275835f16eacb4c0fecfa79f4a2b/crates/matrix-sdk-ui/src/notification_client.rs#L109-L122

Which in turn calls BaseClient::clone_with_in_memory_state_store():

https://github.com/matrix-org/matrix-rust-sdk/blob/5827bb7ab33b275835f16eacb4c0fecfa79f4a2b/crates/matrix-sdk/src/client/mod.rs#L2207

Which creates a new BaseClient with the OlmMachine set to None:

https://github.com/matrix-org/matrix-rust-sdk/blob/5827bb7ab33b275835f16eacb4c0fecfa79f4a2b/crates/matrix-sdk-base/src/client.rs#L169

Later on we set the OlmMachine to Some with the set_session_meta() call:

https://github.com/matrix-org/matrix-rust-sdk/blob/5827bb7ab33b275835f16eacb4c0fecfa79f4a2b/crates/matrix-sdk-base/src/client.rs#L266-L267

So now two OlmMachine objects exist, creating duplicate one-time keys and forgetting or rolling back Session objects.
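A condensed toy model of the construction path described above (heavily simplified stand-ins, not the real SDK types or method signatures):

```rust
use std::sync::{Arc, Mutex};

// Heavily simplified stand-ins for the SDK types involved.
struct OlmMachine {
    id: &'static str,
}

struct BaseClient {
    olm_machine: Arc<Mutex<Option<OlmMachine>>>,
}

impl BaseClient {
    // Mirrors clone_with_in_memory_state_store(): the clone starts with
    // olm_machine = None rather than sharing the parent's machine.
    fn clone_with_in_memory_state_store(&self) -> BaseClient {
        BaseClient { olm_machine: Arc::new(Mutex::new(None)) }
    }

    // Mirrors set_session_meta(): lazily creates a *new* OlmMachine.
    fn set_session_meta(&self, id: &'static str) {
        *self.olm_machine.lock().unwrap() = Some(OlmMachine { id });
    }
}

fn main() {
    let parent = BaseClient {
        olm_machine: Arc::new(Mutex::new(Some(OlmMachine { id: "parent" }))),
    };
    let notification = parent.clone_with_in_memory_state_store();
    notification.set_session_meta("notification");

    // Two independent machines now sit on top of one crypto store and
    // will fork over time.
    let parent_id = parent.olm_machine.lock().unwrap().as_ref().unwrap().id;
    let notif_id = notification.olm_machine.lock().unwrap().as_ref().unwrap().id;
    println!("parent machine: {parent_id}, notification machine: {notif_id}");
}
```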

This is likely the biggest culprit for this issue and once we fix it we might even close it.

kegsay commented 2 months ago

https://github.com/matrix-org/complement-crypto/pull/133 is a failing regression test for the scenario @poljar is describing. This fails with a dupe OTK upload error, as per this issue's title.