Closed · BillCarsonFr closed this 2 months ago
Root cause on iOS: `receive_sync_changes` is not multiprocess safe. One process may be writing new values of the account to the store while another retains a stale cache that may eventually override the store with outdated values. When this happens, `next_key_id` for one-time keys is dropped to some lower value, and this then triggers duplicate OTK IDs.
Steps to reproduce:

- Two `OlmMachine`s, foreground and background.
- Foreground is driven by `receive_sync_changes` with `key_counts` lower than 50, then `mark_request_as_sent` for a `keysUpload` request sets it back to 50.
- Foreground will keep creating new OTKs and incrementing `next_key_id`.
- `receive_sync_changes` is called on the background machine.
- Result: background has just overridden the account in the store with a `next_key_id` that is much lower than the foreground's.

Lastly, we also need a recovery method, given that an OTK upload failure will wedge the machine and never mark OTKs as sent.

Possible solution:
@uhoreg wrote
Should we use the public key as the key ID rather than a counter-based key ID? That would ensure that key IDs don't collide.
It would, but it doesn't fix the underlying problem here, which is that a client has uploaded OTKs and forgotten about them, which is pretty much a guaranteed recipe for UISIs.
Agreed about needing to fix the root cause of forgetting about keys. But once the root cause is fixed, using the public key as ID would be a way of at least moving forward. I think the other alternatives are to tell the user that the session is FUBAR and they need to re-log-in, or to have the client claim all of its OTKs and then re-upload new ones.
If the root cause is fixed, then it doesn't matter what we do?
I think the other alternatives are to tell the user that the session is FUBAR and they need to re-log-in,
That would be preferable, IMHO. I'm just concerned that Element Android and Element iOS are unlikely to be the last applications that make the mistake of running two copies of the rust SDK at once - or indeed that a similar bug might arise in any of a million other ways. And when it does, the fact we've implemented a workaround for it is going to make it that much harder to track down.
IMHO we've been too ready to implement workarounds for crypto bugs in the past (in particular: spamming out key requests left right and centre), and doing so has made it harder to actually fix those bugs.
@richvdh following the discussion on https://github.com/vector-im/element-web/issues/26231#issuecomment-1741816275 , hm, this seems to be a hard problem, even completely ignoring the implementation aspect. (Keeping crypto state between a server and a split-brain client consistent – yay.)
Conventional wisdom would be that anything that wants to update the next key id, whichever shape it takes, needs to read-modify-write atomically. That seems to be a hard constraint to achieve across separate processes, which, as far as I understand it, is why element-web considers it easier to aim for a single crypto worker instead of hoping to make matrix-rust-sdk-crypto MT-safe. That's of course a bit sad for me as a user who'd really like to have multiple tabs of Element open.
As pointed out to me in the issue above, synchronization through (browser) local storage is going to be prohibitive, performance-wise. Now I'm left a bit dumbfounded: if we can't have a central lock on creation of OTKs, would the root cause ever be fixable?
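For what it's worth, within a single process the atomic read-modify-write is straightforward; the sketch below is illustrative only, since a `std::sync::Mutex` offers no protection between separate OS processes, which is exactly the constraint under discussion.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};
use std::thread;

// Atomic read-modify-write of a next_key_id counter: correct within one
// process, but useless across process boundaries.
fn allocate_key_id(counter: &Mutex<u64>) -> u64 {
    let mut next = counter.lock().unwrap();
    let id = *next;
    *next += 1;
    id
}

fn main() {
    let counter = Arc::new(Mutex::new(0u64));
    // Eight threads each allocate 100 IDs concurrently.
    let handles: Vec<_> = (0..8)
        .map(|_| {
            let c = Arc::clone(&counter);
            thread::spawn(move || (0..100).map(|_| allocate_key_id(&c)).collect::<Vec<_>>())
        })
        .collect();
    let ids: Vec<u64> = handles.into_iter().flat_map(|h| h.join().unwrap()).collect();
    let unique: HashSet<_> = ids.iter().collect();
    assert_eq!(unique.len(), 800); // no duplicate key IDs within one process
    println!("allocated {} unique ids", unique.len());
}
```

A cross-process equivalent needs an OS-level primitive (a file lock, or a lease stored in the database itself), which is the cross-process lock mentioned later in this thread.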
@marcusmueller as others have already mentioned, the long-term solution in element-web is to move crypto operations into a web worker. In general, it is up to the platform code (web, android and iOS) to ensure that only one copy of the rust-sdk is used on a datastore at a time.
I'm not actually sure what work remains to be done here. AIUI the issue has been solved for now in all platforms by ensuring we only have one rust-sdk process active at a time. @BillCarsonFr @uhoreg wdyt?
@richvdh thanks for the reply! I'd personally say "solved" is a bit of a mischaracterization: it's a rust-sdk shortcoming that's been worked around, to the effect that end users experience significantly regressed usability. Anyway, if the sensible way, and one that is actually attainable in a finite window of time, is going for a webworker architecture: agreed, then there's nothing to be done here.
The issue as reported no longer occurs. If anyone feels that there is more work to be done, please open a new issue.
https://github.com/matrix-org/element-android-rageshakes/issues/59348
Reported today on android 1.6.8-dev
New occurence reported on EIX https://github.com/element-hq/element-x-ios/issues/2238
I got this today and dug out some more information. This bug is more pernicious than originally thought, because it blocks all keys from being uploaded rather than just the affected key.
In my case, I saw in EAX logs (v0.4.0):
No unused fallback keys were found on the server, generated a new fallback key. removed_fallback_key=None
followed by a `/keys/upload` request being sent out, and then:
One time key signed_curve25519:AAAAAAAAAEQ already exists. Old key:...
This happens constantly now, so I cannot upload more OTKs, meaning I am slowly running out of OTKs which will cause UTDs eventually. It also means I can't upload my fallback key.
I am slowly running out of OTKs which will cause UTDs eventually
Well, eventually someone will claim `AAAAAAAAAEQ`, which will mean you can start uploading OTKs again. The real problem is that they will claim the `AAAAAAAAAEQ` that is on the server, rather than the `AAAAAAAAAEQ` that your client knows about... causing UTDs.
There is another rageshake showing this behaviour: https://github.com/element-hq/element-x-android-rageshakes/issues/1134
But in this one we can see that there were two `OlmMachine`s created, with the same device ID but different keys:
2023-11-30T18:18:11.343073Z DEBUG matrix_sdk_crypto::machine: Created a new Olm account | crates/matrix-sdk-crypto/src/machine.rs:288 | spans: login{method="identifier and password"} > with_store{user_id="@xuanjun:matrix.org" device_id="KSBNNMTVDE" ed25519_key=yzQQLODIhkZ5v9Tku9fcvodHddLF7+os8SkyFDL1E/8 curve25519_key=y1/IrbHdpFDhdcF5fHI0QVUubNmp5EcGU9WMNL4/pAU}
2023-11-30T18:18:15.945739Z DEBUG matrix_sdk_crypto::machine: Created a new Olm account | crates/matrix-sdk-crypto/src/machine.rs:288 | spans: restore_session > restore_session > with_store{user_id="@xuanjun:matrix.org" device_id="KSBNNMTVDE" ed25519_key=VvjbqDo1DCfzvv/H8nHiiJ2P00uH7DscBWEaw8NZA+8 curve25519_key=/ecKBMJ1HSdIXfaTFduL/Y8pnO+oIxRjHMH83QUXlE8}
later on we can see the error:
2023-11-30T18:20:33.322647Z DEBUG matrix_sdk::http_client: Error while sending request: Api(Server(ClientApi(Error { status_code: 400, authenticate: None, body: Standard { kind: Unknown, message: "One time key signed_curve25519:AAAAAAAAAA0 already exists. Old key: {\"key\":\"NomqZsEfsQgkKXRYhTlqJtVXrPrtyUmwyAp1rkNz5j4\",\"signatures\":{\"@xuanjun:matrix.org\":{\"ed25519:KSBNNMTVDE\":\"xU2pC19yLgr3Cd1oz/gbadpqH3LjlqJShBROCZ5y3UFGLNReHbI1pYfTUppmwdmNMyvmkMk5585BnYvPNArODQ\"}}}; new key: {'key': 'M0oBPMrurD54JggEqCDH1uR4vH4UtoEzohnUWkwl5GA', 'signatures': {'@xuanjun:matrix.org': {'ed25519:KSBNNMTVDE': 'nUWT9sKlU8KuwsfoRBjTJhbVqvCMPRFfQrtdDm21eIhwuFnDDT9OlL9AaK6uNzUtRjUmgduddoM3cE+SKzeEDw'}}}" } })))
The second machine is probably trying to upload keys with the same IDs (IDs are just simple increments starting at 0).
Update 2024/01/10: Now tracked separately at https://github.com/matrix-org/matrix-rust-sdk/issues/2998
It would be nice if `/keys/upload` didn't reject all the keys when just one is duplicated, fwiw.
It would be nice if `/keys/upload` didn't reject all the keys when just one is duplicated, fwiw.
Hard disagree, for the reason already explained at https://github.com/matrix-org/matrix-rust-sdk/issues/1415#issuecomment-1688155066. TL;DR: if we're not correctly persisting DB transactions, it's better we know about it than paper over it and wait for some even more subtle error to happen.
There is another rageshake showing this behavior https://rageshakes.element.io/api/listing/2023-11-30/183919-DZMA5M6A/logs-0001.log.gz
could you link to the issue for this rageshake?
So `AAAAAAAAAEQ` equates to key 68, looking at https://github.com/matrix-org/vodozemac/blob/main/src/types/mod.rs#L39
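The key ID is the account's u64 counter encoded as the unpadded base64 of its big-endian bytes. The standalone re-implementation of the encoding below is for illustration only (vodozemac's actual `KeyId` lives in the linked `types/mod.rs`); it shows how counter value 68 becomes `AAAAAAAAAEQ`.

```rust
const B64: &[u8] = b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Encode a u64 key index as unpadded base64 over its 8 big-endian bytes.
fn encode_key_id(id: u64) -> String {
    let bytes = id.to_be_bytes();
    let mut out = String::new();
    for chunk in bytes.chunks(3) {
        let mut buf = [0u8; 3];
        buf[..chunk.len()].copy_from_slice(chunk);
        let n = (u32::from(buf[0]) << 16) | (u32::from(buf[1]) << 8) | u32::from(buf[2]);
        let chars = [(n >> 18) & 63, (n >> 12) & 63, (n >> 6) & 63, n & 63];
        // Unpadded base64: a trailing 2-byte chunk yields 3 chars, a 1-byte chunk 2.
        let keep = match chunk.len() {
            3 => 4,
            2 => 3,
            _ => 2,
        };
        for &c in &chars[..keep] {
            out.push(B64[c as usize] as char);
        }
    }
    out
}

fn main() {
    // 8 bytes => two full 3-byte chunks plus one 2-byte chunk => 11 characters.
    println!("{}", encode_key_id(68)); // prints AAAAAAAAAEQ
}
```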
The error message makes it clear that the key ID was reused with a different key:
message: "One time key signed_curve25519:AAAAAAAAAEQ already exists.
Old key: {
\"key\":\"UOPTwvyF3/Kcsh3ToGBCCa/PXIO/42U3VN0fvpfceGI\",
\"signatures\":{
\"@kegan:matrix.org\":{
\"ed25519:GJDDXTSNAD\":\"LG0n4WLaNoEI6zX6pm5mV2OzjxFJEaJEBzQxRskvZG9BU7rFs8rShfYEhuEeYEpMtH2QZT/tMKZehrmib3DLBA\"
}
}
};
new key: {
'key': 'OvKerkVcp3+p8qeaxjxM6N1L+ZyeqzvGOWTBXQeGUxs',
'signatures': {
'@kegan:matrix.org': {
'ed25519:GJDDXTSNAD': 'YHAneQMnV5vA3BZNMnetBGRP8t3wWH8CCKOdTSz8HCqZ1golF/xj4zXSQhEIeYjuu9s9Az8WCuIabD92pxaEAw'
}
}
}
Vodozemac seems to store `next_key_id: u64` along with `unpublished_public_keys: BTreeMap<KeyId, Curve25519PublicKey>` in memory. Both fields get pickled into the same struct: https://github.com/matrix-org/vodozemac/blob/main/src/olm/account/one_time_keys.rs#L28
In order for a key ID to be reused, then afaict the following has to happen:

- `next_key_id = N` and `len(unpublished_public_keys) == N`.
- The keys are uploaded, but the updated pickle is never persisted.
- On reload, `next_key_id = N` again, so the account will generate new keys and cause duplicates.

I would be surprised if this flow can happen, because I'd expect the pickled struct to be persisted prior to the upload request. If so, this couldn't get out of sync, because upon reload the next key ID would be correct along with the unpublished public keys.
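The persist-before-upload assumption can be modeled with a toy pickle (hypothetical, simplified types; the real state goes through the crypto store): if the pickle is saved before the request goes out, a restart reloads a consistent counter alongside the same unpublished IDs, so no IDs are reused.

```rust
// Toy model of the pickled one-time-key state (hypothetical, simplified).
#[derive(Clone, Debug, PartialEq)]
struct Pickle {
    next_key_id: u64,
    unpublished: Vec<u64>, // key IDs awaiting a successful /keys/upload
}

// Generating a key records its ID as unpublished and bumps the counter.
fn generate_keys(p: &mut Pickle, count: u64) {
    for _ in 0..count {
        p.unpublished.push(p.next_key_id);
        p.next_key_id += 1;
    }
}

fn main() {
    let mut mem = Pickle { next_key_id: 0, unpublished: vec![] };
    generate_keys(&mut mem, 2);

    let disk = mem.clone(); // persisted *before* the upload request is sent

    // -- crash here: in-memory state is lost; the upload may or may not have landed --

    let mut reloaded = disk.clone();
    generate_keys(&mut reloaded, 1);
    // The counter survived the crash, so the new key gets a fresh ID.
    assert_eq!(reloaded.unpublished, vec![0, 1, 2]);
    println!("no reused IDs: {:?}", reloaded.unpublished);
}
```

The worst case under this ordering is re-uploading an identical key under an identical ID, which the server can treat as a harmless retry; the dangerous case is only reached when the persist step is skipped or rolled back.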
One thing that stood out to me looking at the code is https://github.com/matrix-org/vodozemac/blob/main/src/olm/account/one_time_keys.rs#L56C1-L58C6 - `mark_as_published` does not take a list of key IDs, instead marking everything in `unpublished_public_keys` as published. Consider:

- N keys are built into an upload request.
- M more keys are generated while the request is in flight.
- `mark_as_published` is called, causing keys N+M to be marked as published but only N actually are. M exist on the client but not the server.

I don't think that can cause the dupe key ID issues in this bug though.
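A minimal model of that bookkeeping makes the N+M window concrete. The `OneTimeKeys` struct below is a simplification of the linked `one_time_keys.rs`, not its real API.

```rust
use std::collections::BTreeMap;

// Simplified model: mark_as_published takes no key IDs, so it flushes the
// whole map, including keys generated after the upload request was built.
struct OneTimeKeys {
    next_key_id: u64,
    unpublished: BTreeMap<u64, &'static str>,
}

impl OneTimeKeys {
    fn generate(&mut self, key: &'static str) {
        self.unpublished.insert(self.next_key_id, key);
        self.next_key_id += 1;
    }

    fn mark_as_published(&mut self) {
        self.unpublished.clear();
    }
}

fn main() {
    let mut otks = OneTimeKeys { next_key_id: 0, unpublished: BTreeMap::new() };
    otks.generate("keyA"); // built into an upload request (N = 1)
    let uploaded: Vec<u64> = otks.unpublished.keys().copied().collect();
    otks.generate("keyB"); // generated while the request is in flight (M = 1)
    otks.mark_as_published(); // response arrives: *both* keys marked published
    assert_eq!(uploaded, vec![0]);
    assert!(otks.unpublished.is_empty()); // key 1 now exists only on the client
    println!("server has {} key(s); client believes 2 are published", uploaded.len());
}
```

As the comment notes, this loses keys (client-only keys the server never saw) rather than duplicating IDs, so it is a separate hazard from the one in this issue's title.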
I don't think this has been fixed.
Need to check recent rageshakes to see how often it is still occurring.
We are also experiencing this: several reports that decrypting messages fails. We have reports on iOS (multi instance) as well as macOS (single instance) of this problem, where we see an endless loop of trying to submit the same key, which is rejected by the server, eventually leading to the user not being able to decrypt messages. I can provide further logs and info upon request (but not in public).
(Slightly modified for privacy) we are seeing this repeatedly (hundreds of times):
[2024-04-08][10:04:51.289793][matrix_sdk::encryption][WARN] Error when sending out an outgoing E2EE request error=Http(Api(Server(ClientApi(Error { status_code: 400, authenticate: None, body: Standard { kind: Unknown, message: "One time key signed_curve25519:AAAAAAAAAA8 already exists. Old key: {\"key\":\"cuglvp20g6Eg1LmVHXJoprTcgfd61onhOzTjPVWCiiQ\",\"signatures\":{\"@XXX\":{\"ed25519:DUPNNDODMN\":\"sM5EK3lkwuyzP/aEZGVich4b1v9LD6VQ41QZuFVBBbgtKTlxFK9b2/Yepu2WQnlxsaUVHaKe0ubQ50irtn3ACg\"}}}; new key: {'key': 'yDoIRQ3PK3VKbysZV14UvcG4TsVxeFxqbwOAy3foeEA', 'signatures': {'@XXX': {'ed25519:DUPNNDODMN': 'nJOR/n9bxG5VLlVKcyEdZ1cGTcDIV1QpjdKYnhmoqlrk0T4rxguIfHbauMbB+uHgltMSFHNMzt4YFAIp0fuaAA'}}}" } }))))
and then it switches to a different key at some (random) point, but still has the same error:
[2024-04-08][10:06:48.933601][matrix_sdk::encryption][WARN] Error when sending out an outgoing E2EE request error=Http(Api(Server(ClientApi(Error { status_code: 400, authenticate: None, body: Standard { kind: Unknown, message: "One time key signed_curve25519:AAAAAAAAAA8 already exists. Old key: {\"key\":\"cuglvp20g6Eg1LmVHXJoprTcgfd61onhOzTjPVWCiiQ\",\"signatures\":{\"XXX\":{\"ed25519:DUPNNDODMN\":\"sM5EK3lkwuyzP/aEZGVich4b1v9LD6VQ41QZuFVBBbgtKTlxFK9b2/Yepu2WQnlxsaUVHaKe0ubQ50irtn3ACg\"}}}; new key: {'key': 'yDoIRQ3PK3VKbysZV14UvcG4TsVxeFxqbwOAy3foeEA', 'signatures': {'@XX': {'ed25519:DUPNNDODMN': 'nJOR/n9bxG5VLlVKcyEdZ1cGTcDIV1QpjdKYnhmoqlrk0T4rxguIfHbauMbB+uHgltMSFHNMzt4YFAIp0fuaAA'}}}" } }))))
And sometimes we are seeing (but only once seen so far):
[2024-04-07][12:37:39.935698][matrix_sdk_crypto::session_manager::sessions][WARN] Error creating Olm session user_id="@XXXX" device_id="BAGEPAEHCQ" error=InvalidSignature { one_time_key: SignedKey { key: "curve25519:8ZgL1Y8o2dLQRmMC0EXqOLR+ZUu8MPEzo8CKSDIFYV8", signatures: Signatures({"@XXXXX": {"ed25519:BAGEPAEHCQ": Ok(Ed25519("ed25519:z26+zO9wFUGouKXQ/rI75SzzqvG3dfkcBblypH/kUP7mYYXASr645AG7TgCmo7hJXgv34qpdVJkt0CNGV+yTBQ"))}}), fallback: Some(Some(true)), other: {} }, signing_key: Some("ed25519:E15iY8LBAymJj3O2AUVX/qWf2dBdM8leu39slyH7REA"), error: VerificationError(Signature(signature::Error { source: Some(Verification equation was not satisfied) })) }
[2024-04-07][12:37:39.955065][matrix_sdk_crypto::session_manager::sessions][WARN] Error creating Olm session user_id="@YYYY" device_id="UBAWJIQUWK" error=InvalidSignature { one_time_key: SignedKey { key: "curve25519:C8n9lk96+Yvk6mZoR08yaCp08OHj4NO71aW1qJ7i8gY", signatures: Signatures({"@YYYY": {"ed25519:UBAWJIQUWK": Ok(Ed25519("ed25519:vbcYScbWIL0HARMUZPRA3uUfD37GuPIBiwrhbMKeo8EdZl+2fi+JR8KJKf46JrOUW3Fnwbas5tjWZUlphKoJBw"))}}), fallback: None, other: {} }, signing_key: Some("ed25519:K39DjnYRNtx3AqmJWrQ2PEDpMM+YgNgirz+ESLj7PGU"), error: VerificationError(Signature(signature::Error { source: Some(Verification equation was not satisfied) })) }
As I was asked to provide any information I have about this: while we had seen "unable to decrypt" and also this sending error occasionally before, it seems to have had a huge uptick recently, with several of our users reporting this issue (I am not sure whether it is an iOS-only issue, but it seems to happen there predominantly, so it could be NSE related).
So I looked at the SDK upgrades we did, and from the timeline I think this problem was introduced some time since 1d6c01fb342507b917697d02392686c861cbd0a7. On a quick review of the changes since (especially of the Olm machine) I couldn't really see anything directly myself, but maybe this is useful information for someone else with more in-depth knowledge of that piece of code...
I still see this in recent rageshakes. It's hard to know if this is due to an older version, as every OTK upload request will fail until the duplicate key ID has been /claim'd by someone.
There is definitely a multiprocess bug which can cause this; see this test, which can fail sometimes due to insufficient cross-process protection when uploading OTKs. It's currently skipped because it doesn't always reproduce the bug, hence it is flaky.
However, I've also seen this on Element X Android, so that cannot be the only scenario where dupe OTKs can be uploaded.
In EXA, I see this:
2024-05-14T23:32:01.479257Z DEBUG matrix_sdk::http_client: Error while sending request: Api(Server(ClientApi(Error { status_code: 400, body: Standard { kind: Unknown, message: "One time key signed_curve25519:AAAAAAAAAT0 already exists. Old key: {\"key\":\"/2oabX2cZa7TUjh6ZDsQsOQI3CKW9o0rJ4WJQU0K024\...
followed hours later by:
2024-05-15T13:00:28.678130Z WARN matrix_sdk_crypto::olm::account: Failed to create a new Olm session from a pre-key message: InboundCreation(MissingOneTimeKey("curve25519:/2oabX2cZa7TUjh6ZDsQsOQI3CKW9o0rJ4WJQU0K024"))
confirming that OTK already exists errors are causing failed Olm sessions and hence UTDs. This particular bug report was sent on Element X/0.4.12, 2024051500; SDK e709cca2f. At the time the keys were uploaded:
2024-05-14T23:32:00.935934Z DEBUG elementx: Application version: 0.4.12 (40004120) - release / gplay |
2024-05-14T23:32:00.936164Z DEBUG elementx: Git SHA: 4f3a66f2 |
2024-05-14T23:32:00.936370Z DEBUG elementx: SDK version: SDK VERSION (TODO) |
2024-05-14T23:32:00.936646Z DEBUG elementx: Local time: 05-15 01:32:00.928+0200 |
The OTKs were uploaded as the app started, possibly hinting at the cause?
One possible cause might be that EXA was also creating multiple client instances: https://github.com/element-hq/element-x-android/pull/3050.
And another case where EX creates multiple `Client` objects in the same process: https://github.com/element-hq/element-x-ios/pull/2944
This might be why I've struggled to reproduce this in complement-crypto, if they are app-level issues.
Can we close this then, if it's due to applications mis-using the api (now fixed?) rather than a bug in matrix-rust-sdk?
I would say we're not quite sure it is the root cause, and what would increase confidence a lot is resolving this issue. But we can close this one in the meanwhile, and reopen when/if it happens later (with the same doubt that the linked issue could be the panacea fix for it).
Collating stats on failed `/keys/upload`:

- `request_size="11.9 kiB"` - indicating the initial set of 50 OTKs were re-uploaded; dupe client instance?
- `request_size="255 B"` - indicating a re-upload of a consumed key.
- `request_size="11.9 kiB"` - indicating the initial set of 50 OTKs were re-uploaded; dupe client instance?
- `request_size="1.2 kiB"`
- `request_size="254 B"`
Meanwhile, for failed decryptions due to `Failed to create a new Olm session from a pre-key message: InboundCreation(MissingOneTimeKey)`:

Failed `/keys/upload` requests will result in `InboundCreation(MissingOneTimeKey)` eventually. What's notable is that EW-R does not appear to have problems with its `/keys/upload` loop, hinting that the problem is something to do with the mobile apps. The main difference I can think of is that the mobile apps all use the sqlite crypto store, but EW uses the indexeddb store. Could there be transactional failures here which could cause this?
Could be one of many failure modes: the apps until recently would create several `Client` instances, which is super bad; I'd expect Web would create only a single `OlmMachine`; EXI is the only one to use the crypto cross-process lock; etc.
Another case of this, at least on Android, will be due to the following: sharing a single `OlmMachine` does not happen when notifications are processed using the `NotificationClient`.

The `NotificationClient` seems to have a bug where it unconditionally creates a copy of the `Client` object with an in-memory state and, more importantly, a completely new `OlmMachine`; this `OlmMachine` will over time fork off from the original `OlmMachine` the parent `Client` object has.
The constructor for the `NotificationClient` calls `Client::notification_client()`, which in turn calls `BaseClient::clone_with_in_memory_state_store()`, which creates a new `BaseClient` with the `OlmMachine` set to `None`. Later on we set the `OlmMachine` to `Some` with the `set_session_meta()` call.
So now two `OlmMachine` objects exist, creating duplicate one-time keys and forgetting or rolling back `Session` objects.
This is likely the biggest culprit for this issue and once we fix it we might even close it.
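The fork described above reduces to a counter that gets cloned into a second machine. The names below are illustrative, not the SDK's API: both machines continue from the same persisted `next_key_id` and therefore allocate colliding IDs.

```rust
// Sketch of the forked-OlmMachine failure: the NotificationClient's copy
// starts from the same persisted counter as the parent Client's machine,
// so both allocate the same key IDs independently.
#[derive(Clone)]
struct MachineState {
    next_key_id: u64,
}

impl MachineState {
    // Allocate n consecutive key IDs, advancing this machine's counter only.
    fn allocate(&mut self, n: u64) -> Vec<u64> {
        let ids: Vec<u64> = (self.next_key_id..self.next_key_id + n).collect();
        self.next_key_id += n;
        ids
    }
}

fn main() {
    let mut parent = MachineState { next_key_id: 50 };
    let mut notification_copy = parent.clone(); // in-memory fork

    let a = parent.allocate(3);
    let b = notification_copy.allocate(3);
    assert_eq!(a, b); // both will try to upload signed_curve25519 keys 50..52
    println!("colliding ids: {:?}", a);
}
```

Whichever machine uploads second hits the `One time key ... already exists` rejection, matching the logs earlier in this thread.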
https://github.com/matrix-org/complement-crypto/pull/133 is a failing regression test for the scenario @poljar is describing. This fails with a dupe OTK upload error, as per this issue's title.
From time to time we are getting this error on various clients.
Element R iOS shows it as a modal.
Most other clients (including Element Web R and Element X) just log the error without showing it to the user.
This happens (among other reasons) when a client creates a one-time keypair, uploads the public part to the server, and then forgets that it has done so (eg due to unsafe concurrent access to the crypto store). The client is then unable to upload further OTKs until the forgotten one is claimed by another device; and once it is claimed by another device, the resultant messages will be undecryptable.