matrix-org / synapse

Synapse: Matrix homeserver written in Python/Twisted.
https://matrix-org.github.io/synapse
Apache License 2.0
11.79k stars 2.13k forks source link

One time key counts differ between /sync and /keys/upload #10766

Open deepbluev7 opened 3 years ago

deepbluev7 commented 3 years ago

Description

I noticed that after sending an encrypted in a large E2EE room (#e2ee-test:grin.hu) my key count responses got out of sync (might have happened earlier). On sync I got a count of 55, while when doing an empty key upload request returned a count of 33. Does synapse cache the key count for /sync somehow and that got out of sync?

After uploading some keys, the counts synchronized themselves to 50, so I am assuming the ones in /sync were wrong.

Steps to reproduce

Version information

If not matrix.org:

anoadragon453 commented 3 years ago

The count of OTKs retrieved by /sync is in fact cached at the storage layer. However, the same exact cache is also used for the /keys/upload/{deviceId} endpoint.

Are you making use of Synapse worker mode? Perhaps the cache is different across multiple workers.

deepbluev7 commented 3 years ago

I do indeed use synapse in worker mode. And sync is forwarded to a different worker than keys/upload, which is intended as such if I understand the workers documentation correctly.

anoadragon453 commented 3 years ago

Indeed, that is a valid configuration. The storage method I'm talking about in particular is count_e2e_one_time_keys, which does look to be correctly invalidated when keys are added/deleted, and those changes should be streamed out to other workers. So perhaps this is a red herring.

Incremental /sync responses themselves are cached if the same since_token is provided by the client. It sounds somewhat farfetched, but it is possible that your client was sending the same since_token for /sync multiple times? Do you have server and/or client logs of the network activity to check?

deepbluev7 commented 3 years ago

I don't have any logs for that time period anymore, but this happened across multiple restarts in between of which I received messages. /sync returned the same value after calling upload without any actual keys multiple times (to query the key count on startup to workaround older servers not sending a keycount, if it is 0). So keys weren't added between the different sync and upload calls (until I changed something in our code).

One possible cause I could think of, was that I sent a message in a large, encrypted room. I got around 20 key requests at least at the same time and started uploading keys after the first received ones. So there could potentially have been a race there, where the invalidation happened, a new count was calculated, the invalidation was cleared, but in the mean time the count changed again.

anoadragon453 commented 3 years ago

I haven't heard of many races involving cache invalidation before in Synapse, but it's certainly a possibility.

I think the first step towards solving this would be for someone to replicate the behaviour in either a Synapse unit test or via Sytest or Complement test. Given it's potentially a race (and possibly one between workers), Sytest or Complement would probably be best for having the behaviour emerge. Note that worker support in Complement is still in progress.