clokep opened this issue 3 years ago
Well, I meant to make a PR for this but screwed up my git branch management locally.
Let's see how this goes?
Just noting that I had a failure on this test as well today (with several re-runs).
I resorted to sticking debug statements into synapse. My understanding: on the first keys query, we call `self.device_handler.device_list_updater.user_device_resync` on the master process (it seems device lists are always handled by the master). That ends up in `update_remote_device_list_cache`, which writes to `device_lists_remote_extremeties` with `user_id='@__ANON__-0:localhost:45313' stream_id=3`.
All normal operation as far as I can see. On the second keys query, in `get_user_devices_from_cache` we call `get_device_list_last_stream_id_for_remotes`, which reads from `device_lists_remote_extremeties` and gives us `user_map={'@__ANON__-0:localhost:41307': None}`.
Guessing there's a race between the cache invalidation and the second device keys call?? Commenting out the `@cachedList` decorator makes the test pass (with `user_map={'@__ANON__-0:localhost:34477': '3'}`).
Notes: the `get_device_list_last_stream_id_for_remotes` function is decorated with `@cachedList`.
Oh bleurgh, that's annoying.
The other way we deal with this sort of thing is by using `retry_until_success { ... }` to wrap the test code, which will repeatedly run it until it passes (with some appropriate exponential backoff). The downside of this approach is that failures result in the test timing out rather than failing in a more helpful way, so it's important to add helpful logging. An example usage:
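Roughly — this is a sketch of the usual pattern rather than a quote of an existing test, assuming the standard sytest helpers (`do_request_json_for`, `log_if_fail`, `assert_json_keys`) and that `$user` and `$remote_user` are the two test users in scope:

```perl
retry_until_success {
   do_request_json_for( $user,
      method  => "POST",
      uri     => "/r0/keys/query",
      content => { device_keys => { $remote_user->user_id => [] } },
   )->then( sub {
      my ( $content ) = @_;
      log_if_fail "keys/query response", $content;

      # Keep failing (and therefore retrying) until the remote user's keys show up.
      assert_json_keys( $content, qw( device_keys ) );
      assert_json_keys( $content->{device_keys}, $remote_user->user_id );

      Future->done( 1 );
   });
};
```

The assertions double as the "helpful logging": if the retries never converge, the last `log_if_fail` output shows what the server kept returning.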
So I think we can wrap the second call to fetch keys in that and it should work?
Thanks @erikjohnston, that sounds like a good avenue of attack.
For completeness, I wanted to demonstrate the race in the logs. Here's a version where I've made sytest `sleep 1;` between key requests.
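Concretely, something like this — a sketch only, since the real test's request helper and setup may differ:

```perl
# First keys query for the remote user...
do_request_json_for( $user,
   method  => "POST",
   uri     => "/r0/keys/query",
   content => { device_keys => { $remote_user->user_id => [] } },
)->then( sub {
   # Crude blocking sleep: give the replication row time to reach the stream_writer.
   sleep 1;

   # ...then the second query, which now sees the refreshed cache.
   do_request_json_for( $user,
      method  => "POST",
      uri     => "/r0/keys/query",
      content => { device_keys => { $remote_user->user_id => [] } },
   );
});
```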
In particular, note that the stream_writer is told over replication about the device lists having changed:
/tmp/logs/server-0/stream_writer.log:2021-09-15 17:32:21,924 - root - 70 - WARNING - process-replication-data-40 - DMR: row=DeviceListsStream.DeviceListsStreamRow(entity='@__ANON__-0:localhost:35035')
So we get a correct cache hit:
/tmp/logs/server-0/stream_writer.log:2021-09-15 17:32:22,826 - synapse.handlers.e2e_keys - 137 - WARNING - POST-1 - DMR: local_query={} remote_queries={'@__ANON__-0:localhost:35035': []}
/tmp/logs/server-0/stream_writer.log:2021-09-15 17:32:22,833 - synapse.storage.databases.main.devices - 474 - WARNING - POST-1 - DMR: user_map={'@__ANON__-0:localhost:35035': '3'}
/tmp/logs/server-0/stream_writer.log:2021-09-15 17:32:22,836 - synapse.storage.databases.main.devices - 481 - WARNING - POST-1 - DMR: users_needing_resync=set()
/tmp/logs/server-0/stream_writer.log:2021-09-15 17:32:22,840 - synapse.handlers.e2e_keys - 171 - WARNING - POST-1 - DMR: user_ids_not_in_cache=set()
Here's the version without the sleep.
The stream_writer never gets the replication message (I'm assuming we kill the server before it has a chance to arrive), so we get an old (incorrect) cache hit:
/tmp/logs/server-0/stream_writer.log:2021-09-15 17:45:21,113 - synapse.storage.databases.main.devices - 474 - WARNING - POST-1 - DMR: user_map={'@__ANON__-0:localhost:33603': None}
/tmp/logs/server-0/stream_writer.log:2021-09-15 17:45:21,114 - synapse.handlers.e2e_keys - 171 - WARNING - POST-1 - DMR: user_ids_not_in_cache={'@__ANON__-0:localhost:33603'}
Unfortunately I don't think `retry_until_success` works either. We end up with a different kind of race!
Also https://github.com/matrix-org/synapse/runs/3630907641, but I think that's running against a sytest release branch that doesn't have the above PR (which was merged to sytest develop).
https://github.com/matrix-org/synapse/runs/3570688017
I think this was just re-enabled?