clokep opened this issue 3 years ago
Well, I meant to make a PR for this but screwed up my git branch management locally.
Let's see how this goes?
Just noting that I had a failure on this test as well today (with several re-runs).
I resorted to sticking debug statements into synapse. My understanding: on the first keys query, we call `self.device_handler.device_list_updater.user_device_resync` on the master process (it seems device lists are always handled by the master). That ends up in `update_remote_device_list_cache`, which writes to `device_lists_remote_extremeties` with `user_id='@__ANON__-0:localhost:45313' stream_id=3`.
All normal operation as far as I can see. On the second keys query, in `get_user_devices_from_cache` we call `get_device_list_last_stream_id_for_remotes`, which reads from `device_lists_remote_extremeties` and gives us `user_map={'@__ANON__-0:localhost:41307': None}`.
Guessing there's a race between the cache invalidation and the second device keys call?? Commenting out the `@cachedList` decorator makes the test pass (with `user_map={'@__ANON__-0:localhost:34477': '3'}`).
Notes: the `get_device_list_last_stream_id_for_remotes` function is decorated with `@cachedList`.
Oh bleurgh, that's annoying.
The other way we deal with this sort of thing is by using `retry_until_success { ... }` to wrap the test code, which will repeatedly run it until it passes (with some appropriate exponential backoff). The downside of this approach is that failures result in the test timing out rather than failing in a more helpful way, so it's important to add helpful logging. An example usage:
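Roughly — this is a sketch of the usual pattern rather than a quote of an existing test, assuming the standard sytest helpers (`do_request_json_for`, `log_if_fail`, `assert_json_keys`) and that `$user` and `$remote_user` are the two test users in scope:

```perl
retry_until_success {
   do_request_json_for( $user,
      method  => "POST",
      uri     => "/r0/keys/query",
      content => { device_keys => { $remote_user->user_id => [] } },
   )->then( sub {
      my ( $content ) = @_;
      log_if_fail "keys/query response", $content;

      # Keep failing (and therefore retrying) until the remote user's keys show up.
      assert_json_keys( $content, qw( device_keys ) );
      assert_json_keys( $content->{device_keys}, $remote_user->user_id );

      Future->done( 1 );
   });
};
```

The assertions double as the "helpful logging": if the retries never converge, the last `log_if_fail` output shows what the server kept returning.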
So I think we can wrap the second call to fetch keys in that and it should work?
Thanks @erikjohnston, that sounds like a good avenue of attack.
For completeness, I wanted to demonstrate the race in the logs. Here's a version where I've made sytest `sleep 1;` between key requests.
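Concretely, something like this — a sketch only, since the real test's request helper and setup may differ:

```perl
# First keys query for the remote user...
do_request_json_for( $user,
   method  => "POST",
   uri     => "/r0/keys/query",
   content => { device_keys => { $remote_user->user_id => [] } },
)->then( sub {
   # Crude blocking sleep: give the replication row time to reach the stream_writer.
   sleep 1;

   # ...then the second query, which now sees the refreshed cache.
   do_request_json_for( $user,
      method  => "POST",
      uri     => "/r0/keys/query",
      content => { device_keys => { $remote_user->user_id => [] } },
   );
});
```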
In particular, note that the stream_writer is told over replication about the device lists having changed:
/tmp/logs/server-0/stream_writer.log:2021-09-15 17:32:21,924 - root - 70 - WARNING - process-replication-data-40 - DMR: row=DeviceListsStream.DeviceListsStreamRow(entity='@__ANON__-0:localhost:35035')
So we get a correct cache hit:
/tmp/logs/server-0/stream_writer.log:2021-09-15 17:32:22,826 - synapse.handlers.e2e_keys - 137 - WARNING - POST-1 - DMR: local_query={} remote_queries={'@__ANON__-0:localhost:35035': []}
/tmp/logs/server-0/stream_writer.log:2021-09-15 17:32:22,833 - synapse.storage.databases.main.devices - 474 - WARNING - POST-1 - DMR: user_map={'@__ANON__-0:localhost:35035': '3'}
/tmp/logs/server-0/stream_writer.log:2021-09-15 17:32:22,836 - synapse.storage.databases.main.devices - 481 - WARNING - POST-1 - DMR: users_needing_resync=set()
/tmp/logs/server-0/stream_writer.log:2021-09-15 17:32:22,840 - synapse.handlers.e2e_keys - 171 - WARNING - POST-1 - DMR: user_ids_not_in_cache=set()
Here's the version without the sleep.
The stream_writer never gets the replication message (I'm assuming we kill the server before it has a chance to arrive), so we get an old (incorrect) cache hit:
/tmp/logs/server-0/stream_writer.log:2021-09-15 17:45:21,113 - synapse.storage.databases.main.devices - 474 - WARNING - POST-1 - DMR: user_map={'@__ANON__-0:localhost:33603': None}
/tmp/logs/server-0/stream_writer.log:2021-09-15 17:45:21,114 - synapse.handlers.e2e_keys - 171 - WARNING - POST-1 - DMR: user_ids_not_in_cache={'@__ANON__-0:localhost:33603'}
Unfortunately I don't think `retry_until_success` works either. We end up with a different kind of race!
Also https://github.com/matrix-org/synapse/runs/3630907641, but I think that's running against a sytest release branch that doesn't have the above PR (which was merged to sytest develop).
https://github.com/matrix-org/synapse/runs/3570688017
I think this was just re-enabled?