Open richvdh opened 5 months ago
I'm not really sure how best to debug this. We could probably start by turning on all the trace logging in `CrossProcessStoreLock`. It might also be informative to give each `CrossProcessStoreLockGuard` a sequence number, and log when each is allocated and dropped.
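As a rough illustration of the sequence-number idea, here is a minimal sketch. `SequencedGuard`, `EventLog` and `NEXT_GUARD_SEQ` are hypothetical names invented for this example, not the SDK's actual types, and the in-memory log stands in for whatever trace logging would really be used.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};

/// Process-wide counter used to hand out guard sequence numbers (hypothetical).
static NEXT_GUARD_SEQ: AtomicU64 = AtomicU64::new(0);

/// Shared in-memory event log; an assumption made so the sketch is self-contained.
/// Real code would presumably emit trace-level log lines instead.
type EventLog = Arc<Mutex<Vec<String>>>;

/// Hypothetical stand-in for `CrossProcessStoreLockGuard`.
struct SequencedGuard {
    seq: u64,
    log: EventLog,
}

impl SequencedGuard {
    /// Allocates a guard, assigns it the next sequence number and logs the allocation.
    fn new(log: EventLog) -> Self {
        let seq = NEXT_GUARD_SEQ.fetch_add(1, Ordering::Relaxed);
        log.lock().unwrap().push(format!("guard {seq} allocated"));
        Self { seq, log }
    }

    /// Returns the sequence number assigned at allocation time.
    fn seq(&self) -> u64 {
        self.seq
    }
}

impl Drop for SequencedGuard {
    /// Logs the drop, so every allocation can be paired with its release.
    fn drop(&mut self) {
        self.log
            .lock()
            .unwrap()
            .push(format!("guard {} dropped", self.seq));
    }
}
```

With something like this in place, a "dropped" line for a given sequence number appearing in the middle of an operation that should still hold the lock would point at the guard that was released early.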
I'll try to check.
I think I repro'd this. It's flakey though. https://github.com/matrix-org/complement-crypto/actions/runs/9094842308/job/24996764717?pr=52#step:15:790 shows `TestMultiprocessDupeOTKUpload` failing. That test:
`NotificationClient.GetNotification` is called.

This is exactly what the test shows: `notification_test.go:466: /keys/upload returned an error, duplicate key upload? POST http://hs1:8008/_matrix/client/v3/keys/upload (token=syt_dXNlci0zNS1hbGljZQ_YwyfILDCncfsOAbiZBSL_3CgJtE) req_len=538 => HTTP 400`
I'll try to come back to this next week to see if I can get a reliable repro case.
We have a lock whose job is to prevent both the main process and the NSE (notifications) process on Element X from using the `OlmAccount` at the same time. It also stops the two processes from performing an "encryption sync" (i.e., a sliding-sync request in which we request to-device messages) concurrently.
However, we have evidence that it doesn't actually work. We have a rageshake where the main process makes a `sync` request:

... and overlapping with that, the NSE process makes another `/sync` request:

This shouldn't be possible, because both operations are done holding the cross-process lock (the main process in `next_sync_with_lock`, and the NSE process in `run_fixed_iterations`).

Now, I also see this in the log from the main process, in the middle of that `next_sync_with_lock` operation:

... which I believe means we are dropping the cross-process lock. That would obviously explain a lot of things, but I can't figure out how it happens.
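To make the "this shouldn't be possible" claim checkable against logs, one option is a small helper that flags overlapping lock-held spans from the two processes. This is only a sketch: `HeldSpan`, `overlaps` and `find_violations` are hypothetical names, and it assumes the start and end times of each locked operation have already been extracted from the rageshake as millisecond timestamps.

```rust
/// A span during which one process believed it held the cross-process lock.
/// Timestamps are milliseconds; how they are extracted from the logs is out of
/// scope here (hypothetical representation).
#[derive(Debug, Clone, Copy, PartialEq)]
struct HeldSpan {
    start_ms: u64,
    end_ms: u64,
}

/// Two half-open spans [start, end) overlap if each starts before the other ends.
fn overlaps(a: HeldSpan, b: HeldSpan) -> bool {
    a.start_ms < b.end_ms && b.start_ms < a.end_ms
}

/// Returns every (main, nse) pair of spans that overlap, i.e. every point at
/// which mutual exclusion appears to have been violated.
fn find_violations(main: &[HeldSpan], nse: &[HeldSpan]) -> Vec<(HeldSpan, HeldSpan)> {
    let mut violations = Vec::new();
    for &m in main {
        for &n in nse {
            if overlaps(m, n) {
                violations.push((m, n));
            }
        }
    }
    violations
}
```

If the lock worked as intended, `find_violations` would return an empty list for any pair of logs; the overlapping sync requests described above would show up as at least one pair.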