
Cross-process lock doesn't seem to work #3313

Open richvdh opened 5 months ago

richvdh commented 5 months ago

We have a lock whose job is to prevent the main process and the NSE (notifications) process on Element X from using the OlmAccount at the same time. It also stops the two processes from performing an "encryption sync" (i.e., a sliding-sync request in which we request to-device messages) concurrently.
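
For context, the lock is lease-based: the holder periodically extends a lease recorded in the shared store, so a crashed process can't wedge the lock forever. Here is a minimal self-contained sketch of the idea (names and signatures are illustrative only, not the SDK's actual `CrossProcessStoreLock` API, and a local `Instant` stands in for the store-backed lease):

```rust
use std::time::{Duration, Instant};

/// Illustrative lease lock; the real lock keeps the lease in the shared
/// crypto store so that both processes can see it.
struct LeaseLock {
    lease: Option<(String, Instant)>, // (holder, expiry)
    lease_duration: Duration,
}

impl LeaseLock {
    fn new(lease_duration: Duration) -> Self {
        Self { lease: None, lease_duration }
    }

    /// Take (or renew) the lease if it is free, expired, or already ours.
    /// In the SDK this is an atomic compare-and-swap against the store.
    fn try_lock(&mut self, holder: &str) -> bool {
        let now = Instant::now();
        let free = match &self.lease {
            None => true,
            Some((owner, expiry)) => owner == holder || *expiry <= now,
        };
        if free {
            self.lease = Some((holder.to_owned(), now + self.lease_duration));
        }
        free
    }
}

fn main() {
    let mut lock = LeaseLock::new(Duration::from_millis(500));
    assert!(lock.try_lock("main")); // main process takes the lease
    assert!(!lock.try_lock("nse")); // NSE is locked out while the lease is live
    // Unless "main" keeps renewing, the lease expires and "nse" can take it.
    std::thread::sleep(Duration::from_millis(600));
    assert!(lock.try_lock("nse"));
}
```

The failure mode described below would correspond to the lease lapsing (or being released) while the holder still believes it owns the lock.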

However, we have evidence that it doesn't actually work. We have a rageshake where the main process makes a sync request:

2024-04-03T16:18:41.253319Z DEBUG matrix_sdk::http_client: Sending request 
  | crates/matrix-sdk/src/http_client/mod.rs:179 
  | spans: 
    root
    next_sync_with_lock
      > store_generation=26880
    sync_once
      > pos="29"
    send
      > server_versions=[V1_0, V1_1, V1_2, V1_3, V1_4, V1_5]
      > config=RequestConfig { timeout: 60s }
      > request_id="REQ-351"
      > method=POST
      > uri="https://slidingsync.lab.matrix.org/_matrix/client/unstable/org.matrix.msc3575/sync"
      > request_size="72 B"
      > request_body=b"{\"conn_id\":\"encryption\",\"extensions\":{\"to_device\":{\"since\":\"58922486\"}}}"
...
2024-04-03T16:19:02.361031Z DEBUG matrix_sdk::http_client: Got response 
  | crates/matrix-sdk/src/http_client/mod.rs:185 
  | spans: 
    root
    next_sync_with_lock
      > store_generation=26880
    sync_once
      > pos="29"
    send
      > server_versions=[V1_0, V1_1, V1_2, V1_3, V1_4, V1_5]
      > config=RequestConfig { timeout: 60s }
      > request_id="REQ-351"
      > method=POST
      > uri="https://slidingsync.lab.matrix.org/_matrix/client/unstable/org.matrix.msc3575/sync"
      > request_size="72 B"
      > request_body=b"{\"conn_id\":\"encryption\",\"extensions\":{\"to_device\":{\"since\":\"58922486\"}}}"
      > status=200
      > response_size="1.4 kiB"

... and overlapping with that, the NSE process makes another /sync request:

2024-04-03T16:18:59.168928Z DEBUG matrix_sdk::http_client: Sending request 
  | crates/matrix-sdk/src/http_client/mod.rs:179 
  | spans: 
    root
    get_notification
      > room_id="!wMBBpsZzEENxZzGCoY:matrix.org"
      > event_id="$Bydvzcl4X70NQyfw00EIeIWd3fN_Rh8rxqxdYDxLUFw"
    retry_decryption
    run_fixed_iterations
      > store_generation=26881
    sync_once
    send
      > server_versions=[V1_0, V1_1, V1_2, V1_3, V1_4, V1_5, V1_6, V1_7, V1_8, V1_9]
      > config=RequestConfig { timeout: 7s }
      > request_id="REQ-209"
      > method=POST
      > uri="https://slidingsync.lab.matrix.org/_matrix/client/unstable/org.matrix.msc3575/sync"
      > request_size="155 B"
      > request_body=b"{\"conn_id\":\"encryption\",\"txn_id\":\"6cad88800b6540f589143dc318fde96a\",\"extensions\":{\"to_device\":{\"enabled\":true,\"since\":\"58922486\"},\"e2ee\":{\"enabled\":true}}}"
...
2024-04-03T16:18:59.328466Z DEBUG matrix_sdk::http_client: Got response 
  | crates/matrix-sdk/src/http_client/mod.rs:185 
  | spans: 
    root
    get_notification
      > room_id="!wMBBpsZzEENxZzGCoY:matrix.org"
      > event_id="$Bydvzcl4X70NQyfw00EIeIWd3fN_Rh8rxqxdYDxLUFw"
    retry_decryption
    run_fixed_iterations
      > store_generation=26881
    sync_once
    send
      > server_versions=[V1_0, V1_1, V1_2, V1_3, V1_4, V1_5, V1_6, V1_7, V1_8, V1_9]
      > config=RequestConfig { timeout: 7s }
      > request_id="REQ-209"
      > method=POST
      > uri="https://slidingsync.lab.matrix.org/_matrix/client/unstable/org.matrix.msc3575/sync"
      > request_size="155 B"
      > request_body=b"{\"conn_id\":\"encryption\",\"txn_id\":\"6cad88800b6540f589143dc318fde96a\",\"extensions\":{\"to_device\":{\"enabled\":true,\"since\":\"58922486\"},\"e2ee\":{\"enabled\":true}}}"
      > status=200
      > response_size="1.6 kiB"

This shouldn't be possible, because both operations are done holding the cross-process lock (the main process in next_sync_with_lock, and the NSE process in run_fixed_iterations).

Now, I also see this in the log from the main process, in the middle of that next_sync_with_lock operation:

2024-04-03T16:18:51.391736Z  INFO matrix_sdk_common::store_locks: exiting the lease extension loop 
  | crates/matrix-sdk-common/src/store_locks.rs:237 
  | spans: 
    root

... which I believe means we are dropping the cross-process lock. That would obviously explain a lot of things, but I can't figure out how it happens.
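
For reference, the lease-extension mechanism works roughly like this (a conceptual sketch, not the actual store_locks.rs code): the guard keeps a background task alive that periodically renews the lease, and dropping the guard makes that task exit, producing exactly that log line.

```rust
use std::sync::{
    atomic::{AtomicBool, Ordering},
    Arc,
};
use std::time::Duration;

/// Hypothetical guard: dropping it tells the extension loop to stop.
struct Guard {
    alive: Arc<AtomicBool>,
}

impl Drop for Guard {
    fn drop(&mut self) {
        self.alive.store(false, Ordering::SeqCst);
    }
}

fn acquire() -> Guard {
    let alive = Arc::new(AtomicBool::new(true));
    let flag = alive.clone();
    std::thread::spawn(move || {
        while flag.load(Ordering::SeqCst) {
            // ... renew the lease in the shared store ...
            std::thread::sleep(Duration::from_millis(100));
        }
        eprintln!("exiting the lease extension loop");
    });
    Guard { alive }
}

fn main() {
    let guard = acquire();
    // If anything drops this guard while the sync request is still in
    // flight, the lease stops being renewed, eventually expires, and the
    // other process can acquire the lock mid-request.
    drop(guard);
    std::thread::sleep(Duration::from_millis(250));
}
```

If that loop exits mid-request, the lease stops being renewed and the NSE process can legitimately acquire the lock, which would produce exactly the overlap shown above.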

richvdh commented 5 months ago

I'm not really sure how best to debug this. We could probably start by turning on all the trace logging in CrossProcessStoreLock. It might also be informative to give each CrossProcessStoreLockGuard a sequence number, and log when each is allocated and dropped.
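
For illustration, the sequence-number instrumentation could look something like this (a hypothetical sketch; the real code would log through tracing rather than eprintln!):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Global counter so every guard gets a unique sequence number.
static GUARD_SEQ: AtomicU64 = AtomicU64::new(0);

struct InstrumentedGuard {
    seq: u64,
}

impl InstrumentedGuard {
    fn new() -> Self {
        let seq = GUARD_SEQ.fetch_add(1, Ordering::SeqCst);
        eprintln!("allocated cross-process lock guard #{seq}");
        Self { seq }
    }
}

impl Drop for InstrumentedGuard {
    fn drop(&mut self) {
        eprintln!("dropped cross-process lock guard #{}", self.seq);
    }
}

fn main() {
    let g0 = InstrumentedGuard::new();
    let g1 = InstrumentedGuard::new(); // overlapping lifetimes become visible in the logs
    drop(g0);
    drop(g1);
}
```

Pairing each allocation with its drop in the logs would show whether a guard is being dropped while a sync is still in flight.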

Hywan commented 5 months ago

I'll try to check.

kegsay commented 4 months ago

I think I repro'd this. It's flaky, though. https://github.com/matrix-org/complement-crypto/actions/runs/9094842308/job/24996764717?pr=52#step:15:790 shows TestMultiprocessDupeOTKUpload failing. That test checks that the main and NSE processes don't both upload the same one-time keys, and the failure is exactly that:

  notification_test.go:466: /keys/upload returned an error, duplicate key upload? POST http://hs1:8008/_matrix/client/v3/keys/upload (token=syt_dXNlci0zNS1hbGljZQ_YwyfILDCncfsOAbiZBSL_3CgJtE) req_len=538 => HTTP 400

kegsay commented 3 months ago

I'll try to come back to this next week to see if I can get a reliable repro case.