matrix-org / complement-crypto

Apache License 2.0
11 stars 5 forks source link

rust: MemoryStore wedging? #77

Closed kegsay closed 4 months ago

kegsay commented 5 months ago

See https://github.com/matrix-org/matrix-rust-sdk/actions/runs/9351153389/job/25736195990?pr=3496

Times out on pushNotifEventID = bob.SendMessage(t, roomID, text).

Upon investigation, the logs show that the rust SDK did not send the event. It seems to get stuck after GET /state/m.room.encryption/ as there is no log line for "Sending encrypted event because the room is encrypted" but there is a 200 OK response to that GET. There is 1 lock that is taken between the two bits of code:

https://github.com/matrix-org/matrix-rust-sdk/blob/2cf36c7a5ed60c75824bea80422a8dbf829b1bed/crates/matrix-sdk/src/room/mod.rs#L486

I guess something is stopping the send_raw span from taking that lock, probably the next_sync_with_lock > sync_once{pos="1"} span which seems to be doing stuff at the same time.

kegsay commented 5 months ago

TestBackupWrongRecoveryKeyFails/{rust_hs1}|{rust_hs1} has the same failure mode: key_backup_test.go:102: SendMessage(rust) @user-24-alice:hs1: timed out after 11s - it fails on the sanity check that you can send messages into an encrypted room... fails in exactly the same way:

2024-06-04T12:01:27.286925Z  INFO TestBackupWrongRecoveryKeyFails/{rust_hs1}|{rust_hs1}: [@user-24-alice:hs1](rust) SendMessage !GtBcsajlajtqiFKSSY:hs1 => An encrypted message | rust.go:0
2024-06-04T12:01:27.287076Z DEBUG log: timeline    
2024-06-04T12:01:27.287153Z DEBUG log: message_event_content_from_html    
2024-06-04T12:01:27.287180Z DEBUG log: send    
2024-06-04T12:01:27.287594Z DEBUG matrix_sdk_ui::timeline::event_handler: Handling local event | crates/matrix-sdk-ui/src/timeline/event_handler.rs:296 | spans: send{room_id="!GtBcsajlajtqiFKSSY:hs1"} > handle_local_event > handle_local_event > handle_event{txn_id="8232fdcf818d424da8ce3ef774741505"}
2024-06-04T12:01:27.287700Z DEBUG log: processing CharacterTokens(NotSplit, Tendril<UTF8>(shared: \"An encrypted message\")) in insertion mode InBody     | spans: send{room_id="!GtBcsajlajtqiFKSSY:hs1"} > handle_local_event > handle_local_event > handle_event{txn_id="8232fdcf818d424da8ce3ef774741505"}
2024-06-04T12:01:27.287748Z DEBUG log: processing EOF in state Data     | spans: send{room_id="!GtBcsajlajtqiFKSSY:hs1"} > handle_local_event > handle_local_event > handle_event{txn_id="8232fdcf818d424da8ce3ef774741505"}
2024-06-04T12:01:27.287784Z DEBUG log: processing EOFToken in insertion mode InBody     | spans: send{room_id="!GtBcsajlajtqiFKSSY:hs1"} > handle_local_event > handle_local_event > handle_event{txn_id="8232fdcf818d424da8ce3ef774741505"}
2024-06-04T12:01:27.287995Z DEBUG matrix_sdk_ui::timeline::queue: Spawning message-sending task | crates/matrix-sdk-ui/src/timeline/queue.rs:233 | spans: send_queued_messages{room_id="!GtBcsajlajtqiFKSSY:hs1"}
2024-06-04T12:01:27.288072Z DEBUG log: log_event    

local echo:
2024-06-04T12:01:27.288103Z  INFO TestBackupWrongRecoveryKeyFails/{rust_hs1}|{rust_hs1}: [@user-24-alice:hs1]AddTimelineListener[!GtBcsajlajtqiFKSSY:hs1] TimelineDiff len=1 | rust.go:0
2024-06-04T12:01:27.288344Z DEBUG log: change    

check if room is encrypted:
2024-06-04T12:01:27.288338Z DEBUG matrix_sdk::http_client: Sending request | crates/matrix-sdk/src/http_client/mod.rs:186 | spans: send_raw{room_id="!GtBcsajlajtqiFKSSY:hs1" transaction_id="8232fdcf818d424da8ce3ef774741505"} > send{server_versions=[V1_0, V1_1, V1_2, V1_3, V1_4, V1_5, V1_6] config=RequestConfig { timeout: 30s } request_id="REQ-12" method=GET uri="http://127.0.0.1:32794/_matrix/client/v3/rooms/!GtBcsajlajtqiFKSSY:hs1/state/m.room.encryption/"}
2024-06-04T12:01:27.288381Z DEBUG log: push_back    
2024-06-04T12:01:27.288438Z DEBUG log: as_event    
2024-06-04T12:01:27.288478Z DEBUG log: event_id    
2024-06-04T12:01:27.288543Z DEBUG log: sender    
2024-06-04T12:01:27.288680Z DEBUG log: content    
2024-06-04T12:01:27.288717Z DEBUG log: kind    
2024-06-04T12:01:27.288746Z DEBUG log: content    
2024-06-04T12:01:27.288766Z DEBUG log: as_message    
2024-06-04T12:01:27.288795Z DEBUG log: body    
2024-06-04T12:01:27.288844Z DEBUG log: log_event    
2024-06-04T12:01:27.288870Z  INFO TestBackupWrongRecoveryKeyFails/{rust_hs1}|{rust_hs1}: [@user-24-alice:hs1]_______ PUSH BACK &{ID: Text:An encrypted message Sender:@user-24-alice:hs1 Target: Membership: FailedToDecrypt:false}
 | rust.go:0
2024-06-04T12:01:27.288910Z DEBUG log: log_event    
2024-06-04T12:01:27.288930Z  INFO TestBackupWrongRecoveryKeyFails/{rust_hs1}|{rust_hs1}: [@user-24-alice:hs1]TimelineDiff change: &{ID: Text:An encrypted message Sender:@user-24-alice:hs1 Target: Membership: FailedToDecrypt:false} | rust.go:0
2024-06-04T12:01:27.292769Z DEBUG matrix_sdk::sliding_sync: Received response | crates/matrix-sdk/src/sliding_sync/mod.rs:662 | spans: next_sync_with_lock > sync_once
2024-06-04T12:01:27.292871Z DEBUG matrix_sdk::sliding_sync: Start handling response | crates/matrix-sdk/src/sliding_sync/mod.rs:678 | spans: next_sync_with_lock > sync_once
2024-06-04T12:01:27.292937Z DEBUG matrix_sdk::sliding_sync: Update position markers delta_token=None | crates/matrix-sdk/src/sliding_sync/mod.rs:294 | spans: next_sync_with_lock > sync_once > handle_response
2024-06-04T12:01:27.293129Z DEBUG matrix_sdk_crypto::olm::account: The fallback key either expired or we didn't have one: generated a new fallback key. removed_fallback_key=None | crates/matrix-sdk-crypto/src/olm/account.rs:596 | spans: next_sync_with_lock > sync_once > handle_response > preprocess_to_device_events > receive_sync_changes
2024-06-04T12:01:27.294744Z DEBUG matrix_sdk::http_client: Got response | crates/matrix-sdk/src/http_client/mod.rs:192 | spans: sync_once{pos="1"} > send{server_versions=[V1_0, V1_1, V1_2, V1_3, V1_4, V1_5, V1_6] config=RequestConfig { timeout: 60s } request_id="REQ-11" method=POST uri="http://127.0.0.1:32792/_matrix/client/unstable/org.matrix.msc3575/sync" request_size="420 B" request_body=b"{\"conn_id\":\"room-list\",\"txn_id\":\"ec7022eeceea4f709c53dd22075ed086\",\"lists\":{\"all_rooms\":{\"ranges\":[[0,0]]},\"visible_rooms\":{\"ranges\":[[0,19]],\"sort\":[\"by_recency\",\"by_name\"],\"required_state\":[[\"m.room.encryption\",\"\"],[\"m.room.member\",\"$LAZY\"]],\"timeline_limit\":20,\"include_heroes\":true,\"filters\":{\"is_tombstoned\":false,\"not_room_types\":[\"m.space\"]},\"bump_event_types\":[\"m.room.message\",\"m.room.encrypted\",\"m.sticker\"]}}}" status=200 response_size="1.2 kiB"}
2024-06-04T12:01:27.294852Z DEBUG matrix_sdk::sliding_sync: Received response | crates/matrix-sdk/src/sliding_sync/mod.rs:662 | spans: sync_once{pos="1"}
2024-06-04T12:01:27.294945Z DEBUG matrix_sdk::sliding_sync: Start handling response | crates/matrix-sdk/src/sliding_sync/mod.rs:678 | spans: sync_once{pos="1"}
2024-06-04T12:01:27.294995Z DEBUG matrix_sdk::sliding_sync: Update position markers delta_token=None | crates/matrix-sdk/src/sliding_sync/mod.rs:294 | spans: sync_once{pos="1"} > handle_response

got response to encrypted room check, this is the last line which has send_raw in the span.
NB: we do not see "Sending encrypted event because the room is encrypted" anywhere here:
2024-06-04T12:01:27.296228Z DEBUG matrix_sdk::http_client: Got response | crates/matrix-sdk/src/http_client/mod.rs:192 | spans: send_raw{room_id="!GtBcsajlajtqiFKSSY:hs1" transaction_id="8232fdcf818d424da8ce3ef774741505"} > send{server_versions=[V1_0, V1_1, V1_2, V1_3, V1_4, V1_5, V1_6] config=RequestConfig { timeout: 30s } request_id="REQ-12" method=GET uri="http://127.0.0.1:32794/_matrix/client/v3/rooms/!GtBcsajlajtqiFKSSY:hs1/state/m.room.encryption/" status=200 response_size="36 B"}
2024-06-04T12:01:27.300082Z DEBUG matrix_sdk_base::store::memory_store: Saved changes in 12.593µs | crates/matrix-sdk-base/src/store/memory_store.rs:477 | spans: next_sync_with_lock > sync_once > handle_response
2024-06-04T12:01:27.300196Z DEBUG matrix_sdk::sync: Ran event handlers in 17.663µs | crates/matrix-sdk/src/sync.rs:228 | spans: next_sync_with_lock > sync_once > handle_response > call_sync_response_handlers
2024-06-04T12:01:27.300209Z  WARN matrix_sdk_crypto::backups: Trying to backup room keys but no backup key was found | crates/matrix-sdk-crypto/src/backups/mod.rs:520
2024-06-04T12:01:27.300238Z DEBUG matrix_sdk::sync: Ran notification handlers in 7.995µs | crates/matrix-sdk/src/sync.rs:253 | spans: next_sync_with_lock > sync_once > handle_response > call_sync_response_handlers
2024-06-04T12:01:27.300298Z DEBUG matrix_sdk::sliding_sync: Sliding Sync response has been handled by the client sync_response=SyncResponse { rooms: Rooms { leave: {}, join: {}, invite: {} }, account_data: [], to_device: [], notifications: {}, .. } | crates/matrix-sdk/src/sliding_sync/mod.rs:355 | spans: next_sync_with_lock > sync_once > handle_response
2024-06-04T12:01:27.300354Z DEBUG matrix_sdk::sliding_sync: Update lists lists={} | crates/matrix-sdk/src/sliding_sync/mod.rs:420 | spans: next_sync_with_lock > sync_once > handle_response
2024-06-04T12:01:27.300448Z DEBUG matrix_sdk::sliding_sync: Done handling response | crates/matrix-sdk/src/sliding_sync/mod.rs:704 | spans: next_sync_with_lock > sync_once
2024-06-04T12:01:27.300575Z DEBUG matrix_sdk::sliding_sync: Sync stream is running | crates/matrix-sdk/src/sliding_sync/mod.rs:729 | spans: next_sync_with_lock
2024-06-04T12:01:27.300665Z DEBUG matrix_sdk::sliding_sync::cache: loading sliding sync sliding_sync_store::encryption::@user-24-alice:hs1 state from DB finished in 0ms | crates/matrix-sdk/src/sliding_sync/cache.rs:200 | spans: next_sync_with_lock > sync_once
2024-06-04T12:01:27.300709Z DEBUG matrix_sdk::sliding_sync: Sending request | crates/matrix-sdk/src/sliding_sync/mod.rs:597 | spans: next_sync_with_lock > sync_once{pos="1"}
2024-06-04T12:01:27.300890Z DEBUG matrix_sdk::http_client: Sending request | crates/matrix-sdk/src/http_client/mod.rs:186 | spans: next_sync_with_lock > sync_once{pos="1"} > send{server_versions=[V1_0, V1_1, V1_2, V1_3, V1_4, V1_5, V1_6] config=RequestConfig { timeout: 60s } request_id="REQ-13" method=POST uri="http://127.0.0.1:32792/_matrix/client/unstable/org.matrix.msc3575/sync" request_size="65 B" request_body=b"{\"conn_id\":\"encryption\",\"extensions\":{\"to_device\":{\"since\":\"0\"}}}"}
2024-06-04T12:01:27.301180Z DEBUG matrix_sdk_crypto::session_manager::sessions: Collected user/device pairs that are missing an Olm session missing_session_devices_by_user={} timed_out_devices_by_user={} | crates/matrix-sdk-crypto/src/session_manager/sessions.rs:301 | spans: get_missing_sessions
2024-06-04T12:01:27.301863Z DEBUG matrix_sdk::encryption: Uploading public encryption keys device_keys=false one_time_key_count=0 | crates/matrix-sdk/src/encryption/mod.rs:446 | spans: keys_upload{request_id="2980871ad43647e88d5e7e900aff89cc"}
2024-06-04T12:01:27.302058Z DEBUG matrix_sdk::http_client: Sending request | crates/matrix-sdk/src/http_client/mod.rs:186 | spans: keys_upload{request_id="2980871ad43647e88d5e7e900aff89cc"} > send{server_versions=[V1_0, V1_1, V1_2, V1_3, V1_4, V1_5, V1_6] config=RequestConfig { timeout: 30s } request_id="REQ-14" method=POST uri="http://127.0.0.1:32794/_matrix/client/v3/keys/upload" request_size="269 B"}
2024-06-04T12:01:27.303197Z DEBUG matrix_sdk_base::read_receipts: Starting. read_receipts=RoomReadReceipts { num_unread: 0, num_notifications: 0, num_mentions: 0, latest_active: Some(LatestReadReceipt { event_id: "$QUbAnbhviGjhGbtISSqQdxY0XYDZ3RH5sGHlJUnVP1M" }), pending: RingBuffer { inner: [] } } | crates/matrix-sdk-base/src/read_receipts.rs:463 | spans: sync_once{pos="1"} > handle_response > compute_unread_counts{room_id=!GtBcsajlajtqiFKSSY:hs1}
2024-06-04T12:01:27.303391Z DEBUG matrix_sdk_base::read_receipts: saving better | crates/matrix-sdk-base/src/read_receipts.rs:335 | spans: sync_once{pos="1"} > handle_response > compute_unread_counts{room_id=!GtBcsajlajtqiFKSSY:hs1} > try_match_implicit > try_select_later{event_id="$QUbAnbhviGjhGbtISSqQdxY0XYDZ3RH5sGHlJUnVP1M" event_pos=0 prev_pos=Some(0) prev_receipt=None}
2024-06-04T12:01:27.303502Z DEBUG matrix_sdk_base::read_receipts: after finding a better receipt read_receipts=RoomReadReceipts { num_unread: 0, num_notifications: 0, num_mentions: 0, latest_active: Some(LatestReadReceipt { event_id: "$QUbAnbhviGjhGbtISSqQdxY0XYDZ3RH5sGHlJUnVP1M" }), pending: RingBuffer { inner: [] } } | crates/matrix-sdk-base/src/read_receipts.rs:510 | spans: sync_once{pos="1"} > handle_response > compute_unread_counts{room_id=!GtBcsajlajtqiFKSSY:hs1}
2024-06-04T12:01:27.303736Z DEBUG matrix_sdk_base::store::memory_store: Saved changes in 143.069µs | crates/matrix-sdk-base/src/store/memory_store.rs:477 | spans: sync_once{pos="1"} > handle_response
2024-06-04T12:01:27.304635Z DEBUG matrix_sdk::sync: Ran event handlers in 765.124µs | crates/matrix-sdk/src/sync.rs:228 | spans: sync_once{pos="1"} > handle_response > call_sync_response_handlers
2024-06-04T12:01:27.304674Z DEBUG matrix_sdk::sync: Ran notification handlers in 8.706µs | crates/matrix-sdk/src/sync.rs:253 | spans: sync_once{pos="1"} > handle_response > call_sync_response_handlers
2024-06-04T12:01:27.304718Z DEBUG matrix_sdk::sliding_sync: Sliding Sync response has been handled by the client sync_response=SyncResponse { rooms: Rooms { leave: {}, join: {"!GtBcsajlajtqiFKSSY:hs1": JoinedRoom { unread_notifications: UnreadNotificationsCount { highlight_count: 0, notification_count: 0 }, timeline: Timeline { limited: false, prev_batch: Some("t7-165_32_0_1_34_1_17_111_0_1"), events: [SyncTimelineEvent { event: RawEvent { event_id: "$QUbAnbhviGjhGbtISSqQdxY0XYDZ3RH5sGHlJUnVP1M", event_type: "m.room.name", .. } }] }, state: [RawEvent { event_id: "$ridkUXUuqyRKn3I0lf8H1eXxzZxi9X2uv2wbuMZnCYM", event_type: "m.room.member", .. }, RawEvent { event_id: "$BokDim-BS45fEQFB-M3T8EBBcj67bhi9E-1fL3SGexQ", event_type: "m.room.encryption", .. }, RawEvent { event_id: "$QUbAnbhviGjhGbtISSqQdxY0XYDZ3RH5sGHlJUnVP1M", event_type: "m.room.name", .. }], account_data: [], ephemeral: [], ambiguity_changes: {} }}, invite: {} }, account_data: [], to_device: [], notifications: {}, .. } | crates/matrix-sdk/src/sliding_sync/mod.rs:355 | spans: sync_once{pos="1"} > handle_response
2024-06-04T12:01:27.304900Z DEBUG matrix_sdk::sliding_sync: Update lists lists={"all_rooms": SyncList { ops: [], count: 1 }, "visible_rooms": SyncList { ops: [SyncOp { op: Sync, range: Some((0, 0)), index: None, room_ids: ["!GtBcsajlajtqiFKSSY:hs1"], room_id: None }], count: 1 }} | crates/matrix-sdk/src/sliding_sync/mod.rs:420 | spans: sync_once{pos="1"} > handle_response
2024-06-04T12:01:27.305343Z DEBUG matrix_sdk::sliding_sync: Done handling response | crates/matrix-sdk/src/sliding_sync/mod.rs:704 | spans: sync_once{pos="1"}
2024-06-04T12:01:27.305484Z DEBUG matrix_sdk::sliding_sync: Sync stream is running | crates/matrix-sdk/src/sliding_sync/mod.rs:729
2024-06-04T12:01:27.305733Z DEBUG matrix_sdk::sliding_sync: Sending request | crates/matrix-sdk/src/sliding_sync/mod.rs:597 | spans: sync_once{pos="2"}
2024-06-04T12:01:27.306048Z DEBUG matrix_sdk::http_client: Sending request | crates/matrix-sdk/src/http_client/mod.rs:186 | spans: sync_once{pos="2"} > send{server_versions=[V1_0, V1_1, V1_2, V1_3, V1_4, V1_5, V1_6] config=RequestConfig { timeout: 60s } request_id="REQ-15" method=POST uri="http://127.0.0.1:32792/_matrix/client/unstable/org.matrix.msc3575/sync" request_size="100 B" request_body=b"{\"conn_id\":\"room-list\",\"lists\":{\"all_rooms\":{\"ranges\":[[0,0]]},\"visible_rooms\":{\"ranges\":[[0,19]]}}}"}
2024-06-04T12:01:27.309572Z DEBUG matrix_sdk::http_client: Got response | crates/matrix-sdk/src/http_client/mod.rs:192 | spans: next_sync_with_lock > sync_once{pos="1"} > send{server_versions=[V1_0, V1_1, V1_2, V1_3, V1_4, V1_5, V1_6] config=RequestConfig { timeout: 60s } request_id="REQ-13" method=POST uri="http://127.0.0.1:32792/_matrix/client/unstable/org.matrix.msc3575/sync" request_size="65 B" request_body=b"{\"conn_id\":\"encryption\",\"extensions\":{\"to_device\":{\"since\":\"0\"}}}" status=200 response_size="181 B"}
2024-06-04T12:01:27.310776Z DEBUG matrix_sdk::http_client: Got response | crates/matrix-sdk/src/http_client/mod.rs:192 | spans: keys_upload{request_id="2980871ad43647e88d5e7e900aff89cc"} > send{server_versions=[V1_0, V1_1, V1_2, V1_3, V1_4, V1_5, V1_6] config=RequestConfig { timeout: 30s } request_id="REQ-14" method=POST uri="http://127.0.0.1:32794/_matrix/client/v3/keys/upload" request_size="269 B" status=200 response_size="48 B"}
2024-06-04T12:01:27.310856Z DEBUG matrix_sdk_crypto::olm::account: Marking one-time keys as published | crates/matrix-sdk-crypto/src/olm/account.rs:1203 | spans: keys_upload{request_id="2980871ad43647e88d5e7e900aff89cc"}
2024-06-04T12:01:27.316778Z DEBUG matrix_sdk::sliding_sync: Received response | crates/matrix-sdk/src/sliding_sync/mod.rs:662 | spans: next_sync_with_lock > sync_once{pos="1"}
2024-06-04T12:01:27.316838Z DEBUG matrix_sdk::sliding_sync: Start handling response | crates/matrix-sdk/src/sliding_sync/mod.rs:678 | spans: next_sync_with_lock > sync_once{pos="1"}
2024-06-04T12:01:27.316871Z DEBUG matrix_sdk::sliding_sync: Update position markers delta_token=None | crates/matrix-sdk/src/sliding_sync/mod.rs:294 | spans: next_sync_with_lock > sync_once{pos="1"} > handle_response

no logs here for 11s as things time out

2024-06-04T12:01:38.287481Z DEBUG log: stop    
2024-06-04T12:01:38.288200Z DEBUG matrix_sdk::sliding_sync: Sync stream has received an internal message internal_message=Ok(SyncLoopStop) | crates/matrix-sdk/src/sliding_sync/mod.rs:737
2024-06-04T12:01:38.288237Z DEBUG matrix_sdk::sliding_sync: Sync stream has exited. | crates/matrix-sdk/src/sliding_sync/mod.rs:778
2024-06-04T12:01:38.288406Z DEBUG matrix_sdk::sliding_sync: Sync stream has received an internal message internal_message=Ok(SyncLoopStop) | crates/matrix-sdk/src/sliding_sync/mod.rs:737 | spans: next_sync_with_lock
2024-06-04T12:01:38.288450Z DEBUG matrix_sdk::sliding_sync: Sync stream has exited. | crates/matrix-sdk/src/sliding_sync/mod.rs:778 | spans: next_sync_with_lock
2024-06-04T12:01:38.288704Z DEBUG log: log_event    
2024-06-04T12:01:38.288749Z  INFO TestBackupWrongRecoveryKeyFails/{rust_hs1}|{rust_hs1}: [@user-24-alice:hs1](rust) Close | rust.go:0

Just before the encryption event response is returned, https://github.com/matrix-org/matrix-rust-sdk/blob/main/crates/matrix-sdk/src/sliding_sync/mod.rs#L294 is logged, which later on acquires the offending lock https://github.com/matrix-org/matrix-rust-sdk/blob/main/crates/matrix-sdk/src/sliding_sync/mod.rs#L331 so I'm fairly sure this is still blocked and is blocking send_raw from making forward progress somehow. It does seem to release the lock given we see https://github.com/matrix-org/matrix-rust-sdk/blob/main/crates/matrix-sdk/src/sliding_sync/mod.rs#L355 later in the logs though :S

kegsay commented 5 months ago

There was another flake in https://github.com/matrix-org/matrix-rust-sdk/actions/runs/9503969571/job/26195597830?pr=3539 - this time we got past the encryption state and see "Sending encrypted event because the room is encrypted" for Charlie's message send, but then wedge after getting the /members response. The code after getting the response calls receive_all_members which does lots of things and ends with:

        let _sync_lock = self.sync_lock().lock().await;
        let mut room_info = room.clone_info();
        room_info.mark_members_synced();
        changes.add_room(room_info);

        self.store.save_changes(&changes).await?;
        self.apply_changes(&changes, false);

..once again implicating either the sync lock or the data store (via save_changes). Further investigation on previous flakey tests show log lines in the sync code appearing which only appear after releasing the sync lock which should have allowed forward progress on the send event task, which makes myself (and @poljar ) suspect that the sync lock is blameless in this.

kegsay commented 5 months ago

The sync lock seems to be consistently held throughout calls to save_changes afaict. This means it isn't possible to have 2 concurrent tasks inside save_changes at the same time, meaning there must be some other cause here.

kegsay commented 4 months ago
2024-06-26T17:50:15.679813Z DEBUG matrix_sdk::sliding_sync: Start handling response | crates/matrix-sdk/src/sliding_sync/mod.rs:678 | spans: sync_once{pos="1"}
2024-06-26T17:50:15.679858Z DEBUG matrix_sdk::sliding_sync: Update position markers delta_token=None | crates/matrix-sdk/src/sliding_sync/mod.rs:294 | spans: sync_once{pos="1"} > handle_response
2024-06-26T17:50:15.681704Z DEBUG matrix_sdk::http_client: Got response | crates/matrix-sdk/src/http_client/mod.rs:193 | spans: sending_task{room_id=!QHAxGOXvXFVwmJzIiE:hs1} > send_raw{room_id="!QHAxGOXvXFVwmJzIiE:hs1" transaction_id="50da7ffd3d3042f29397de91d214848f" is_room_encrypted=true} > send{server_versions=[V1_0, V1_1, V1_2, V1_3, V1_4, V1_5, V1_6] config=RequestConfig { timeout: 60s, retry_limit: 3 } request_id="REQ-12" method=GET uri="http://127.0.0.1:32781/_matrix/client/v3/rooms/!QHAxGOXvXFVwmJzIiE:hs1/members" status=200 response_size="982 B"}
2024-06-26T17:50:15.682925Z DEBUG matrix_sdk_base::store::memory_store: Saved changes in 13.846µs | crates/matrix-sdk-base/src/store/memory_store.rs:519 | spans: next_sync_with_lock > sync_once > handle_response

For https://github.com/matrix-org/matrix-rust-sdk/actions/runs/9684285011/job/26721675685 - same thing, wedging after /members response.

kegsay commented 4 months ago

And https://github.com/matrix-org/matrix-rust-sdk/actions/runs/9698745771/job/26766021784?pr=3619#step:16:244

kegsay commented 4 months ago

It reproduced with trace logging. It deadlocks somewhere in this code block

kegsay commented 4 months ago
2024-07-09T08:00:29.544568Z TRACE matrix_sdk_base::store::memory_store: starting | crates/matrix-sdk-base/src/store/memory_store.rs:289 | spans: sending_task{room_id=!ZdajZoYbBHcccwoqPi:hs1} > send_raw{room_id="!ZdajZoYbBHcccwoqPi:hs1" transaction_id="9af92432710442b985c20292c0c4fe4a" is_room_encrypted=true} > receive_all_members{room_id="!ZdajZoYbBHcccwoqPi:hs1"} > save_changes
2024-07-09T08:00:29.544609Z TRACE matrix_sdk_base::store::memory_store: profiles | crates/matrix-sdk-base/src/store/memory_store.rs:296 | spans: sending_task{room_id=!ZdajZoYbBHcccwoqPi:hs1} > send_raw{room_id="!ZdajZoYbBHcccwoqPi:hs1" transaction_id="9af92432710442b985c20292c0c4fe4a" is_room_encrypted=true} > receive_all_members{room_id="!ZdajZoYbBHcccwoqPi:hs1"} > save_changes
2024-07-09T08:00:29.544659Z TRACE matrix_sdk_base::store::memory_store: ambiguity maps | crates/matrix-sdk-base/src/store/memory_store.rs:319 | spans: sending_task{room_id=!ZdajZoYbBHcccwoqPi:hs1} > send_raw{room_id="!ZdajZoYbBHcccwoqPi:hs1" transaction_id="9af92432710442b985c20292c0c4fe4a" is_room_encrypted=true} > receive_all_members{room_id="!ZdajZoYbBHcccwoqPi:hs1"} > save_changes
2024-07-09T08:00:29.544685Z TRACE matrix_sdk_ui::timeline::builder: Waiting for an event. | crates/matrix-sdk-ui/src/timeline/builder.rs:188 | spans: room_update_handler{room_id="!ZdajZoYbBHcccwoqPi:hs1"}
2024-07-09T08:00:29.544709Z TRACE matrix_sdk_base::store::memory_store: account data | crates/matrix-sdk-base/src/store/memory_store.rs:331 | spans: sending_task{room_id=!ZdajZoYbBHcccwoqPi:hs1} > send_raw{room_id="!ZdajZoYbBHcccwoqPi:hs1" transaction_id="9af92432710442b985c20292c0c4fe4a" is_room_encrypted=true} > receive_all_members{room_id="!ZdajZoYbBHcccwoqPi:hs1"} > save_changes
2024-07-09T08:00:29.544739Z TRACE matrix_sdk_base::store::memory_store: room account data | crates/matrix-sdk-base/src/store/memory_store.rs:339 | spans: sending_task{room_id=!ZdajZoYbBHcccwoqPi:hs1} > send_raw{room_id="!ZdajZoYbBHcccwoqPi:hs1" transaction_id="9af92432710442b985c20292c0c4fe4a" is_room_encrypted=true} > receive_all_members{room_id="!ZdajZoYbBHcccwoqPi:hs1"} > save_changes
2024-07-09T08:00:29.544743Z TRACE matrix_sdk_ui::timeline::inner: Forcing update of sender profiles: {"@user-38-bob:hs1"} | crates/matrix-sdk-ui/src/timeline/inner/mod.rs:1062 | spans: room_update_handler{room_id="!ZdajZoYbBHcccwoqPi:hs1"}
2024-07-09T08:00:29.544763Z TRACE matrix_sdk_base::store::memory_store: room state | crates/matrix-sdk-base/src/store/memory_store.rs:352 | spans: sending_task{room_id=!ZdajZoYbBHcccwoqPi:hs1} > send_raw{room_id="!ZdajZoYbBHcccwoqPi:hs1" transaction_id="9af92432710442b985c20292c0c4fe4a" is_room_encrypted=true} > receive_all_members{room_id="!ZdajZoYbBHcccwoqPi:hs1"} > save_changes
2024-07-09T08:00:29.544784Z TRACE matrix_sdk_base::store::memory_store: room state: got room_state lock | crates/matrix-sdk-base/src/store/memory_store.rs:355 | spans: sending_task{room_id=!ZdajZoYbBHcccwoqPi:hs1} > send_raw{room_id="!ZdajZoYbBHcccwoqPi:hs1" transaction_id="9af92432710442b985c20292c0c4fe4a" is_room_encrypted=true} > receive_all_members{room_id="!ZdajZoYbBHcccwoqPi:hs1"} > save_changes

..and then it wedges. So it blocks on let mut stripped_room_state = self.stripped_room_state.write().unwrap();

kegsay commented 4 months ago

This implicates this block of code:

    async fn get_state_events(
        &self,
        room_id: &RoomId,
        event_type: StateEventType,
    ) -> Result<Vec<RawAnySyncOrStrippedState>> {
        fn get_events<T>(
            state_map: &HashMap<OwnedRoomId, HashMap<StateEventType, HashMap<String, Raw<T>>>>,
            room_id: &RoomId,
            event_type: &StateEventType,
            to_enum: fn(Raw<T>) -> RawAnySyncOrStrippedState,
        ) -> Option<Vec<RawAnySyncOrStrippedState>> {
            let state_events = state_map.get(room_id)?.get(event_type)?;
            Some(state_events.values().cloned().map(to_enum).collect())
        }

        Ok(get_events(
            &self.stripped_room_state.read().unwrap(),
            room_id,
            &event_type,
            RawAnySyncOrStrippedState::Stripped,
        )
        .or_else(|| {
            get_events(
                &self.room_state.read().unwrap(),
                room_id,
                &event_type,
                RawAnySyncOrStrippedState::Sync,
            )
        })
        .unwrap_or_default())
    }

because this block acquires a read lock on stripped_room_state then potentially tries to acquire a read lock on room_state which it can't do - it's an ABBA deadlock.

I need confirmation from rust folks if this is how the locking works.

kegsay commented 4 months ago

@poljar confirms that the stripped_room_state lock is still held when the room_state lock is acquired. As such, this would cause a deadlock.

kegsay commented 4 months ago

https://github.com/matrix-org/complement-crypto/pull/115 moves us away from using the memory store https://github.com/matrix-org/matrix-rust-sdk/pull/3668 fixes the memory store.