matrix-org / matrix-rust-sdk

Matrix Client-Server SDK for Rust
Apache License 2.0
1.26k stars 252 forks source link

crypto: events can be sent without an up-to-date member list causing UTDs #3529

Closed kegsay closed 5 months ago

kegsay commented 5 months ago

I've seen two bug reports from the same user on EXA which shows the following:

The SDK is supposed to hit /members prior to sending encrypted events to ensure they have an up-to-date member list. This is done in https://github.com/matrix-org/matrix-rust-sdk/blob/0c97c85aea952887b3d7677932b96af492808a42/crates/matrix-sdk/src/room/futures.rs#L166-L168

I see no evidence of a /members hit when sending the event.

However, this is complicated by the several confounding factors:

These factors mean there's several things which could be going wrong here:

There's a lot of moving parts here but I've done due diligence:

poljar commented 5 months ago

Is there a race condition where you can send an event after you join but before the rust SDK layer believes you are joined, so sync_members doesn't actually sync members?

I don't think so, due to this check happening at the start of the send: https://github.com/matrix-org/matrix-rust-sdk/blob/abda95979691c4d51d6b7fc06a9777d9e32a795d/crates/matrix-sdk/src/room/futures.rs#L169

Which does this check:

https://github.com/matrix-org/matrix-rust-sdk/blob/abda95979691c4d51d6b7fc06a9777d9e32a795d/crates/matrix-sdk/src/room/mod.rs#L2524-L2529

Later on, the same piece of data is used to skip the /members request:

https://github.com/matrix-org/matrix-rust-sdk/blob/abda95979691c4d51d6b7fc06a9777d9e32a795d/crates/matrix-sdk/src/room/mod.rs#L517-L522

In conclusion, if the SDK doesn't think you are joined, it doesn't let you send a message.

Does the inability to upload OTKs cause the state machine to terminate early or something?

I think this is an unrelated issue, failing to upload OTKs would produce UTDs on the EX side, not on the EW side.

poljar commented 5 months ago

the app itself also asks for /members and then immediately cancels the job, causing kotlinx.coroutines.JobCancellationException: Job was cancelled; job=SupervisorJobImpl{Cancelling}@6f96c4a - this is normal behaviour when you back out of a room quickly before it can fully load.

I think this might be the problem, though the cancellation shouldn't set the members_synced flag to true, so next time you try to send a message it should retry the /members request.

To further expand, we have a request deduplication handler:

https://github.com/matrix-org/matrix-rust-sdk/blob/abda95979691c4d51d6b7fc06a9777d9e32a795d/crates/matrix-sdk/src/deduplicating_handler.rs#L29-L38

The handler allows the first caller to acquire a lock and the lock is held for the duration of the call. The second caller then tries to acquire the same lock, though it will do so only after the first caller is done.

The first caller gets the result of the operation and shares it through the lock with the rest of them. So a failure would be propagated correctly to the rest of the callers, but a cancellation, like EX does here, might not be a failure due to this critical line:

https://github.com/matrix-org/matrix-rust-sdk/blob/abda95979691c4d51d6b7fc06a9777d9e32a795d/crates/matrix-sdk/src/deduplicating_handler.rs#L65-L67

bnjbvr commented 5 months ago

the app itself also asks for /members and then immediately cancels the job, causing kotlinx.coroutines.JobCancellationException: Job was cancelled; job=SupervisorJobImpl{Cancelling}@6f96c4a - this is normal behaviour when you back out of a room quickly before it can fully load.

We also have a deduplicating mechanism for some queries, including /members: if there's already another caller doing this query, then the second query is deduplicated.

Now, I didn't think about cancellability when I implemented that deduplication (one extra proof that making it possible to cancel any future makes reasoning really hard). So what could happen:

Ideally, the SDK would "transmit" ownership of the deduplicated query to one of the waiters on the result, but I'm not sure that's what's happening here...

poljar commented 5 months ago

Ideally, the SDK would "transmit" ownership of the deduplicated query to one of the waiters on the result, but I'm not sure that's what's happening here...

I think we assume a successful result, look at my previous reply.

bnjbvr commented 5 months ago

Indeed, I hadn't seen your answer yet while typing mine. Good catch!

kegsay commented 5 months ago

So the failure scenario would be:

poljar commented 5 months ago

Cancel the room.members() call - How is this done at an FFI level? There is no cancel function or anything.

Not sure how it's done in this case, as I haven't checked what EXA is doing, but we provide a general cancellation primitive:

https://github.com/matrix-org/matrix-rust-sdk/blob/abda95979691c4d51d6b7fc06a9777d9e32a795d/bindings/matrix-sdk-ffi/src/task_handle.rs#L4-L39

This causes the member sync code to return a success when it shouldn't, which propagates to the send message code?

Yes, correct. The previous steps are correct as well.

bnjbvr commented 5 months ago

Fwiw, I'm working on adding a test showing the deduplicating handler failure (in this case), and making it more robust against this particular cancelling race.

This doesn't mean it will solve this particular issue entirely (since it's unclear whether that was the root cause or not), but at least it'll make the whole deduplication handling more robust.

poljar commented 5 months ago

Fwiw, I'm working on adding a test showing the deduplicating handler failure (in this case), and making it more robust against this particular cancelling race.

Ah nice, thanks.

kegsay commented 5 months ago

Manually confirmed that https://github.com/matrix-org/matrix-rust-sdk/pull/3533 fixes the issue. Whilst EXA still produces kotlinx.coroutines.JobCancellationException: Job was cancelled; the SDK does now retry /members requests.