nasty race when matrix clients see history for a channel relayed by the frontier user before their own IRC client has joined.

ara4n commented 8 years ago

We don't know if a given user has permission to join an IRC channel unless they attempt to do so. So you can end up with a Matrix user joining a Matrix room, which causes their virtual user to join the irc room. If the user is denied access, then we kick them on Matrix. But in that time they may have seen some messages that other valid users bridged into the Matrix room. We need to better bridge ACLs so they can't try to join in the first place, or block them from joining the Matrix room until their IRC client has also joined. In short, we should never share history from other users unless the user can see it themselves

kegsay commented 8 years ago

This is hard. ASes were specifically designed NOT to be on the critical path for Matrix HS operations. This means it is not possible for us to synchronously attempt to join the #channel, get rejected, then reject the Matrix join request. The idea was that if you wanted that, you'd set up the mythical Policy Server.

ara4n commented 8 years ago

another solution to this would be per-msg ACLs.

ara4n commented 8 years ago

this race is particularly bad during bridge startup whilst it's syncing IRC clients onto the network, as the IRC side of the bridge may not have yet connected, whilst the Matrix side is still visible.

kegsay commented 8 years ago

Though this is mitigated if the startup period is small (e.g. on IPv6 the difference is an order of magnitude: 10 minutes vs 400+ minutes).

ara4n commented 7 years ago

Whilst this bug was originally specifically about leaking +i and +k history, we generally have a problem that history can be leaked to Matrix users for any channel before the IRC virtual user has joined the room. This is a major problem for Freenode, given it means users who are not connected to a channel can still see room history as relayed by the frontier user. Renaming this bug to cover the general problem.

In terms of mitigations:

We've already switched to IPv6, which speeds up the window before the bots have all connected to around 20 minutes
We can speed up the process of calculating the membership lists to be synced by using the new https://github.com/matrix-org/synapse/pull/1680 APIs - this will shave off another 15 minutes or so.
Fix https://github.com/matrix-org/matrix-appservice-irc/issues/303, which makes the race a complete catastrophe (which is what happened this weekend)
However, the race really shouldn't be happening at all. Options include:
- Implement per-message ACLs, and have the bridge specify per-message which Matrix users are allowed to view the message or not based on the connectivity of their IRC bot. This is obviously a tonne of work.
- Implement per-user bridged rooms, a la Leo's sidecar approach, giving each matrix user an entirely separate view of the room, which reflects precisely what their IRC bot sees. This is obviously a huge architectural change, and would scale much worse than the current setup.
- Don't relay messages for /anyone/ in a channel until all the virtual users for that room have joined (ugh, but possibly better than nothing)
- ...um, suggestions?

kegsay commented 7 years ago

Swapping to the new membership list API is the quick win here: it'll (hopefully) be more reliable (eg #303) than /sync for Freenode, and (hopefully) be much faster too, reducing the lag time.

As for the options to kill the race entirely:

Per-message ACLs are probably the most non-invasive, plausible solution, albeit involving the specification of a whole host of new features which currently do not exist in the bridge, Synapse or Matrix itself.
Per-user bridged rooms are basically infeasible. The amount of duplicate messages this would create would prevent the bridge from scaling in any sane way. Either Synapse or the bridge would blow up with the current load on the Freenode bridge using this technique.
Not relaying until all users are either joined or kicked (if they cannot join) doesn't kill the race entirely, so isn't a feasible option. If TCP connections drop silently (which they frequently do), messages will still be relayed until timeouts kick in (be it TCP level or ping timeouts). Even if this was regarded as "acceptable", what are the semantics if your connection does time out? Kick the corresponding Matrix user immediately from the room? That's a huge UX fail.

ara4n commented 7 years ago

I thought we had a solution earlier when discussing this early IRL: for the bridge to only forward traffic for a given room if all the virtual users were connected (i.e. option 3 from above), based on the view of current membership seen by the frontier user. However, this still seems to fail for all the reasons Kegan mentions above. So I think I'm forgetting what the subtlety was...

ara4n commented 5 years ago

Another solution on this could be to use https://github.com/Half-Shot/irc-conntrack in future to ensure that IRC connections rarely drop in the first place. Then, we can perhaps enforce that traffic is only ever bridged matrix->irc for a room if we believe all the expected Matrix users are currently joined to that channel.

kegsay commented 5 years ago

That sounds like a good solution to me. We never had time to unpick IRC connection management from bridge semantics in the past.

It does however come with a bunch of caveats, notably:

Not relaying until all users are either joined or kicked (if they cannot join) doesn't kill the race entirely, so isn't a feasible option. If TCP connections drop silently (which they frequently do), messages will still be relayed until timeouts kick in (be it TCP level or ping timeouts). Even if this was regarded as "acceptable", what are the semantics if your connection does time out? Kick the corresponding Matrix user immediately from the room? That's a huge UX fail.

By enforcing that traffic is only ever bridged matrix -> irc for a room if all the expected Matrix users are joined to the channel, we make the bridge less robust to connection interruptions. E.g if there is a 1% chance of a connection dying at any one time, and you have 50 users in a room, assuming the connections are sufficiently independent, the chance of any single connection dropping is 50%.

This may not be an issue in practice, given how networking is done on the IRC bridge (assuming that hasn't changed much in the past year or so).

ara4n commented 3 years ago

This may not be an issue in practice, given how networking is done on the IRC bridge (assuming that hasn't changed much in the past year or so).

From what we can see of the logs, this happens very rarely in practice these days. TCP connections pinging out to the bridge are rare, and if they happened, we should dutifully go and kick out the Matrix user from the room in question to preserve membership sync.

ara4n commented 3 years ago

@Half-Shot is this not solved by https://github.com/matrix-org/matrix-appservice-irc/pull/1337 (which implements option 3 from https://github.com/matrix-org/matrix-appservice-irc/issues/212#issuecomment-266396852)? Or is this not working due to the concern listed at https://github.com/matrix-org/matrix-appservice-irc/pull/1337#issuecomment-850944025?

Half-Shot commented 3 years ago

I believe this is solved by #1337 (and subsequent PRs) where we now (optionally) drop all traffic I->M for a given room if it's members are not all connected to IRC (and we subsequently will try to join and then kick anyone who explicitly cannot join). There is now a state event sent into rooms by the bridge when this happens, so people can follow along at home.

matrix-org / matrix-appservice-irc

nasty race when matrix clients see history for a channel relayed by the frontier user before their own IRC client has joined. #212