mozilla / janus-plugin-sfu

Janus plugin to act as a kind of SFU for game networking data.
Mozilla Public License 2.0

Possible split brain issue #62

Closed vincentfretin closed 3 years ago

vincentfretin commented 3 years ago

With latest master https://github.com/mozilla/janus-plugin-sfu/tree/3694c36040eea4de1e80b83447a3625156454956 I reproduced an issue that is quite mind-boggling. I had several reported cases where a person said that after entering a room, they saw nobody. Okay, I thought it was just a connection problem on their end. But here is a case I reproduced when testing with another user. First, I assure you that the room id really was the same; we exchanged the link via mail or hmd.link :)

So for a given room id, we had two separate groups:

Parallel rooms. :D We triggered the issue on several occasions, with a single deployed janus instance, with message_threads=3 or even with message_threads=1. Every time we triggered the issue, it was because the iPhone 6s connected to the room...

Without restarting janus, testing again right away with just another room id, we were 5 in the room...

I know this project is not maintained and I'm on my own; I just wanted to leave a trace of the issue for the archive. :D

I don't know much about Rust and threads; after reading the code I couldn't figure out what could go wrong.

Did you experience similar issues when you were using the code?

vincentfretin commented 3 years ago

I made additional changes for logging what's going on based on my feature/api-v15 branch (edit: now merged in master): https://github.com/Synantoo/janus-plugin-sfu/compare/master...Synantoo:bugfix/splitbrain

vincentfretin commented 3 years ago

I reverted the switchboard changes in my branch; let's see if I still reproduce the issue with the old switchboard.

mqp commented 3 years ago

Thanks for reporting. Unfortunately, I don't have any leads on what might cause a problem like this. There were some changes to the switchboard code made in early 2020 that might have been slightly fishy, because I had already left the Hubs project and was just helping out in my spare time, so I didn't personally test them as much as usual. I'm guessing that if there's a problem it was introduced around then.

vincentfretin commented 3 years ago

With the old switchboard I haven't reproduced the issue so far. Here are some possible leads on the regression, or maybe not; I'm just noting the differences that aren't simple for me :)

Previous code used Entry::Occupied: https://github.com/mozilla/janus-plugin-sfu/blob/2e70c53bbe8820e9ca067977b26c08eb95af5fef/src/switchboard.rs#L150

In the new code, why do we need to clone the Arc<Session> again here: https://github.com/mozilla/janus-plugin-sfu/blob/b0968c6e9df8f4c69e2e80ceb4b0fabf6ca91734/src/switchboard.rs#L155

vincentfretin commented 3 years ago

With the revert to the old switchboard, I definitely have fewer cases of someone alone in the room, not seeing the others. And I have fewer reports about audio issues, mainly one person not hearing another while all the others could. Maybe there was an issue with relaying packets, maybe there was an issue with switchboard.publisher_to_subscribers not listing everyone either. Or maybe not, I don't know.

vincentfretin commented 3 years ago

For the extra clone I talked about here https://github.com/mozilla/janus-plugin-sfu/blob/b0968c6e9df8f4c69e2e80ceb4b0fabf6ca91734/src/switchboard.rs#L155 I think it's just an error. Before, it was room.clone(), a string; when you adapted the code in this PR, changing room to session, I think you left the clone there, but I don't think it's needed, and maybe it's the source of the issues I had? https://github.com/mozilla/janus-plugin-sfu/pull/54/files#diff-d2dc2ff1202aef3dbe8f5811a2a923e929ba458fac1af5a341f9671b89dd2573L149 It's funny that you changed the type from RoomId to UserId here for the room parameter. I did a PR to fix the type here: https://github.com/mozilla/janus-plugin-sfu/pull/63

mqp commented 3 years ago

I will take a closer look at this tonight and figure out what I make of that clone. I agree that based on the history it's suspicious.

mqp commented 3 years ago

Well, I don't think there's anything strange about that clone. The switchboard stores Arcs for sessions with the expectation that eventually, the session will leave the room it's joined, if any, and disconnect, at which point there will be no more outstanding references to it and it will get dropped. Every time we store a session handle in a different data structure, we clone the Arc and put a copy there. I don't think there's anything wrong with it.
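A minimal, self-contained sketch of the pattern described here (the types are simplified stand-ins, not the plugin's real Session or switchboard): each structure stores its own clone of the Arc, and the underlying Session is only dropped once every clone is gone.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Simplified stand-ins for the plugin's types.
struct Session { user_id: String }
type RoomId = String;

fn main() {
    let session = Arc::new(Session { user_id: "alice".into() });

    let mut publishers_by_room: HashMap<RoomId, Vec<Arc<Session>>> = HashMap::new();
    let mut publishers_by_user: HashMap<String, Arc<Session>> = HashMap::new();

    // Each structure gets its own clone of the Arc; only the refcount changes,
    // all clones point at the same Session.
    publishers_by_room.entry("room-1".into()).or_insert_with(Vec::new).push(Arc::clone(&session));
    publishers_by_user.insert(session.user_id.clone(), Arc::clone(&session));

    // 1 local binding + 1 per structure = 3 strong references.
    assert_eq!(Arc::strong_count(&session), 3);

    // Once every structure drops its clone (and the local binding goes away),
    // the Session itself is dropped.
    publishers_by_room.clear();
    publishers_by_user.clear();
    assert_eq!(Arc::strong_count(&session), 1);
}
```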

I'll spend a bit of time looking over the switchboard code with your comments in mind and see if I can notice anything. I don't know how to reproduce the problem. Whenever I use my test page and connect a bunch of users in different tabs they seem to correctly wind up communicating.

Another thing I can do is make it so that you can dump the state of the switchboard via an admin message to the plugin -- then the next time you encounter the problem, we can hopefully get some useful information.

vincentfretin commented 3 years ago

Ok, thanks for the explanation about Arc. I should read the Rust documentation or tutorial to learn more about it.

It's really hard for me to reproduce the problem too. I never reproduced it alone with different tabs in the same browser on the same PC, or with different browsers on the same PC, or with my PC and my iPad on my home Wi-Fi. I only reproduced the parallel-rooms issue when my colleague's old iPhone 6s was involved. My colleague did lots of tests with different devices, see this doc with the two sheets https://docs.google.com/spreadsheets/d/1LAZoti60M0j6c9_38DpdoqAIK6QO3rgz1ufCeqr57PE/edit?usp=sharing He did the tests on the same Wi-Fi network with different devices; it doesn't make any sense to me that just reverting the switchboard changes made this much difference. You can look at my branch: I kept some changes like join_kind and reverted only the structural changes to the switchboard. I really need to keep going with the project, so I may not be able to test any new changes to the switchboard in the coming weeks.

vincentfretin commented 3 years ago

I rebased my bugfix/splitbrain branch against latest master and deployed that.

vincentfretin commented 3 years ago

Really, I don't see anything wrong in the code either. The real difference for me is the structure: occupants: HashMap<RoomId, Vec<Arc<Session>>> is now publishers_by_room: MultiMap<RoomId, Arc<Session>>. So I was wondering if it could be an issue in the multimap dependency we use, or in how we use it. It's a little hard for me to look at it or to say whether there are subtle differences between what we had with the occupants HashMap and the new publishers_by_room MultiMap (which uses a HashMap internally). https://github.com/havarnov/multimap/

I can maybe put back the old HashMap<RoomId, Vec<Arc<Session>>> for publishers_by_room (and modify leave_publisher to do something like what we had in leave_room) and test that, and do the same thing for subscribers_by_user.
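For reference, a rough sketch of what that could look like, assuming a plain HashMap<RoomId, Vec<Arc<Session>>> and illustrative join_publisher/leave_publisher bodies (not the plugin's actual code):

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;
use std::sync::Arc;

struct Session;          // stand-in for the plugin's Session
type RoomId = String;

struct Switchboard {
    publishers_by_room: HashMap<RoomId, Vec<Arc<Session>>>,
}

impl Switchboard {
    fn join_publisher(&mut self, room: RoomId, session: Arc<Session>) {
        // Same shape as the old occupants code: create the Vec on first join.
        self.publishers_by_room.entry(room).or_insert_with(Vec::new).push(session);
    }

    fn leave_publisher(&mut self, room: &RoomId, session: &Arc<Session>) {
        // Remove this session, and drop the key entirely once the room is empty,
        // like the old cohabitators.remove_entry() behavior.
        if let Entry::Occupied(mut entry) = self.publishers_by_room.entry(room.clone()) {
            entry.get_mut().retain(|s| !Arc::ptr_eq(s, session));
            if entry.get().is_empty() {
                entry.remove_entry();
            }
        }
    }
}
```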

vincentfretin commented 3 years ago

Old code:

self.occupants.entry(room).or_insert_with(Vec::new).push(session);

New code:

self.publishers_by_room.insert(room, session);

The insert does the following: https://github.com/havarnov/multimap/blob/5699a9bd50370139a737ccf974312a0767bd48ca/src/lib.rs#L170-L179

mqp commented 3 years ago

Yeah, I agree with you that it seems like that's one of the only relevant changes, but the multimap code isn't very complicated and it looks right. There must be something I'm missing.

Since it seems possibly related to which device is connecting, I've been thinking about what happens if clients do "weird" connection patterns, like reconnecting, trying to make multiple publisher connections, etc., but I can't yet think of any case that could cause the symptom you reported.

vincentfretin commented 3 years ago

I'm using naf-janus-adapter 3.0.20 with some additional fixes in my 3.0.x branch https://github.com/Synantoo/naf-janus-adapter/tree/3.0.x A network issue, and so a NAF.connection.adapter.reconnect(), may well have happened at some point when the problem occurred. And this will reuse the same user_id (NAF.clientId), so indeed it can create multiple publisher connections.

Hmm, in this case I'm concerned by this line: self.publishers_by_user.entry(user).or_insert(session.clone()); It won't set the new session if the key already exists, so I think the other users are told to subscribe to the old session. naf-janus-adapter says "Janus still has previous session for this client. Reconnecting in 10s." when the same user_id is already in the room and performs a delayed reconnect; I saw that several times during my tests. I think that even after a reconnect, the old session may not have ended yet on the server, and the other participants may again have subscribed to the wrong session.

This can explain a lot of the issues I had, like someone not hearing someone else in a room of 3, where just refreshing the page (so changing their user_id) fixed the issue. I haven't received any new reports of this issue since I reverted to the old switchboard. One case where the issue showed up was a user connected with an iPhone from Singapore, and he may have been on the move, so switching between Wi-Fi and 4G as well... The janus server is in France, so it really can be a reconnect case.
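To illustrate the concern about that line, a small standalone example of std's Entry::or_insert semantics: when the key already exists, the stored value (the old session) is kept, whereas insert() would overwrite it.

```rust
use std::collections::HashMap;

fn main() {
    let mut publishers_by_user: HashMap<&str, &str> = HashMap::new();

    // First (old) publisher session for this user id.
    publishers_by_user.entry("alice").or_insert("old-session");

    // Reconnect with the same user id: or_insert leaves the existing value alone,
    // so lookups keep handing out the old session.
    publishers_by_user.entry("alice").or_insert("new-session");
    assert_eq!(publishers_by_user["alice"], "old-session");

    // insert() would overwrite instead (the change proposed, and later rejected, below).
    publishers_by_user.insert("alice", "new-session");
    assert_eq!(publishers_by_user["alice"], "new-session");
}
```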

Looking at the multimap code, the insert function does a lot more work before setting the vector when the key doesn't exist: first a self.entry(k) that calls self.inner (the HashMap) .entry, then a match against HashMapEntry::Occupied/Vacant that creates an intermediate struct wrapping the HashMap entry, then a match against Entry::Occupied/Vacant, then it creates a new vector and calls Entry::Vacant (MultiMap) .insert_vec, which calls self.inner (HashMapEntry::Vacant) .insert, inserting (or replacing) the vector in the HashMap. Compare this with the old code, where a new vec is created right away if the key doesn't exist: self.occupants.entry(room).or_insert_with(Vec::new).push(session);
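Not the library's actual code (see the link above), but a toy version of the path described here, built on a plain HashMap<K, Vec<V>>:

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

// Toy multimap-style insert, roughly the shape described above: go through the
// HashMap entry, then either push onto the existing Vec or insert a new one.
fn multimap_insert<K: std::hash::Hash + Eq, V>(inner: &mut HashMap<K, Vec<V>>, k: K, v: V) {
    match inner.entry(k) {
        Entry::Occupied(mut entry) => entry.get_mut().push(v),
        Entry::Vacant(entry) => { entry.insert(vec![v]); }
    }
}

fn main() {
    let mut rooms: HashMap<&str, Vec<&str>> = HashMap::new();
    multimap_insert(&mut rooms, "room-1", "alice");
    multimap_insert(&mut rooms, "room-1", "bob");
    assert_eq!(rooms["room-1"], vec!["alice", "bob"]);
}
```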

Is it possible that two threads execute that code at the same time and we end up with two self.inner(HashMapEntry::Vacant).insert calls, the second call completely replacing the vector and so losing the value set by the first call? From my understanding it shouldn't be possible with message_threads=1; process_join is executed one at a time, right? Even if there were simultaneous calls, it wouldn't explain the issue I had, because we call get_users before modifying the structure with join_publisher. I know that janus creates a thread for each peer connection's media, but the plugin's messages are handled in only one thread, right (if message_threads=1)?

Btw, previously we removed the entry in the occupants HashMap if the value was an empty vector, with cohabitators.remove_entry(). Looking at the multimap code and the way we use it, I think the key is now always kept in the HashMap with an empty vec.

mqp commented 3 years ago

If message_threads=1, that means that only one stack of handle_message is running at once. Janus could still call other plugin callbacks (e.g. create_session, destroy_session, incoming_rtp) concurrently from other threads while messages are being processed on the message handler thread.

However, I don't think a race on the switchboard structure can explain it, no matter what the settings are. The switchboard is protected by a reader-writer lock; for example, in process_join, we acquire the write lock once we validate that the join is authorized and don't release it until we are done with everything. The design of Rust's lock libraries makes it difficult to cause bugs of this nature.
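A minimal sketch of that locking pattern with placeholder names (not the plugin's actual process_join or SWITCHBOARD definition): the write guard covers the whole read-then-modify sequence, so no other thread can observe or mutate the switchboard in between.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

type RoomId = String;
struct Session;

struct Switchboard {
    publishers_by_room: HashMap<RoomId, Vec<Arc<Session>>>,
}

// Illustrative only: a lazily-initialized global standing in for the plugin's SWITCHBOARD.
static SWITCHBOARD: std::sync::OnceLock<RwLock<Switchboard>> = std::sync::OnceLock::new();

fn process_join(room: RoomId, session: Arc<Session>) {
    let lock = SWITCHBOARD
        .get_or_init(|| RwLock::new(Switchboard { publishers_by_room: HashMap::new() }));

    // The write guard is held for the whole read-then-modify sequence.
    let mut switchboard = lock.write().expect("Switchboard lock poisoned; can't continue.");
    let already_there = switchboard.publishers_by_room.get(&room).map_or(0, |v| v.len());
    println!("{} publisher(s) already in the room", already_there);
    switchboard.publishers_by_room.entry(room).or_insert_with(Vec::new).push(session);
    // Guard dropped here, releasing the write lock.
}
```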

Btw previously we removed the entry in occupants HashMap if the value was an empty vector with cohabitators.remove_entry(). Looking at the multimap code and the way we use it, I think the key is always kept in the HashMap with an empty vec.

I agree with that, although I don't think it's causing any correctness problems that I can see.

mqp commented 3 years ago

I'm inclined to get rid of the multimap dependency, though, since it's not buying us much except having one more thing to worry about. I probably shouldn't have introduced it.

vincentfretin commented 3 years ago

Yes, it's also used in the BidirectionalMultimap. I agree with you that it doesn't buy us much; switching back to a HashMap<RoomId, Vec<Arc<Session>>> is about the same number of lines of code, and it's one less dependency to worry about.

vincentfretin commented 3 years ago

In join_publisher, I think we should change self.publishers_by_user.entry(user).or_insert(session.clone()); to self.publishers_by_user.insert(user, session.clone()); to overwrite the value in case of a publisher reconnect and a possible race between others subscribing (using get_publisher) and the old publisher session ending after that, just in case.

vincentfretin commented 3 years ago

Actually there may be other issues with a publisher reconnect where the old session ends afterwards, because in leave_publisher and leave_subscriber we remove the entry in publishers_by_user by user_id: self.publishers_by_user.remove(&joined.user_id); In that case we are in an inconsistent state until the client disconnects and the new session is also ended, calling leave_publisher again for the new session. In naf-janus-adapter, it does a delayed reconnect of 10s until the user_id is no longer in the room (get_users on the server), i.e. until publishers_by_room.get_vec(room) no longer includes a session with this user_id.

vincentfretin commented 3 years ago

During my tests, like switching the network off and on to trigger the reconnect by naf-janus-adapter, I could sometimes see the message "Janus still has previous session for this client. Reconnecting in 10s." three times. The first time, OK, it's the old session that is still in publishers_by_room.get_vec(room); then it waits 10s before doing the disconnect followed by connect, but I see the "Janus still has previous session for this client. Reconnecting in 10s." message again, and I think at that point it may just be the new session that is in publishers_by_room.get_vec(room), the old one not being there anymore. On the client side, disconnect() will close the peer connection, and connect() will create a new peer connection, so a new handle and so a new session. But the server may not have ended the old session yet (it may take 1s, I don't know) before the new session is created. Maybe we should force ending all publisher and subscriber sessions for the given user_id before processing the publisher join with the same user_id.

vincentfretin commented 3 years ago

For the forced session end I was talking about: if we call end_session like it's done in process_kick, does this call our destroy_session right away, or is it async? I guess it's async, because we do SWITCHBOARD.read() there.

Why do we do SWITCHBOARD.write() in process_data instead of SWITCHBOARD.read()? We don't change anything in process_data as far as I can tell. Replacing it with let switchboard = SWITCHBOARD.read().expect("Switchboard lock poisoned; can't continue."); seems fine and would benefit from having message_threads > 1, because read() doesn't block other read() calls, but write() does. https://doc.rust-lang.org/std/sync/struct.RwLock.html And if we use NAF.connection.adapter.reliableTransport = "websocket" (instead of the default "datachannel") for chat and NAF.connection.adapter.unreliableTransport = "websocket" for component updates, we can go through process_data a lot.
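A small standalone example of the std RwLock behavior this relies on: several read() guards can be held at once, while write() is exclusive.

```rust
use std::sync::{Arc, RwLock};
use std::thread;

fn main() {
    let switchboard = Arc::new(RwLock::new(0u64)); // stand-in for the real Switchboard

    // Many "process_data"-style readers can hold the read lock concurrently...
    let readers: Vec<_> = (0..4).map(|_| {
        let sb = Arc::clone(&switchboard);
        thread::spawn(move || {
            let guard = sb.read().expect("Switchboard lock poisoned; can't continue.");
            println!("relaying with switchboard state {}", *guard);
        })
    }).collect();
    for r in readers { r.join().unwrap(); }

    // ...while a write() guard (as process_join needs) excludes everyone else.
    *switchboard.write().expect("Switchboard lock poisoned; can't continue.") += 1;
}
```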

vincentfretin commented 3 years ago

Now, on the client side, I close the publisher peer connection right away if the user_id is already in the room (instead of waiting 10s before closing it and trying again), because there can be a delay before the server ends this session: https://github.com/Synantoo/naf-janus-adapter/commit/ed601167d651a1708851bdd5c2fba6b88f17c163

And on the server side, in process_join, return right away if the user_id is already a publisher in the room and don't notify other participants, because we know that the client (naf-janus-adapter) will reconnect in 10s: https://github.com/Synantoo/janus-plugin-sfu/commit/723218624e63b9891dc8bae61b8047609f90286b It may not be a change you want in master, because it's related to how naf-janus-adapter handles this case. [TODO: I shouldn't do the addOccupant in this case in naf-janus-adapter, otherwise I will hear audio from the others for 10s, but they won't ever know I'm there and listening, and they can't hear me.]
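A rough sketch of that server-side idea (the linked commit is the real change; the names, types, and signatures below are hypothetical stand-ins, not the plugin's actual code):

```rust
use std::collections::HashMap;
use std::sync::Arc;

type RoomId = String;
type UserId = String;
struct Session { user_id: UserId }

struct Switchboard {
    publishers_by_room: HashMap<RoomId, Vec<Arc<Session>>>,
}

// Hypothetical helper: is this user id already publishing in the room?
fn user_already_publishing(switchboard: &Switchboard, room: &RoomId, user: &UserId) -> bool {
    switchboard
        .publishers_by_room
        .get(room)
        .map_or(false, |sessions| sessions.iter().any(|s| &s.user_id == user))
}

fn process_join(switchboard: &mut Switchboard, room: RoomId, session: Arc<Session>) {
    if user_already_publishing(switchboard, &room, &session.user_id) {
        // Same user id already publishing: bail out without touching the
        // switchboard or notifying the room; the client will retry later.
        return;
    }
    switchboard
        .publishers_by_room
        .entry(room)
        .or_insert_with(Vec::new)
        .push(session);
    // ...then notify the other participants that someone joined.
}
```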

mqp commented 3 years ago

For the force end session I was talking about, if we call end_session like it's done in process_kick, does this call our destroy_session right away or is it async? I guess it's async because we do SWITCHBOARD.read() there.

I believe it's async. See the relevant Janus code here.

More on the rest of your notes later.

vincentfretin commented 3 years ago

From my previous comments, there is no inconsistency in the structures if we keep self.publishers_by_user.entry(user).or_insert(session.clone());, but there is if we use self.publishers_by_user.insert(user, session.clone());, so don't change that line. :)
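A small standalone illustration of that inconsistency: if join_publisher overwrote with insert(), the later teardown of the old session would remove (by user_id) the entry that now points at the new session.

```rust
use std::collections::HashMap;

fn main() {
    let mut publishers_by_user: HashMap<&str, &str> = HashMap::new();

    // Old session joins, then the reconnect joins with insert() overwriting it.
    publishers_by_user.insert("alice", "old-session");
    publishers_by_user.insert("alice", "new-session");

    // When the *old* session finally gets torn down, leave_publisher removes by
    // user_id and throws away the entry that now points at the new session.
    publishers_by_user.remove("alice");
    assert!(publishers_by_user.get("alice").is_none()); // the new session is no longer findable

    // With or_insert() the stored value stays the old session until it leaves,
    // so removal by user_id stays consistent -- which is why that line should stay.
}
```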

From my understanding, with the new or even the old switchboard, this scenario can happen:

I don't know what would be better here. On the client side (related to my app, really): signal to Alice that the others can't hear her, and signal to the others that Alice currently has an audio issue. On the server side: maybe try to end the old session sooner by calling end_session on the old session for this user_id, still do the join_publisher (because the publisher really is in the room and will subscribe to the others), and don't notify the other participants that someone joined (the other participants normally never received that Alice left and still have the connection to Alice anyway, so it's useless to do it again). We wait until the publisher really has a unique session before notifying the participants.

vincentfretin commented 3 years ago

I tested the following change https://github.com/Synantoo/janus-plugin-sfu/commit/de5e059d4019b6267ff5771303dba46ab76d893c on top of the no-multimap branch. This seems to help; I only get the message "Janus still has previous session for this client. Reconnecting in 10s." once now.

mqp commented 3 years ago

Thinking about the issues related to the publishers_by_user structure:

In general, when I was writing this code, I wasn't considering the case where someone tried to make two publisher connections to the same room with the same user ID. This was basically just an oversight, but it was an understandable one, because in the Hubs client at that time we generated random unique IDs per tab to serve as the "user ID" instead of using a meaningful ID to identify someone's account. So it really wasn't something we expected to happen with a normal client.

Nowadays, there are two cases where a normal client might reasonably want to have multiple publishers with the same user ID in the same room:

1) Connection from multiple devices at once, or multiple tabs.
2) A reconnect in which, for whatever reason, the new connection spins up while the server still thinks the old one is connected. It seems that this should behave just the same as if you opened a new tab, and then closed the first one quickly afterward.

In both of these cases, it seems like the best possible behavior is to just be OK with it, and treat each one as a fully operative connection. That is, when either one sends audio, video, or data, give it to everyone else in the room, and when other people in the room do things like send a FIR or send data, give it to all publishers.

So this at least implies that this publishers_by_user structure should allow multiple publishers per user. It's not clear to me what it says about other parts of the code.
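One possible shape for that, sketched with placeholder types (not a worked-out design): keep every publisher session a user currently has, and fan anything addressed to the user out to all of them.

```rust
use std::collections::HashMap;
use std::sync::Arc;

type UserId = String;
struct Session;

// One possible shape: every publisher session a user currently has.
struct Switchboard {
    publishers_by_user: HashMap<UserId, Vec<Arc<Session>>>,
}

impl Switchboard {
    fn join_publisher(&mut self, user: UserId, session: Arc<Session>) {
        // A second tab or a reconnect just adds another session for the same user.
        self.publishers_by_user.entry(user).or_insert_with(Vec::new).push(session);
    }

    // Anything addressed to a user (FIR, data, etc.) would fan out to all of these.
    fn publisher_sessions(&self, user: &UserId) -> &[Arc<Session>] {
        self.publishers_by_user.get(user).map(Vec::as_slice).unwrap_or(&[])
    }
}
```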

What do you think?

vincentfretin commented 3 years ago

Today, I have a different user id when I refresh the page or open a new tab for the same room id. Maybe that wasn't the case for Hubs? The user id was kept the same for the authenticated or unauthenticated user? In that case, yes, the reconnect should have behaved similarly to opening a new tab and closing the previous one.

Anyway, in the future I may want to be in the room several times with the same user id. I was asked, for example, to connect to a room on desktop and also connect to the room with an iPad, mainly for use with a collaborative whiteboard in parallel. I could also potentially be connected in the room on both desktop and tablet: avatar and audio only on desktop, and using the camera of the iPad, for example. It may not be a feature I'll implement, but there can indeed be a need for multiple publishers for the same user id in that case.

mqp commented 3 years ago

Oh, I see. With Hubs, I'm not sure about unauthenticated users, but yeah, IIRC authenticated users had a fixed user ID.

vincentfretin commented 3 years ago

If we want to support multiple publishers for a user id, we will need changes in both the plugin and in naf-janus-adapter. Currently we create one handle, and so a single RTCPeerConnection with a single audio track and a single video track, per user. If a user id can have several publishers, that would mean several RTCPeerConnections for a given user id. I wouldn't know exactly which audio or video tracks to use in my app for a given user in this case, which is why I preferred to just end the old session. Maybe once the janus multistream branch is merged, we could change all this to have a single RTCPeerConnection with one audio track and several video tracks when subscribing to a user. This would also allow a user to share both screen and camera from the same device, or to share the screen from one device and the camera from another.

vincentfretin commented 3 years ago

With https://github.com/mozilla/janus-plugin-sfu/pull/68 and https://github.com/mozilla/janus-plugin-sfu/pull/73#issuecomment-776649237 I think I can close this issue. We can keep discussing the multi-publisher stuff here or create another issue to discuss it further, but it's not currently a high priority on my roadmap.

mqp commented 3 years ago

OK. If you're convinced the multimap change fixed your issue, I will probably take some time to really stare bullets into it and try to figure out what the deal is, since if it's a bug in the multimap library I'm sure upstream will appreciate it.