mozilla / janus-plugin-sfu

Janus plugin to act as a kind of SFU for game networking data.
Mozilla Public License 2.0
135 stars 40 forks source link

Intermittent error "Can't subscribe to a nonexistent publisher" despite that user being in the room #52

Closed AndrewJudson closed 4 years ago

AndrewJudson commented 4 years ago

Just started noticing this today, using Janus 0.9.0 and master of the SFU. When the client joins a room, it gets a list of all of the users - e.g. {a, b, c, d, e, f}. However, they may fail to subscribe to users {b, c, d} giving the above error, whereas other clients may fail to subscribe to {d, e, f} or other subsets - so I know that the users have not left (I also get a message when they leave). Perhaps this is related to the recent code change that looks up the publishers?

gfodor commented 4 years ago

certainly could be, would you mind trying it on a version of the SFU without that change? here's the commit to try 420220e3669fb6bee289cc4e81ea7d4d884594aa

gfodor commented 4 years ago

thanks for reporting!

gfodor commented 4 years ago

also feel free to review my closed PR, i am not even an intermediate rust programmer, so bugs may be obvious

gfodor commented 4 years ago

fwiw I did test this extensively with bots, but perhaps got lucky. what browser?

gfodor commented 4 years ago

Also, as noted on the PR, the change does introduce problems if you are joining multiple rooms with the same user, and publishing on rooms other than the first room joined. Based upon my understanding of the SFU this scenario was not possible anyway for other reasons, but if that is the scenario you are testing in somehow, that would explain it, so lmk.

AndrewJudson commented 4 years ago

I don't see the issue when I use the older commit. Unfortunately I am even more of a Rust novice... I am not doing anything unusual - users are all joining a single room and publishing in that room. I am using Chrome.

gfodor commented 4 years ago

OK I took a stab at updating the switchboard to deal with the case where a user is in more than one room (but has one publisher in a room.) Unlikely to fix your issue, but figured it was worth a shot since I didn't see any other obvious problems with the old version. Can you test this commit: 7a28b8f5edac16f00468e1082463133641a7ac85

gfodor commented 4 years ago

Also, just checking, but is there a chance you are using the same user_id for different sessions?

gfodor commented 4 years ago

For our user case we use a one-time session id for the user id in janus SFU, but if you are using the surrogate key for an account, for example, could expose the issue even if you are connecting in different browser tabs.

AndrewJudson commented 4 years ago

I am using a different user_id for each session. I will test the commit in the morning

gfodor commented 4 years ago

great thanks!

AndrewJudson commented 4 years ago

I still see this issue when I use 7a28b8f

gfodor commented 4 years ago

ok. would you mind walking me through what you are doing in more detail? this will be easier for me to address if i am able to reproduce the problem. as i mentioned, in our load testing with the new plugin, did not see any publisher failures across many runs with many connections.

AndrewJudson commented 4 years ago

I was able to replicate it by opening many windows of tiny.html from the SFU testing client, pointed at a local Janus through WSS. It seems to happen when you quickly open many windows while simultaneously closing a few of those (might also work with just opening a lot of windows but I have had more luck triggering it by doing this). Note that it even if you wait awhile after it starts triggering and open a new window, you still get the error, so I don't think it is a race from seeing occupants - attempting to subscribe, something else is going on.

gfodor commented 4 years ago

thanks for that info. i'll see how far i can get debugging it. for now, i guess just run the working version. odds are i will have to wave the white flag and roll this back :) maybe @mqp can give some insights to where my patch has gone wrong.

gfodor commented 4 years ago

oh here's an idea, the other PR I opened introduced multithreaded message handling, perhaps we are seeing problems from that - ie maybe the new code in this recent PR is not thread safe for some reason. you can test this by setting message_threads = 1 in the plugin cfg. mind trying?

AndrewJudson commented 4 years ago

I still get the error when message_threads = 1 in the plugin cfg

gfodor commented 4 years ago

ok thanks, good to know.

mqp commented 4 years ago

I'm pretty surprised, too, with message_threads = 1. I don't see anything obvious wrong with the new get_publisher implementation, so I assumed it was some sneaky race condition.

mqp commented 4 years ago

Okay, pretty sure I found the problem. See my comment in #50.