Closed AndrewJudson closed 4 years ago
certainly could be, would you mind trying it on a version of the SFU without that change? here's the commit to try 420220e3669fb6bee289cc4e81ea7d4d884594aa
thanks for reporting!
also feel free to review my closed PR, i am not even an intermediate rust programmer, so bugs may be obvious
fwiw I did test this extensively with bots, but perhaps got lucky. what browser?
Also, as noted on the PR, the change does introduce problems if you are joining multiple rooms with the same user, and publishing on rooms other than the first room joined. Based upon my understanding of the SFU this scenario was not possible anyway for other reasons, but if that is the scenario you are testing in somehow, that would explain it, so lmk.
I don't see the issue when I use the older commit. Unfortunately I am even more of a Rust novice... I am not doing anything unusual - users are all joining a single room and publishing in that room. I am using Chrome.
OK I took a stab at updating the switchboard to deal with the case where a user is in more than one room (but has one publisher in a room.) Unlikely to fix your issue, but figured it was worth a shot since I didn't see any other obvious problems with the old version. Can you test this commit: 7a28b8f5edac16f00468e1082463133641a7ac85
Also, just checking, but is there a chance you are using the same user_id for different sessions?
For our user case we use a one-time session id for the user id in janus SFU, but if you are using the surrogate key for an account, for example, could expose the issue even if you are connecting in different browser tabs.
I am using a different user_id for each session. I will test the commit in the morning
great thanks!
I still see this issue when I use 7a28b8f
ok. would you mind walking me through what you are doing in more detail? this will be easier for me to address if i am able to reproduce the problem. as i mentioned, in our load testing with the new plugin, did not see any publisher failures across many runs with many connections.
I was able to replicate it by opening many windows of tiny.html from the SFU testing client, pointed at a local Janus through WSS. It seems to happen when you quickly open many windows while simultaneously closing a few of those (might also work with just opening a lot of windows but I have had more luck triggering it by doing this). Note that it even if you wait awhile after it starts triggering and open a new window, you still get the error, so I don't think it is a race from seeing occupants - attempting to subscribe, something else is going on.
thanks for that info. i'll see how far i can get debugging it. for now, i guess just run the working version. odds are i will have to wave the white flag and roll this back :) maybe @mqp can give some insights to where my patch has gone wrong.
oh here's an idea, the other PR I opened introduced multithreaded message handling, perhaps we are seeing problems from that - ie maybe the new code in this recent PR is not thread safe for some reason. you can test this by setting message_threads = 1
in the plugin cfg. mind trying?
I still get the error when message_threads = 1 in the plugin cfg
ok thanks, good to know.
I'm pretty surprised, too, with message_threads = 1
. I don't see anything obvious wrong with the new get_publisher
implementation, so I assumed it was some sneaky race condition.
Okay, pretty sure I found the problem. See my comment in #50.
Just started noticing this today, using Janus 0.9.0 and master of the SFU. When the client joins a room, it gets a list of all of the users - e.g. {a, b, c, d, e, f}. However, they may fail to subscribe to users {b, c, d} giving the above error, whereas other clients may fail to subscribe to {d, e, f} or other subsets - so I know that the users have not left (I also get a message when they leave). Perhaps this is related to the recent code change that looks up the publishers?