Restarting the server on the fly ignores set levels and mixes everybody at 100%

dingodoppelt commented 3 years ago

Describe the bug Restarting the server sets levels to 100%

To Reproduce restart a running server with clients on it and notice how it ignores the set gain levels and mixes everybody at 100%

Expected behavior the server should mix everybody at the individually set levels

Operating system linux, windows, macos

Version of Jamulus 3.6.2 and before

gilgongo commented 3 years ago

3.6.2 and before

Just a to clarify - does that imply you've reproduced this in earlier versions, or just by inference?

dingodoppelt commented 3 years ago

@gilgongo: not everybody was on 3.6.2 when we noticed (around 17 clients on the server). and i didn't ask for every version because i think the code concerning the levels hasn't been touched in the last few releases (correct me, if i'm wrong)

hoffie commented 3 years ago

I don't think the server has any way to store this information (currently). So the two options to improve this would be:

Store this state server-side and reload it upon reconnect. This may have the disadvantage that we could only match on IP and/or name. At least the name may well be not that unique ("No Name")
Detect the server reconnection on the client side and make the client resend the mixing wishes to the server

gilgongo commented 3 years ago

Also, when you say "restarting" do you mean a kill -s SIGUSR1/2 as part of recording, or a kill -HUP (or stop/start)?

dingodoppelt commented 3 years ago

@gilgongo by "restarting" i mean restarting the systemd service: "systemctl restart jamulus.service" @hoffie i'd go for the client side implementation since it can already send levels but doesn't seem to do it on a restart. would we need another protocol message signaling a reconnect? i think "restarting" means quitting and starting before the clients time out so a "restart" scenario isn't handled. could it be the client doesn't notice it is being reconnected (because it doesn't get disconnected) and therefore doesn't update the fader gains? if the server made a request to the client to send the gains on connect it should work even on a restart.

pljones commented 3 years ago

UDP is a connectionless protocol. Stopping the server cannot be directly detected by the client. One of two things needs to happen: either the server says "I'm disconnecting you" or the client says "It's been rather a long time since I heard from the server". The former only happens with --discononquit. The latter is a generous amount of time.

When the server restarts, the server has no way - currently - of knowing what clients were connected. It just waits for them to start talking to it again. Hence the generous amount of time they give it to recover.

What it sounds like we need here then is:

A protocol message from the server to the client to say "Hey, I've not seen you before, even if you think I have - send state!"
The client to understand that message and do so.

Now, there may be an existing message the server sends to the client when it first hears from it - can't remember... but I don't think the client's prepared to suddenly be told "You need to send state" when it thinks everything's okay.

gene96817 commented 3 years ago

Regarding "a generous amount of time".... the server is always sending the audio mix... this can serve as a keep-alive... the server is up unless there is no audio mix output

I am assuming the audio mix output exists as long as there is at least one user connected... If it is possible for all users to be quiet and there is no mix output, then the server has to a periodic quiet channel keep-alive message, saying the channel is quiet and all is well,

I agree that the server needs a "send state" request if it receives a message from an unexpected client.

pljones commented 3 years ago

Regarding "a generous amount of time".... the server is always sending the audio mix... this can serve as a keep-alive... the server is up unless there is no audio mix output

I'm pretty sure that's what's used -- audio is the only thing that's meant to run constantly. If it stops, you know the server's "not there"... but it could be one packet, two packets... twenty packets... buffer now seriously under running... Cloud servers sometimes do get swapped out by long running high priority tasks on the same metal - you'll see that "ping" suddenly go from 30ms to 500ms (or the maximum reported)... but the client hangs on in there, waiting for things to settle. And often after a few tens of seconds, they do. The server hasn't noticed anything (maybe a few dropped packets on its side, too, of course).

The problem only comes when the process terminates - i.e. the server stops existing. That doesn't normally happen. It's either a deliberate act (i.e. "stop the server", perhaps as part of "restart the server") or it's due to a crash. Some crash-type failures may result in an automatic restart.

But yes, simple solution: server tells client "No, no, believe me: I don't know you (any more)" and so client sends state.

hoffie commented 3 years ago

I'm wondering if we could (ab)use some existing side effect in the protocol to detect a server restart, e.g.:

Using the Welcome message (a chat message with WELCOME_MESSAGE_PREFIX); disadvantage: would only work on servers with a welcome message
Attempting to detect some packet counter wrap; not sure if reliable enough, haven't checked

gene96817 commented 3 years ago

We should select a max time with no server transmissions to declare the server down and restart. It is almost arbitrary.

@hoffie When the server restarts, there will be no destination to send a restart message unless the server has some persistent data of connected clients just prior to the restart. What information persist across a restart?

hoffie commented 3 years ago

When the server restarts, there will be no destination to send a restart message unless the server has some persistent data of connected clients just prior to the restart.

Yes, that's true at the time of the restart (and it will be true for all other proposed detection mechanisms). The reason why it could work is that clients will "reconnect" quickly (technically: they will probably not notice that they had not been connected and will continue to behave as if nothing happened).

We would explicitly be making use of the fact that the server noticed knows that it was restarted. We would only have to make clients recognize this an act like it was a fresh connection.

Have you tried it? In a quick test, restarting a server will cause a new welcome message to be sent to existing clients.

pljones commented 3 years ago

I think there are some things the client "knows" to tell the server on a fresh connect and some things the server knows to tell the client, currently. What we have here is the server "knowing" it's a fresh connect -- but the client not having realised.

So we need a way to make it clear to the client "something happened you weren't expecting" and it then needs to act appropriately.

hoffie commented 3 years ago

Looks like there are already several opportunities in the server code which signal the client that it was reconnected (= is considered to be a new client by the server):

https://github.com/jamulussoftware/jamulus/blob/7d0801a1578afc41803ff30a33327ed5e38d8269/src/server.cpp#L593

pljones commented 3 years ago

Interestingly, some of the messages explicitly allows for getting into an unexpected state https://github.com/jamulussoftware/jamulus/blob/7d0801a1578afc41803ff30a33327ed5e38d8269/src/server.cpp#L605-L610,619-626

So the problem is reduced -- the server knows to try to poke at the client for the information it needs, it's just the client side. What does it decide to do on a ClientIDMes arriving? Seems to be this... https://github.com/jamulussoftware/jamulus/blob/7d0801a1578afc41803ff30a33327ed5e38d8269/src/client.cpp#L710 which just triggers the UI to do something (probably not resend current values).

Potentially CClient::OnClientIDReceived could do the work but I'm wary as the style seems not to follow that idiom...

passing commented 3 years ago

since this a very rare event: can't the server just (re)start with all channels on 0% until it gets the actual values from the clients? (assuming the clients will periodically send their levels)

pljones commented 3 years ago

(assuming the clients will periodically send their levels)

They don't.

hoffie commented 3 years ago

Another thought after reading some code in this area: Could it be that the necessary code for sending the proper fader levels to the server is there, but is not triggered due to the following mismatch between server and client state?

Server is running, Client A and Client B are connected.
Server is restarted and stops knowing Client A and Client B.
Server sees traffic from Client A, assumes it has just connected and will perform the connection "handshake".
Server sees traffic from Client B, assumes it has just connected and will perform the connection "handshake". In addition, the server will inform Client A that B connected and will inform B that A was already there.
Client A ignores this information because its state does not indicate any change (it knew Client B already). It will therefore not send the intended gain level to the server. Same for client B.

If this is true, the fix might be a change in CAudioMixerBoard::ApplyNewConClientList. This still has to be verified (Wireshark?) and checked regarding possible side effects.

pljones commented 3 years ago

because its state does not indicate any change

If so, that shouldn't happen - I think any such check could be removed. I don't think that's it, though - I think the client only sends on connect or on change. Because the client's have no need to pass each other level information (with the exclusion of the recent addition of the "has me muted" indicator).

hoffie commented 3 years ago

If so, that shouldn't happen - I think any such check could be removed. I don't think that's it, though - I think the client only sends on connect or on change. Because the client's have no need to pass each other level information (with the exclusion of the recent addition of the "has me muted" indicator).

Yeah, my suspicion was that the highlighted changed is not properly noticed by the client. The following test seems to contradict my suspicion:

Set up tcpdump -C 32 -i lo -nn -p -s0 -w /tmp/jam.pcap '(ip[6:2]&0x3fff) != 0 or (udp portrange 22120-22139 and (udp[8:2] == 0 and udp[4:2]-17 == (udp[14]<<8)+udp[13]))'
Run a Jamulus server on localhost
Run a Jamulus client (Client_A) on localhost and connect to the server
Run a Jamulus client (Client_B) on localhost and connect to the server
Stop the server
Wait some seconds
Start the server
Stop tcpdump

I then analyzed the resulting dump (thanks a lot to @softins for the tcpdump filter example and the wireshark dissector. This is a huge help!).

Feb 14, 2021 12:41:29.978483000 CET 22124   22218   CLIENT_ID (0)
Feb 14, 2021 12:41:29.991517000 CET 22218   22124   CHANNEL_INFOS (Germany, None, Not_Set, Client_B)
Feb 14, 2021 12:41:30.206784000 CET 22218   22124   CHANNEL_INFOS (Germany, None, Not_Set, Client_B)
Feb 14, 2021 12:41:30.219947000 CET 22218   22124   CHANNEL_GAIN ([0] => 100%)
Feb 14, 2021 12:41:30.220012000 CET 22218   22124   CHANNEL_GAIN ([0] => 3%)
Feb 14, 2021 12:41:37.570514000 CET 22124   22217   CLIENT_ID (1)
Feb 14, 2021 12:41:37.584326000 CET 22217   22124   CHANNEL_INFOS (Germany, None, Not_Set, Client_A)
Feb 14, 2021 12:41:37.585077000 CET 22218   22124   CHANNEL_GAIN ([1] => 100%)
Feb 14, 2021 12:41:37.587252000 CET 22218   22124   CHANNEL_GAIN ([1] => 3%)
Feb 14, 2021 12:41:37.781320000 CET 22217   22124   CHANNEL_INFOS (Germany, None, Not_Set, Client_A)
Feb 14, 2021 12:41:37.800390000 CET 22217   22124   CHANNEL_GAIN ([0] => 100%)
Feb 14, 2021 12:41:37.800470000 CET 22217   22124   CHANNEL_GAIN ([0] => 100%)
Feb 14, 2021 12:41:37.800518000 CET 22217   22124   CHANNEL_GAIN ([1] => 24%)

# Server restart

Feb 14, 2021 12:41:48.214739000 CET 22124   22217   CLIENT_ID (0)
Feb 14, 2021 12:41:48.214865000 CET 22124   22218   CLIENT_ID (1)
Feb 14, 2021 12:41:48.215131000 CET 22217   22124   CHANNEL_INFOS (Germany, None, Not_Set, Client_A)
Feb 14, 2021 12:41:48.215287000 CET 22218   22124   CHANNEL_INFOS (Germany, None, Not_Set, Client_B)
Feb 14, 2021 12:41:48.215468000 CET 22217   22124   CHANNEL_GAIN ([0] => 24%)
Feb 14, 2021 12:41:48.215776000 CET 22218   22124   CHANNEL_GAIN ([0] => 3%)
Feb 14, 2021 12:41:48.217994000 CET 22217   22124   CHANNEL_GAIN ([1] => 100%)
Feb 14, 2021 12:41:48.219890000 CET 22218   22124   CHANNEL_GAIN ([1] => 3%)

So... protocol-wise everything seems fine to me in this case: The clients do properly send their intended gain levels to the server. I did not verify how the audio behaves in this case, but I think the test shows that:

Either everything works fine or I'm missing some detail for it to break.
Or it only breaks with certain chance / timing (maybe my restart was too slow?)
Or the breakage is within Jamulus server state (i.e. the CHANNEL_GAIN messages are not processed properly). I will look into this part again.

Edit: Attached the pcap, in case someone wants to check themselves. jam.pcap.log

Edit 2: Seems like I can reproduce the issue partially: After a server restart, sometimes only one CHANNEL_GAIN message per client can be observed. This would explain the behavior from the issue description although I'm still unsure about the reason. Timing may play a role, repeated restarts may play a role. Will continue investigating as time permits.

hoffie commented 3 years ago

By now I'm pretty confident that my hypothesis is correct, i.e. that

that Jamulus server already sends the necessary protocol messages in order to get clients to send their gain levels,
that Jamulus clients already have the necessary code to send the gain levels to the server in such cases and
that this currently does not work reliably on server restarts because the audio mixer board will not send gain levels to the server if it thinks that nothing has changed as seen here: https://github.com/jamulussoftware/jamulus/blob/6d79bf1b0c8a1fc1eb9bf5a4d9b1eb70d6cbca44/src/audiomixerboard.cpp#L1178

So, essentially, this is a race condition:

If, after a server restart, clients reconnect in exactly the same order as they did initially, they will get the same client ids, no name change will be triggered in the audio mixer board and updated gain levels will not be sent. No client will be mixed at the right level.
If, after a server restart, clients reconnect in a slightly different order than they did initially, some will get new client ids, a name change is detected and updated gain levels are sent. Some clients will be mixed at the right level.
If, after a server restart, all clients reconnect in a different order than they did initially, everyone will get a new client id, a name change is detected and updated gain levels are sent for all clients. Therefore, all clients will be mixed at the right level.

Two possible fixes come to mind:

Client-side: Always resend gain levels when CONN_CLIENTS_LIST updates are received from the server. #1009 implements that.
Server-side: Send connecting clients an empty CONN_CLIENTS_LIST in order to signal a reconnect and make them reset their local channel list. The next (regular) CONN_CLIENTS_LIST message will then be handled as if those channels were new, i.e. gain levels would be properly updated. #1010 implements that.

I have verified this with the following tests: Test 1: Vanilla Jamulus client/server, reconnect in initial order test1-vanilla-reconnect-in-normal-order.pcap.log

Run a Jamulus server
Connect Client A
Connect Client B
Prevent Client B from connecting (iptables -I OUTPUT -p udp --sport 22218 -j DROP)
Restart Server
Let Client B reconnect (iptables -D OUTPUT -p udp --sport 22218 -j DROP)
Observe that both clients get their initial client ids and that no proper gain level updates are sent.

Test 2: Vanilla Jamulus client/server, reconnect in opposite order test2-vanilla-reconnect-in-reverse-order.pcap.log

Run a Jamulus server
Connect Client A
Connect Client B
Prevent Client A from connecting (iptables -I OUTPUT -p udp --sport 22217 -j DROP)
Restart Server
Let Client A reconnect (iptables -D OUTPUT -p udp --sport 22217 -j DROP)
Observe that both clients get a new client id and that proper gain level updates are sent.

Test 3: Vanilla Jamulus server with modified Jamulus client (PR #1009), reconnect in initial order test3-modified-client-reconnect-in-normal-order.log

Run a Jamulus server
Connect modified Client A
Connect modified Client B
Prevent Client B from connecting (iptables -I OUTPUT -p udp --sport 22218 -j DROP)
Restart Server
Let Client B reconnect (iptables -D OUTPUT -p udp --sport 22218 -j DROP)
Observe that both clients get their initial client ids and that still proper gain level updates are sent.

Test 4: Vanilla Jamulus client with modified Jamulus server (PR #1010), reconnect in initial order test4-modified-server-reconnect-in-normal-order.pcap.log

Run a modified Jamulus server
Connect Client A
Connect Client B
Prevent Client B from connecting (iptables -I OUTPUT -p udp --sport 22218 -j DROP)
Restart Server
Let Client B reconnect (iptables -D OUTPUT -p udp --sport 22218 -j DROP)
Observe that both clients get their initial client ids and that still proper gain level updates are sent.

To move this forward, we would have to do the following:

Ensure that the one of the fixes really solves the problem when listening to audio (I have only done tests at the protocol level). Can you try both, @dingodoppelt?
Decide which fix to use.
Ensure that there are no unintended side effects if going forward with the client-side fix:
- The initial commit which introduced the problematic client-side logic was 471fb0c6a, so I don't see a specific reason (based on Git history) why the code is like it is. I suspect that it was considered an optimization to avoid updating gain levels if (seemingly) nothing had changed.
- One side effect will be that updated gain levels are sent more often (each time a new client list is distributed to the client). I don't think that this will happen really often and I don't think there is a problem if this causes some redundant protocol messages as the overhead should be small.
Ensure that there are no unintended side effects with the server-side fix (i.e. it must not break (old) clients in any way).
Actual tests should be done with both approaches.
Can someone familiar with the protocol/code judge the different approaches (@softins, @pljones, @ann0see)?

I suggest that general discussion on this issue is kept here while discussion about correctness, upsides and downsides of the proposals are directed to the relevant PRs which are linked above.

softins commented 3 years ago

Great work @hoffie! I agree with the need to do more testing, but I have examined both PRs and the changes look good.

One test is missing from the above: modified Jamulus client (PR #1009) with modified Jamulus server (PR #1010)

For maximum compatibility, my initial view is that both PRs should be applied. That would improve the issue with both older clients on a new server, and with newer clients on an old server. Provided there is no problem with new clients talking to a new server.

One side effect will be that updated gain levels are sent more often (each time a new client list is distributed to the client). I don't think that this will happen really often and I don't think there is a problem if this causes some redundant protocol messages as the overhead should be small.

I would think that the bandwidth consumed in this case is much less than that used by the audio streams, and so not significant.

pljones commented 3 years ago

That's a fair point, @softins, and one I'd not thought of -- having both may well improve matters significantly more than just the one.

Hopefully at some point the "random pan" bug that affects certain combinations of client and server will get squashed in the cross-fire of fixes like this, too.

hoffie commented 3 years ago

@softins I have done the test you suggested and it looks good (as expected). When combining both fixes, the client-side fix becomes irrelevant in the scenario. Of course, it will stay relevant when connecting to servers without the fix.

I would think that the bandwidth consumed in this case is much less than that used by the audio streams, and so not significant.

Yeah, I fully agree, just wanted to mention it. :)

Hopefully at some point the "random pan" bug that affects certain combinations of client and server will get squashed in the cross-fire of fixes like this, too.

Don't know what specific bug you refer to, but I want to add that my PRs intend to fix the gain bug mentioned in this issue, but in practice they should do even more: They will force the client to re-trigger sending all mixer properties. This should cover gain, pan, mute, solo and group (the latter two indirectly as they seem to be handled via mute protocol messages?). So, there is a real chance that the bug you are referring to may be fixed by this as well.

I guess there is agreement that more testing is needed, especially for the client-side fix where I'm really unsure about possible side effects. However, I fail to come up with useful test cases. I might just run my own Jamulus stuff with both fixes applied and will try to check if anything breaks.

Test 5: Modified Jamulus client (PR #1009) with modified Jamulus server (PR #1010), reconnect in initial order test5-modified-server-and-client-reconnect-in-normal-order.pcap.log

Run a modified Jamulus server
Connect modified Client A
Connect modified Client B
Prevent Client B from connecting (iptables -I OUTPUT -p udp --sport 22218 -j DROP)
Restart Server
Let Client B reconnect (iptables -D OUTPUT -p udp --sport 22218 -j DROP)
Observe that both clients get their initial client ids and that still proper gain level updates are sent.

houbsta commented 3 years ago

Sorry to jump on this thread, I have searched for other tickets and maybe this is closest.

Does the behaviour described mix everyone at 100% or at "new client" level? And is there a separate subject anywhere about setting default new client level to something like 70% instead of 100% (which seems a poor default)?

softins commented 3 years ago

Does the behaviour described mix everyone at 100% or at "new client" level?

I'm not the expert on this issue, but I think it must be 100%. The "new client" level is a client-side setting, but this issue is that when the server is restarted, it doesn't know each client's fader settings. I assume the 100% is an initialised default within the server before it receives the actual setting from the client.

And is there a separate subject anywhere about setting default new client level to something like 70% instead of 100% (which seems a poor default)?

Don't know, sorry. Raise it as an issue, and hopefully that will attract discussion.

hoffie commented 3 years ago

I've just gotten around to finishing tests with real audio.

I ran two Jamulus clients which connect to a local server. One client (A) fed music into the server (and has himself muted in its mix). I was listening to another Jamulus client (B). I had set the level of the music-providing client (A) really low in B's mix to spot any differences after restarts.

When I restart the server based on current master code, I can reproduce the suddenly-at-100% issue.
When I restart a server with the fix from #1010, mixing level and mute are set properly.
When I restart a server with the fix from #1010 and replace one (or both) clients with a version with #1009, mixing level and mute are set properly.
When I restart a server based on current master code and use one (or both) clients built from a version with #1009, mixing level and mute are set properly.

I have not noticed any downsides of either PRs.

I therefore agree that both PRs should be merged. Anyone hitting this issue can upgrade their server, their client or both to profit from the fix.

jamulussoftware / jamulus

Restarting the server on the fly ignores set levels and mixes everybody at 100% #955