jamulussoftware / jamulus

Jamulus enables musicians to perform real-time jam sessions over the internet.
https://jamulus.io

Server performance & optimization #455

Closed WolfganP closed 3 years ago

WolfganP commented 4 years ago

Follows from https://github.com/corrados/jamulus/issues/339#issuecomment-657076545 for better focus of the discussion.

So, as the previous issue started to explore multi-threading on the server for better use of resources, I first ran a profile of the app on Debian.

Special build with: qmake "CONFIG+=nosound headless noupcasename debug" "QMAKE_CXXFLAGS+=-pg" "QMAKE_LFLAGS+=-pg" -config debug Jamulus.pro && make clean && make -j

Then I ran it as below and connected a couple of clients for a few seconds: ./jamulus --nogui --server --fastupdate

Once the clients had disconnected, I gracefully killed the server: pkill -sigterm jamulus

And finally I ran gprof, with the results posted below: gprof ./jamulus > gprof.txt

https://gist.github.com/WolfganP/46094fd993906321f1336494f8a5faed

It would be interesting to see those who observed high CPU usage run test sessions and collect profiling information as well, to detect bottlenecks and potential code optimizations before embarking on multi-threading analysis that may require major rewrites.

corrados commented 4 years ago

@brynalf One question to your tests: Have you changed the following define: https://github.com/corrados/jamulus/blob/master/src/global.h#L180 (#define MAX_NUM_CHANNELS) to a larger number like, e.g., 200? Otherwise the client/server do not support more than 100 connections.

brynalf commented 4 years ago

Yes, to 256 for all the tests.

storeilly commented 4 years ago

Also an interesting observation: with max clients at 256 and connected clients >100, the server shows "server full" on the connection setup window but still accepts new connections. I thought it might be that we needed to edit another place as well.

On Wed 2 Sep 2020, 20:06 brynalf, notifications@github.com wrote:

Yes. to 256


corrados commented 4 years ago

I have just committed https://github.com/corrados/jamulus/commit/c9c5e41456609bb7573d0d6577bf21c7ea4b72d7 which possibly solves https://github.com/corrados/jamulus/issues/547.

@storeilly Could you please update your server "Jam mt 26" (port 22126; multi-threading enabled) with the newest Git code so I can test if the fix really solves the problem?

One more question: What is the maximum up/down stream rate of your Jam mt 26 server? When I did a test with close to 100 clients, I noticed that I had a lot of protocol retransmissions which could indicate that we hit a network traffic limit.

corrados commented 4 years ago

with max clients at 256 and connected clients >100 the server shows server full on the connection setup window but still accepts new connections.

That is an interesting effect. I have checked the code but could not find an obvious error causing this behavior. In the newest Jamulus client the maximum number of allowed clients is shown (e.g. 2/10). Was the number after the "/" shown correctly? So that, as a result, just the "full" text was shown too early?

Edit: I just saw in the code that if "(full)" is shown, the maximum number of supported clients is not shown, i.e. the number after the "/".

storeilly commented 4 years ago

@storeilly Could you please update your server "Jam mt 26" (port 22126; multi-threading enabled) with the newest Git code so I can test if the fix really solves the problem?

Thanks @corrados. I have added a new instance "Jam 69c5e41 31" built with this tag, edited the version and the max users as before to 255, and ran it with -T on port 22131. I have also implemented tcpdump as per notes from @softins (thank you very much, Tony), so just let me know the dates and times you want the capture files and I'll post them here (or send them wherever).

I've left Jam mt 26 running on port 22126 if you want to compare experience

One more question: What is the maximum up/down stream rate of your Jam mt 26 server? When I did a test with close to 100 clients, I noticed that I had a lot of protocol retransmissions which could indicate that we hit a network traffic limit.

The limit on the machine is 7TB according to AWS Lightsail. The only instances on the machine are the ones you see on ports 22124-22131, so keep an eye out in case someone is load testing the other instances at the same time; not that that is likely to ever hit 7TB!

Thanks again sir! S

corrados commented 4 years ago

Thank you!

I have added a new instance Jam 69c5e41 31 built with this tag

I cannot find tag 69c5e41. The commit ID of my change was c9c5e41. Since I changed something today, the last commit on master is ef0c5d7. Is this 69c5e41 a commit on your fork?

The limit on the machine is 7TB according to AWS Lightsail

I assume the 7TB is the total amount of traffic you can have per month. What I am interested in is something different. It is the maximum stream rate you can have per up-/downstream. E.g., on my virtual server I have 100 Mbps for upstream. So I need a number with the unit "Mbps". Do you know that number for your server?

storeilly commented 4 years ago

Doh!! no that is a typo, see git status below, sorry, I typed the server name manually!!

    HEAD detached at c9c5e414
    Changes not staged for commit:
      (use "git add ..." to update what will be committed)
      (use "git checkout -- ..." to discard changes in working directory)

    modified:   Jamulus.pro
    modified:   src/global.h
storeilly commented 4 years ago

I assume the 7TB is the total amount of traffic you can have per month. What I am interested in is something different. It is the maximum stream rate you can have per up-/downstream. E.g., on my virtual server I have 100 Mbps for upstream. So I need a number with the unit "Mbps". Do you know that number for your server?

You are correct. I can't find that in the docs, so I ran a speedtest from the server... disappointing results below!

    ubuntu@ip-xxx-xxx-xxx-xxx:~$ speedtest
    Retrieving speedtest.net configuration...
    Testing from Amazon.com (52.49.128.29)...
    Retrieving speedtest.net server list...
    Selecting best server based on ping...
    Hosted by Airspeed Telecom (Dublin) [1.78 km]: 1.74 ms
    Testing download speed...
    Download: 971.57 Mbit/s
    Testing upload speed...
    Upload: 4.17 Mbit/s

maallyn commented 4 years ago

Here is the speedtest from the server that I have been doing my testing on, which is newark-music.allyn.com at the Linode data center in Newark, NJ in the U.S.

    root@bicycle-bellingham:~# speedtest
    Retrieving speedtest.net configuration...
    Retrieving speedtest.net server list...
    Testing from Linode (172.104.29.25)...
    Selecting best server based on latency...
    Hosted by Kansas Research and Education Network (Wichita, KS) [43.14 km]: 57.39 ms
    Testing download speed...
    Download: 1161.19 Mbit/s
    Testing upload speed...
    Upload: 86.58 Mbit/s

I hope this helps.

storeilly commented 4 years ago

OK, update: better numbers with the latest version of speedtest...

    Retrieving speedtest.net configuration...
    Testing from Amazon.com (x.x.x.x)...
    Retrieving speedtest.net server list...
    Selecting best server based on ping...
    Hosted by Airspeed Telecom (Dublin) [1.78 km]: 1.973 ms
    Testing download speed...
    Download: 930.18 Mbit/s
    Testing upload speed...
    Upload: 942.43 Mbit/s

corrados commented 4 years ago

Upload: 942.43 Mbit/s

That number looks very good :-). Ok, so now we know that we are not bandwidth limited but CPU limited if I see issues with the protocol when a lot of clients are connected. I assume that the socket thread is blocked by the audio threads so that it cannot handle any protocol messages in time anymore.

Yesterday I did a test where I used 100 clients with mono (instead of stereo) and that worked fine. I noticed that the jitter was increased, which forced my jitter buffers to go higher than usual (if only one client is connected, it goes to 2/2, and with 100 mono clients connected it goes to 5/5), but still the audio was perfect. So it would be interesting if we could reproduce this in a real-life session. Would it be possible to get a choir of about 100 musicians to try out your multi-threaded Jamulus server, to see whether a choir rehearsal is possible when all clients use mono?

corrados commented 4 years ago

Doh!! no that is a typo

I tried your new server today and I could see the same issue https://github.com/corrados/jamulus/issues/547 on that new version as well. So unfortunately my change did not solve the issue. So I'll investigate further now.

storeilly commented 4 years ago

Doh!! no that is a typo

I tried your new server today and I could see the same issue #547 on that new version as well. So unfortunately my change did not solve the issue. So I'll investigate further now.

I've captured the jamulus log and the tcpdump files covering the connection time here if it helps. https://www.dropbox.com/sh/9dwy7h97lvf1l9q/AAAZXpn7YaOTeIlP15zZo6ixa?dl=0 @softins

softins commented 4 years ago

I tried your new server today and I could see the same issue #547 on that new version as well. So unfortunately my change did not solve the issue. So I'll investigate further now.

I've captured the jamulus log and the tcpdump files covering the connection time here if it helps.

I've fetched those files and just having a look...

corrados commented 4 years ago

@storeilly Could you please do another test if possible? I have to find out if the protocol timer causes the issue. Could you please uncomment the following line: TimerSendMess.stop(); -> https://github.com/corrados/jamulus/blob/master/src/protocol.cpp#L481

So that the resulting code looks like this:

    Mutex.lock();
    {
        // we have to check that list is not empty, since in another thread the
        // last element of the list might have been erased
        if ( !SendMessQueue.empty() )
        {
            vecMessage.Init ( SendMessQueue.front().vecMessage.Size() );
            vecMessage = SendMessQueue.front().vecMessage;

            // start time-out timer if not active
            if ( !TimerSendMess.isActive() )
            {
                TimerSendMess.start ( SEND_MESS_TIMEOUT_MS );
            }

            bSendMess = true;
        }
        else
        {
            // no message to send, stop timer
// TimerSendMess.stop();
        }
    }
    Mutex.unlock();
storeilly commented 4 years ago

@storeilly Could you please do another test if possible? I have to find out if the protocol timer causes the issue. Could you please uncomment the following line: TimerSendMess.stop(); -> https://github.com/corrados/jamulus/blob/master/src/protocol.cpp#L481

So that the resulting code looks like this:

    Mutex.lock();
    {
        // we have to check that list is not empty, since in another thread the
        // last element of the list might have been erased
        if ( !SendMessQueue.empty() )
        {
            vecMessage.Init ( SendMessQueue.front().vecMessage.Size() );
            vecMessage = SendMessQueue.front().vecMessage;

            // start time-out timer if not active
            if ( !TimerSendMess.isActive() )
            {
                TimerSendMess.start ( SEND_MESS_TIMEOUT_MS );
            }

            bSendMess = true;
        }
        else
        {
            // no message to send, stop timer
// TimerSendMess.stop();
        }
    }
    Mutex.unlock();

I think you mean "comment line 481" not "uncomment".. Edited code below... and server running now on port 22132

    void CProtocol::SendMessage()
    {
        CVector<uint8_t> vecMessage;
        bool             bSendMess = false;

        Mutex.lock();
        {
            // we have to check that list is not empty, since in another thread the
            // last element of the list might have been erased
            if ( !SendMessQueue.empty() )
            {
                vecMessage.Init ( SendMessQueue.front().vecMessage.Size() );
                vecMessage = SendMessQueue.front().vecMessage;

                // start time-out timer if not active
                if ( !TimerSendMess.isActive() )
                {
                    TimerSendMess.start ( SEND_MESS_TIMEOUT_MS );
                }

                bSendMess = true;
            }
            else
            {
                // no message to send, stop timer
    //          TimerSendMess.stop();
            }
        }
        Mutex.unlock();

        if ( bSendMess )
        {
            // send message
            emit MessReadyForSending ( vecMessage );
        }
    }

corrados commented 4 years ago

Wow, that was quick, thanks :-)

I think you mean "comment line 481" not "uncomment".

Yes, it should read "comment".

I just did a test and the issue is still there even with your modified code. So now I know that the timer is not the issue. So I have to look at another place for finding the real cause of the issue.

I've fetched those files and just having a look...

Let's see if you can find something which could explain the cause of the issue.

softins commented 4 years ago

Not quite sure what I'm looking for. What is the symptom? Currently looking at the test from 6:39 BST this morning (07:39 in Germany).

corrados commented 4 years ago

Symptom: I first connect to the server with my single Windows client. Then I start 58 dummy clients on my Linux server so the total number of clients at the server is 59. On my Windows client, I can see that for two mixer faders there are no name tags received. When I disconnect the 58 dummy clients, the mixer panel on my Windows client still shows all faders. If I type in a chat text in the Windows client, I do not get a response. But when I connect with another Windows client, that second Windows client receives the chat text. So for the first Windows client it seems that it cannot receive any more protocol messages after the 58 dummy clients had connected to the server.

Note that when the first Windows client sends protocol messages to the server, I get an acknowledge packet back. So that protocol path client->server seems to work but the path server->client does no longer work.

softins commented 4 years ago

Great, thanks. I can see the single windows client. I'll concentrate on the traffic between it and the server, and let you know if I find any clues.

softins commented 4 years ago

Is it possible in your dummy clients to provide a different name for each client connection? At the moment, they are all "dummytest". If a serial number could be added, it might be useful for identifying whether a missing name is due to a server issue or to a missing CHANNEL_INFOS from the client.

I observed that as soon as a client starts sending audio (as Jamulus does not have a CONNECT message), the client is added to the list and a new CONN_CLIENTS_LIST is sent to each client, with an empty name for the new client. Once a CHANNEL_INFOS is received from the client, the name is filled in and the CONN_CLIENTS_LIST is resent. So when it first got to 59 clients, the CONN_CLIENTS_LIST send to the Windows client was only 1098 bytes long, and many of the clients had no name. It kept getting sent again, with longer and longer packets as the names got filled in. Eventually it settled at a packet length of 1512 and didn't increase after that. Examining one of those revealed that client 43 still had no name. Without distinct client names I can't tell whether that was due to a dropped client message or a server processing error.

storeilly commented 4 years ago

I'm leaving the tcpdump running with smaller file size of 32MB for easier uploading. If you want to do another "hit" just let me know what time you did it and I'll upload the files to the same dropbox folder. I've put a few more there tonight along with the logfile from server 9

softins commented 4 years ago

At 05:39:43 UTC, while the server still said it had 59 clients connected, the server received sixteen CLM_DISCONNECTION messages from the Windows client over a period of 100ms. This reduced the client count to 58. The server didn't start to receive CLM_DISCONNECTION messages from the Linux clients until 9 seconds later, at 05:39:52 UTC. A lot of disconnect messages were received over the next 2 seconds. No CONN_CLIENTS_LIST messages were sent by the server after that (as you would expect).

I wonder whether it would help reduce the amount of traffic if the following technique were used:

  1. When the channel list has changed, if a client has not received a list in the last 1 second (timer not running), send the list immediately, and start a client-specific timer for 1 second.
  2. If the timer is running when an update to the channel list has happened, set a flag for that client to say the client list has changed.
  3. When the timer expires, if the flag is set, send a clients list now and restart the timer. Don't start the timer if the flag was not set. This way, the client would only receive one update every 1 second, if many clients were joining or leaving at once. This could be tweaked to 500ms or something if preferred.

The above has been edited to improve on my original suggestion

softins commented 4 years ago

I'm leaving the tcpdump running with smaller file size of 32MB for easier uploading. If you want to do another "hit" just let me know what time you did it and I'll upload the files to the same dropbox folder. I've put a few more there tonight along with the logfile from server 9

@storeilly if you are running multiple servers on that machine with different port numbers, maybe you could also modify the tcpdump to capture only the port of the test server? You appear to have several other Jamulus servers running on different ports.

WolfganP commented 4 years ago

Is it possible in your dummy clients to provide a different name for each client connection? At the moment, they are all "dummytest". If a serial number could be added, it might be useful for identifying whether a missing name is due to a server issue or to a missing CHANNEL_INFOS from the client.

As per https://github.com/corrados/jamulus/issues/530#issuecomment-677279884 , the client name cannot be customized from the command line (--clientname only has effect on the client window title and jack connection name). I proposed to override the profile name if --clientname was specified at the command line at https://github.com/corrados/jamulus/pull/162#issuecomment-682006515 exactly with this automated testing use case in mind.

Anyways, the client/profile name being submitted to the server is read from the ini file, and is base64 encoded (which means temp ini files can be automatically generated by a script, but it will require additional shell trickery :-)

<name_base64>UFHHLIBwYw=</name_base64>

storeilly commented 4 years ago

@storeilly if you are running multiple servers on that machine with different port numbers, maybe you could also modify the tcpdump to capture only the port of the test server? You appear to have several other Jamulus servers running on different ports.

@softins, they are all different variations on the multi-threaded builds; they are all test servers really. @corrados, can I shut any of these down if they are no longer useful in the performance and optimization quest? I'm happy to leave them running if needed as reference. They are the ones listed at https://github.com/corrados/jamulus/issues/455#issuecomment-682129226, plus two additional servers: "Jam c9c5e41 31" (port 22131, build c9c5e41) and "Jam c9c5e41_e 32" (port 22132, build c9c5e41 with edits as per https://github.com/corrados/jamulus/issues/455#issuecomment-687629236).

I can move them to my private chain but will need to give you in a PM my central IP!

corrados commented 4 years ago

Can I shut any of these down if they are no longer useful in the performance and optimization quest?

I was using "Jam st 24" (port 22124; single threaded) and "Jam mt 26" (port 22126; multi-threading enabled) for my tests. For debugging the bug, this is sufficient for now.

I wonder whether it would help reduce the amount of traffic if the following technique were used

Sure, you can reduce the protocol traffic with some smart implementations. I already started doing this with, e.g., this commit: https://github.com/corrados/jamulus/commit/bdd2c09b6e1dfa48210c8ad165d60272c0ac2ae9. But we should not mix things here. We should focus on the protocol issue for now, since Jamulus gets in a state where the protocol mechanism gets stuck and does not recover until a re-connect. This is a bug and should be solved. After that bug is solved, we can look into optimizing the protocol if necessary.

If a serial number could be added, it might be useful for identifying whether a missing name is due to a server issue or to a missing CHANNEL_INFOS from the client.

I am pretty sure it is just the protocol communication between the server and the Windows client. If I connect with another client to the same server just after all 58 dummy clients have connected, the audio mixer board is completely shown with all names.

Eventually it settled at a packet length of 1512 and didn't increase after that. Examining one of those revealed that client 43 still had no name.

I assume you are talking about the communication between the server and this single Windows client, right? It seems at some point the server does not send any more protocol messages to that Windows client. It would be good to see what happens at that point. This packet with size 1512, is this repeated over and over again? So that it seems the client cannot receive this packet?

I think what happens for the server-to-Windows-client communication at that moment is the key to understand what goes wrong.

corrados commented 4 years ago

@storeilly I have implemented some extensive logging for the protocol now. I have created a new branch for it: https://github.com/corrados/jamulus/tree/logging_for_bugfix and also a new tag for it: extlogging (https://github.com/corrados/jamulus/releases/tag/extlogging). The additional log goes in the normal Jamulus log file and looks like this:

0: Enqueue iID: 32, iCnt: 0, wasempty: 1
0: SendMessage iID: 32, iCnt: 0
0: Enqueue iID: 21, iCnt: 1, wasempty: 0
0: Enqueue iID: 11, iCnt: 2, wasempty: 0
0: Enqueue iID: 23, iCnt: 3, wasempty: 0
0: Enqueue iID: 29, iCnt: 4, wasempty: 0
0: Enqueue iID: 33, iCnt: 5, wasempty: 0
2020-09-06 14:23:20, 127.0.0.1, connected (1)
0: Ackn iID: 32, iRecCounter: 0, bSendNextMess: 1
0: SendMessage iID: 21, iCnt: 1
0: Enqueue iID: 10, iCnt: 6, wasempty: 0
0: Ackn iID: 21, iRecCounter: 1, bSendNextMess: 1
0: SendMessage iID: 11, iCnt: 2
0: Ackn iID: 11, iRecCounter: 2, bSendNextMess: 1
0: SendMessage iID: 23, iCnt: 3
0: Enqueue iID: 24, iCnt: 7, wasempty: 0
0: Ackn iID: 23, iRecCounter: 3, bSendNextMess: 1
0: SendMessage iID: 29, iCnt: 4
0: Ackn iID: 29, iRecCounter: 4, bSendNextMess: 1
0: SendMessage iID: 33, iCnt: 5
0: Enqueue iID: 24, iCnt: 8, wasempty: 0
0: Ackn iID: 33, iRecCounter: 5, bSendNextMess: 1
0: SendMessage iID: 10, iCnt: 6
0: Ackn iID: 10, iRecCounter: 6, bSendNextMess: 1
0: SendMessage iID: 24, iCnt: 7
0: Ackn iID: 24, iRecCounter: 7, bSendNextMess: 1
0: SendMessage iID: 24, iCnt: 8
0: Ackn iID: 24, iRecCounter: 8, bSendNextMess: 1
0: Enqueue iID: 10, iCnt: 9, wasempty: 1
0: SendMessage iID: 10, iCnt: 9
0: Ackn iID: 10, iRecCounter: 9, bSendNextMess: 1
2020-09-06 14:23:21,, server idling -------------------------------------

It would be great if you could compile this version and set up a new server like you did with, e.g., the "Jam 69c5e41 31". Thanks in advance.

storeilly commented 4 years ago

1b16dc38 is on port 22133. tcpdump is only collecting on 22133 for now. Older logs and files are now uploaded to Dropbox.

corrados commented 4 years ago

Great :-). I just did a test run on that new server. Could you please upload the log file which was generated by that Jamulus server?

storeilly commented 4 years ago

jamulus10.log Done

softins commented 4 years ago

tcpdump only collecting now on 22133 for now.

Cool :)

jamulus10.log Done

I'll fetch this and have a look at it.

storeilly commented 4 years ago

jamulus-proto-20200906-133725.pkt now up there too

corrados commented 4 years ago

That is very interesting. If you look at the end of the log file you can see:

0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
0: Enqueue iID: 18, iCnt: 128, wasempty: 0
0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
0: Enqueue iID: 18, iCnt: 129, wasempty: 0
0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
0: Enqueue iID: 18, iCnt: 130, wasempty: 0
0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
0: SendMessage iID: 24, iCnt: 70
2020-09-06 13:59:18,, server idling -------------------------------------

That is when all dummy clients are already disconnected. The server tries to get message 24 (connected clients list) to the Windows client but never gets an acknowledge packet back. During that time I wrote some chat text messages (ID 18) and you can see that they are correctly enqueued as expected. So what I can see from the log is that the Jamulus protocol works OK.

@softins Can you please check in the wireshark dump if that packet for ID 24 is sent over and over again as I can see in my log given above?

softins commented 4 years ago

Yes, that's what is happening. The CONN_CLIENTS_LIST with cnt=69 has length 1500 and is acked. In that list, clients 54 and 55 have not had their name filled in yet. When client 54's name is filled in, the list gets sent again, but is now 1509 bytes in length. It has cnt=70 and never gets acked. My suspicion is that the packet never arrives at the Windows client. Either it is getting fragmented en route and fragments get dropped, or it is just too large for an MTU somewhere in the path and gets dropped. 1509 is just over the 1500-byte limit for Ethernet frames.

Now, if instead I look at the traffic to/from a particular test client, e.g. port 22172, I can see that a list with 1509 bytes and cnt=66 gets acked correctly. This is the list with client 55's name still missing. I can then see acks of client list coming back from the test client with cnts of 67, 68 and again 68, but I can't see the outbound client list. This is because it is fragmented at the sending end. The tcpdump captures the first fragment, with packet length 1514, but my tcpdump expression evidently doesn't match the remaining fragment of the same message.

Once the client list starts to drop again (Windows client disconnects), I can see the outbound lists to the test client again, and they continue to get acked.

softins commented 4 years ago

1509 is just over the 1500 byte limit for Ethernet frames.

So would it get fragmented or just dropped? I don't know how that situation is meant to be handled.

It is supposed to be handled by the OS network stack, not by the application. If a network node wants to send an IP packet larger than the MTU for its egress interface, it is supposed to:

  1. Check the DF bit in the header. If set, discard the packet and notify the sender using ICMP. This is how path MTU discovery works for TCP.
  2. If DF is clear, divide the packet according to the MTU size and use the fragmentation bits in the IP header to identify the fragments for reassembly by the receiving end. https://www.wains.be/pub/networking/tcpdump_advanced_filters.txt has some useful description.

Apparently some routers will not allow fragmented IP datagrams. I don't know if they think it is more secure. I believe there are some kinds of DDoS attack that send unmatched fragments to fill up a router's or client's network stack memory with pending fragments.

In this case, it looks like the MTU to Volker's Windows client is a little less than the maximum, and the 1509-byte packet is not making it through. The maximum for Ethernet is 1514, including the Ethernet header: 1500 bytes payload and 14 bytes header. So the 1509-byte message (including Ethernet header) is 5 bytes under the limit. But some network connections are set instead to slightly smaller sizes such as 1492 (can't remember why). If the MTU in one hop to Volker is 1492, that would explain why the 1509-byte packet doesn't get through (although it ought still to get fragmented).

corrados commented 4 years ago

The interesting thing is that if I put a "sleep" of about a second in the loop each time I create a new dummy client, there is no issue with the Windows client. It gets all protocol packets correctly and the protocol does not get stuck. Only if I create the dummy clients in a very short time does the UDP packet seem to get blocked forever. Very strange...

corrados commented 4 years ago

Anyway... We see problems with large UDP protocol messages and this should be solved. I am thinking of introducing an extra layer in the protocol to split large messages. I'll take a look at this now.

softins commented 4 years ago

Anyway... We see problems with large UDP protocol messages and this should be solved. I am thinking of introducing an extra layer in the protocol to split large messages. I'll take a look at this now.

The problem with this is that you have no easy way of knowing what the smallest per-hop MTU is between the server and any given client, so you don't know how small to split large messages. The TCP protocol takes care of this automatically using path MTU discovery as I mentioned above.

If you are going to make non-backward compatible changes to the protocol, maybe for Jamulus V4, I would recommend redesigning it completely to use TCP connections for data/control messages and UDP for audio streams. This is how many telephony protocols work, and would allow, for example, a client to connect using TCP and establish the connection parameters before starting audio. Also, a listen-only client could then maintain a connection without having to send an audio stream to the server. The TCP messages could be textual and line-based, similarly to SIP or HTTP, which would aid parsing. The audio streams could maybe use the standard RTP format.

I don't underestimate the size of such a project, though! It would be quite substantial.

corrados commented 4 years ago

I already wrote a comment to that in another Issue, see: https://github.com/corrados/jamulus/issues/255#issuecomment-683310370

corrados commented 4 years ago

If you are going to make non-backward compatible changes to the protocol

No, it will not be non-backward compatible. I'll do it so that it will be compatible with older versions of Jamulus.

softins commented 4 years ago

I've updated my tcpdump script to capture fragmented IP packets too (which Wireshark will reassemble for display):

[root@vps2 ~]# cat capture-jamulus-proto.sh
#!/bin/sh

DATE=`date '+%Y%m%d-%H%M%S'`
FILE=jamulus-proto-$DATE.pkt

cd /var/tmp

tcpdump -C 32 -i eth0 -nn -p -s0 -w $FILE '(ip[6:2]&0x3fff) != 0 or (udp portrange 22120-22139 and (udp[8:2] == 0 and udp[4:2]-17 == (udp[14]<<8)+udp[13]))' </dev/null >/dev/null 2>&1 &

Of course, portrange 22120-22139 can be replaced by a specific port designation such as port 22124

storeilly commented 4 years ago

Thanks Tony, capture now running as above on 22133

corrados commented 4 years ago

@storeilly I have implemented a proof-of-concept of splitting large UDP messages to avoid the problem we see with a lot of clients connected. I have created a tag for it: proof_of_concept_protsplit (https://github.com/corrados/jamulus/releases/tag/proof_of_concept_protsplit). I tested it quickly, but it could contain all sorts of instability and security issues. And it is not compatible with old Jamulus versions (yet). But we may do a quick test on your server if you like. Then I could make sure that the problem is really solved. If you compile/run that code, you should shut down that Jamulus server shortly after I have done my testing.

storeilly commented 4 years ago

Running on same port 22133

corrados commented 4 years ago

Wow, you are super responsive!

But unfortunately, I still see the issue :-(. What I can see with your server "Jam 270a5c39 33" is that it takes a little longer until the mixer board faders get their labels, which is expected since we now have more "ping-pong" communication between the clients. But still, the protocol gets stuck for the Windows client. So it seems the issue may not be caused by the protocol message size after all.

Have you stored the wireshark log for that server? Maybe @softins could check that the protocol UDP packets are really smaller now and therefore verify that my implementation does work correctly.

softins commented 4 years ago

Sure, let me know when it's up, and I'll have a look sometime tomorrow (Monday)

storeilly commented 4 years ago

Wow, you are super responsive!

This is important to me, I'm speaking to the Irish Choral Association tomorrow evening promoting this! :)

This is the run command.... ./Jamulus-270a5c39 -n -d -T -l /var/log/jamulus/jamulus11.log -u 100 -e jamulusjazz.fischvolk.de:22324 -o "Jam 270a5c39 33;Dublin;104" -p 22133 -s

But it didn't write the log file!

tcpdump file uploaded to dropbox.... jamulus-proto-20200906-201952.pkt and server now stopped. That's all for tonight though!

maallyn commented 4 years ago

Folk:

I will have some free time tonight (U.S. Pacific time, when most of you are in bed). I also plan to set up a second server to kick off clients in the same data center as my dedicated server (newark-music.allyn.com), so I could bang on it with clients at a faster clip than from my current setup with clients running on my Seattle server. I am wondering what tests I could do for you, either for this ticket or for #547. I also have a question: if I want to create some test code, can I set up a separate branch without affecting the main tip and what everyone else is doing?