jamulussoftware / jamulus

Jamulus enables musicians to perform real-time jam sessions over the internet.
https://jamulus.io

Server performance & optimization #455

Closed WolfganP closed 3 years ago

WolfganP commented 4 years ago

Follows on from https://github.com/corrados/jamulus/issues/339#issuecomment-657076545, split out to keep the discussion focused.

So, as the previous issue started to explore multi-threading on the server for better use of resources, I first ran a profile of the app on Debian.

Special build with: qmake "CONFIG+=nosound headless noupcasename debug" "QMAKE_CXXFLAGS+=-pg" "QMAKE_LFLAGS+=-pg" -config debug Jamulus.pro && make clean && make -j

Then ran it as below, connecting a couple of clients for a few seconds: ./jamulus --nogui --server --fastupdate

After disconnecting the clients I gracefully killed the server: pkill -SIGTERM jamulus

And finally ran gprof, with the results posted below: gprof ./jamulus > gprof.txt

https://gist.github.com/WolfganP/46094fd993906321f1336494f8a5faed

It would be interesting if those who observed high CPU usage could run test sessions and collect profiling information as well, to detect bottlenecks and potential code optimizations before embarking on multi-threading analysis that may require major rewrites.

pljones commented 4 years ago
USER    PID  SPID CLS PRI COMMAND COMMAND 
jamulus 1509 1517 TS   39 QThread /usr/local/bin/Jamulus 
jamulus 1509 1509 RR  139 Jamulus /usr/local/bin/Jamulus 

My guess as to why this setup (which I've been using, too) has some instability in network performance is that the QThread ("high priority") socket thread is being busied out by the main Jamulus thread.

``` block ``` preserves formatting, by the way

dingodoppelt commented 4 years ago
USER         PID    SPID CLS PRI COMMAND         COMMAND
jamulus     1842    1852 TS   39 QThread         /usr/local/bin/Jamulus -s -F -n -u 25 -w <br><h1 style=text-align:center>FetteHupeBackstage</h1><p style=text-align:center>Willkommen auf dem <b>privaten</b> Jamulus Server der Big Band <b>Fette Hupe</b> aus Hannover</p><p style=text-align:center><b>https://fettehupe.de</b></p><p style=text-align:left></p> -o FetteHupe;Frankfurt;82
jamulus     1842    1842 RR  139 Jamulus         /usr/local/bin/Jamulus -s -F -n -u 25 -w <br><h1 style=text-align:center>FetteHupeBackstage</h1><p style=text-align:center>Willkommen auf dem <b>privaten</b> Jamulus Server der Big Band <b>Fette Hupe</b> aus Hannover</p><p style=text-align:center><b>https://fettehupe.de</b></p><p style=text-align:left></p> -o FetteHupe;Frankfurt;82

I relaunched the server and pasted the new output.

My guess as to why this setup (which I've been using, too) has some instability in network performance is that the QThread ("high priority") socket thread is being busied out by the main Jamulus thread.

That sounds very interesting. What priority or scheduling setting would you recommend then? I'm having issues with my two-core server and maybe I can get some "audible" improvement on that one.

WolfganP commented 4 years ago

If you add cpuid to the ps line it will also show which core the thread is assigned to: ps axwwH -eo user,pid,spid,class,cpuid,pri,comm,args | sort +4n | grep '^jamulus\|^USER'

pljones commented 4 years ago

You could apply this patch (it's a git diff, so patch -p1 to apply):

diff --git a/src/socket.h b/src/socket.h
index 232725ed..f36055d9 100755
--- a/src/socket.h
+++ b/src/socket.h
@@ -149,8 +151,9 @@ public:
     void Start()
     {
         // starts the high priority socket receive thread (with using blocking
-        // socket request call)
-        NetworkWorkerThread.start ( QThread::TimeCriticalPriority );
+        // socket request call - priority is "ignored" on linux: actually broken)
+        //NetworkWorkerThread.start ( QThread::TimeCriticalPriority );
+        NetworkWorkerThread.start();
     }

     void SendPacket ( const CVector<uint8_t>& vecbySendBuf,

If you also apply #491 you'll get the thread names (and, if you use the recorder, that won't run RT).

In the meantime, I've been trying to get pthread priorities working...

diff --git a/src/socket.h b/src/socket.h
index 232725ed..f36055d9 100755
--- a/src/socket.h
+++ b/src/socket.h
@@ -34,6 +34,8 @@
 #ifndef _WIN32
 # include <netinet/in.h>
 # include <sys/socket.h>
+# include <sched.h>
+# include <pthread.h>
 #endif

@@ -169,7 +172,18 @@ protected:
     {
     public:
         CSocketThread ( CSocket* pNewSocket = nullptr, QObject* parent = nullptr ) :
-          QThread ( parent ), pSocket ( pNewSocket ), bRun ( true ) { setObjectName ( "CSocketThread" ); }
+          QThread ( parent ), pSocket ( pNewSocket ), bRun ( true )
+       {
+           setObjectName ( "CSocketThread" );
+
+#ifdef SCHED_RR
+struct sched_param param;
+param.sched_priority = sched_get_priority_max ( SCHED_RR ) -
+    ( sched_get_priority_max ( SCHED_RR ) - sched_get_priority_min ( SCHED_RR ) ) / 5;
+qInfo() << "SCHED_RR defined, setting sched_priority to" << param.sched_priority;
+pthread_setschedparam( pthread_self(), SCHED_RR, &param );
+#endif
+       }

         void Stop()
         {

That had no effect...
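
One plausible explanation (an assumption on my part, not verified against the Jamulus code): pthread_setschedparam returns an error code rather than setting errno, and the patch above discards it. On Linux the call fails with EPERM unless the process has CAP_SYS_NICE or a suitable RLIMIT_RTPRIO, which would look exactly like "no effect". A minimal standalone sketch that surfaces the error:

```cpp
// Sketch only: same priority calculation as the patch above, but checking
// the return value of pthread_setschedparam. EPERM here means the kernel
// refused the real-time scheduling change (missing CAP_SYS_NICE / rtprio limit).
#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <cstring>

int main()
{
    sched_param param {};
    param.sched_priority = sched_get_priority_max ( SCHED_RR ) -
        ( sched_get_priority_max ( SCHED_RR ) - sched_get_priority_min ( SCHED_RR ) ) / 5;

    const int ret = pthread_setschedparam ( pthread_self(), SCHED_RR, &param );

    if ( ret != 0 )
    {
        std::fprintf ( stderr, "pthread_setschedparam failed: %s\n", std::strerror ( ret ) );
        return 1;
    }

    std::printf ( "SCHED_RR priority set to %d\n", param.sched_priority );
    return 0;
}
```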

dingodoppelt commented 4 years ago
USER         PID    SPID CLS CPUID PRI COMMAND         COMMAND
jamulus    11844   12162 TS      1  39 CHighPrecisionT /usr/local/bin/Jamulus -s -n -F -u 25 -w <br><h1 style=text-align:center>FetteHupe</h1><p style=text-align:center>Willkommen auf dem Jamulus Server der Big Band <b>Fette Hupe</b> aus Hannover</p><p style=text-align:center><b>https://fettehupe.de</b></p><p style=text-align:left></p> -e jamulus.fischvolk.de -o FetteHupe;Frankfurt;82
jamulus    11844   12164 RR      1 139 Thread (pooled) /usr/local/bin/Jamulus -s -n -F -u 25 -w <br><h1 style=text-align:center>FetteHupe</h1><p style=text-align:center>Willkommen auf dem Jamulus Server der Big Band <b>Fette Hupe</b> aus Hannover</p><p style=text-align:center><b>https://fettehupe.de</b></p><p style=text-align:left></p> -e jamulus.fischvolk.de -o FetteHupe;Frankfurt;82
jamulus    11844   12163 RR      2 139 Thread (pooled) /usr/local/bin/Jamulus -s -n -F -u 25 -w <br><h1 style=text-align:center>FetteHupe</h1><p style=text-align:center>Willkommen auf dem Jamulus Server der Big Band <b>Fette Hupe</b> aus Hannover</p><p style=text-align:center><b>https://fettehupe.de</b></p><p style=text-align:left></p> -e jamulus.fischvolk.de -o FetteHupe;Frankfurt;82
jamulus    11844   11844 RR      3 139 Jamulus         /usr/local/bin/Jamulus -s -n -F -u 25 -w <br><h1 style=text-align:center>FetteHupe</h1><p style=text-align:center>Willkommen auf dem Jamulus Server der Big Band <b>Fette Hupe</b> aus Hannover</p><p style=text-align:center><b>https://fettehupe.de</b></p><p style=text-align:left></p> -e jamulus.fischvolk.de -o FetteHupe;Frankfurt;82
jamulus    11844   11856 TS      3  39 QThread         /usr/local/bin/Jamulus -s -n -F -u 25 -w <br><h1 style=text-align:center>FetteHupe</h1><p style=text-align:center>Willkommen auf dem Jamulus Server der Big Band <b>Fette Hupe</b> aus Hannover</p><p style=text-align:center><b>https://fettehupe.de</b></p><p style=text-align:left></p> -e jamulus.fischvolk.de -o FetteHupe;Frankfurt;82
jamulus    11844   12165 RR      3 139 Thread (pooled) /usr/local/bin/Jamulus -s -n -F -u 25 -w <br><h1 style=text-align:center>FetteHupe</h1><p style=text-align:center>Willkommen auf dem Jamulus Server der Big Band <b>Fette Hupe</b> aus Hannover</p><p style=text-align:center><b>https://fettehupe.de</b></p><p style=text-align:left></p> -e jamulus.fischvolk.de -o FetteHupe;Frankfurt;82
jamulus    11844   12166 RR      3 139 Thread (pooled) /usr/local/bin/Jamulus -s -n -F -u 25 -w <br><h1 style=text-align:center>FetteHupe</h1><p style=text-align:center>Willkommen auf dem Jamulus Server der Big Band <b>Fette Hupe</b> aus Hannover</p><p style=text-align:center><b>https://fettehupe.de</b></p><p style=text-align:left></p> -e jamulus.fischvolk.de -o FetteHupe;Frankfurt;82

Here is one more with 25 clients connected.

@pljones : Thanks, I'll try that on my two core server first to listen for improvement :)

dingodoppelt commented 4 years ago

@pljones : this is ps with all your patches applied

USER         PID    SPID CLS CPUID PRI COMMAND         COMMAND
jamulus    13406   13418 RR      1 120 CSocketThread   /usr/local/bin/Jamulus -s -n -F -u 25 -w <br><h1 style=text-align:center>FetteHupe</h1><p style=text-align:center>Willkommen auf dem Jamulus Server der Big Band <b>Fette Hupe</b> aus Hannover</p><p style=text-align:center><b>https://fettehupe.de</b></p><p style=text-align:left></p> -e jamulus.fischvolk.de -o FetteHupe;Frankfurt;82
jamulus    13406   13422 RR      1 120 Thread (pooled) /usr/local/bin/Jamulus -s -n -F -u 25 -w <br><h1 style=text-align:center>FetteHupe</h1><p style=text-align:center>Willkommen auf dem Jamulus Server der Big Band <b>Fette Hupe</b> aus Hannover</p><p style=text-align:center><b>https://fettehupe.de</b></p><p style=text-align:left></p> -e jamulus.fischvolk.de -o FetteHupe;Frankfurt;82
jamulus    13406   13430 RR      1 120 Thread (pooled) /usr/local/bin/Jamulus -s -n -F -u 25 -w <br><h1 style=text-align:center>FetteHupe</h1><p style=text-align:center>Willkommen auf dem Jamulus Server der Big Band <b>Fette Hupe</b> aus Hannover</p><p style=text-align:center><b>https://fettehupe.de</b></p><p style=text-align:left></p> -e jamulus.fischvolk.de -o FetteHupe;Frankfurt;82
jamulus    13406   13420 RR      2 120 Thread (pooled) /usr/local/bin/Jamulus -s -n -F -u 25 -w <br><h1 style=text-align:center>FetteHupe</h1><p style=text-align:center>Willkommen auf dem Jamulus Server der Big Band <b>Fette Hupe</b> aus Hannover</p><p style=text-align:center><b>https://fettehupe.de</b></p><p style=text-align:left></p> -e jamulus.fischvolk.de -o FetteHupe;Frankfurt;82
jamulus    13406   13426 RR      2 120 Thread (pooled) /usr/local/bin/Jamulus -s -n -F -u 25 -w <br><h1 style=text-align:center>FetteHupe</h1><p style=text-align:center>Willkommen auf dem Jamulus Server der Big Band <b>Fette Hupe</b> aus Hannover</p><p style=text-align:center><b>https://fettehupe.de</b></p><p style=text-align:left></p> -e jamulus.fischvolk.de -o FetteHupe;Frankfurt;82
jamulus    13406   13406 RR      3 120 Jamulus         /usr/local/bin/Jamulus -s -n -F -u 25 -w <br><h1 style=text-align:center>FetteHupe</h1><p style=text-align:center>Willkommen auf dem Jamulus Server der Big Band <b>Fette Hupe</b> aus Hannover</p><p style=text-align:center><b>https://fettehupe.de</b></p><p style=text-align:left></p> -e jamulus.fischvolk.de -o FetteHupe;Frankfurt;82
jamulus    13406   13419 TS      3  39 CHighPrecisionT /usr/local/bin/Jamulus -s -n -F -u 25 -w <br><h1 style=text-align:center>FetteHupe</h1><p style=text-align:center>Willkommen auf dem Jamulus Server der Big Band <b>Fette Hupe</b> aus Hannover</p><p style=text-align:center><b>https://fettehupe.de</b></p><p style=text-align:left></p> -e jamulus.fischvolk.de -o FetteHupe;Frankfurt;82

(screenshot: stresstest)

WolfganP commented 4 years ago

Thanks @dingodoppelt , good info. What puzzles me is that the ps output and the htop screenshot don't match (different process IDs involved, server 13406 vs htop 1346*). Are they from different machines or sessions? Are the clients running on the same machine as the server?

BTW, in htop you can add columns for PROCESSOR and NLWP (number of threads) that may be useful for the analysis as well.

dingodoppelt commented 4 years ago

(screenshot: Screenshot_20200805_201639)

@WolfganP : this is htop with the suggested columns added. I've relaunched the server and patched it a few times yesterday (this morning, actually ;) so it could be I mixed up some screenshots and command outputs. Here it is all together in one post:

USER         PID    SPID CLS CPUID PRI COMMAND         COMMAND
jamulus    15106   15369 RR      0 139 Thread (pooled) /usr/local/bin/Jamulus -s -n -F -u 25 -w <br><h1 style=text-align:center>FetteHupe</h1><p style=text-align:center>Willkommen auf dem Jamulus Server der Big Band <b>Fette Hupe</b> aus Hannover</p><p style=text-align:center><b>https://fettehupe.de</b></p><p style=text-align:left></p> -e jamulus.fischvolk.de -o FetteHupe;Frankfurt;82
jamulus    15106   15372 RR      0 139 Thread (pooled) /usr/local/bin/Jamulus -s -n -F -u 25 -w <br><h1 style=text-align:center>FetteHupe</h1><p style=text-align:center>Willkommen auf dem Jamulus Server der Big Band <b>Fette Hupe</b> aus Hannover</p><p style=text-align:center><b>https://fettehupe.de</b></p><p style=text-align:left></p> -e jamulus.fischvolk.de -o FetteHupe;Frankfurt;82
jamulus    15106   15106 RR      2 139 Jamulus         /usr/local/bin/Jamulus -s -n -F -u 25 -w <br><h1 style=text-align:center>FetteHupe</h1><p style=text-align:center>Willkommen auf dem Jamulus Server der Big Band <b>Fette Hupe</b> aus Hannover</p><p style=text-align:center><b>https://fettehupe.de</b></p><p style=text-align:left></p> -e jamulus.fischvolk.de -o FetteHupe;Frankfurt;82
jamulus    15106   15368 RR      2 139 CHighPrecisionT /usr/local/bin/Jamulus -s -n -F -u 25 -w <br><h1 style=text-align:center>FetteHupe</h1><p style=text-align:center>Willkommen auf dem Jamulus Server der Big Band <b>Fette Hupe</b> aus Hannover</p><p style=text-align:center><b>https://fettehupe.de</b></p><p style=text-align:left></p> -e jamulus.fischvolk.de -o FetteHupe;Frankfurt;82
jamulus    15106   15371 RR      2 139 Thread (pooled) /usr/local/bin/Jamulus -s -n -F -u 25 -w <br><h1 style=text-align:center>FetteHupe</h1><p style=text-align:center>Willkommen auf dem Jamulus Server der Big Band <b>Fette Hupe</b> aus Hannover</p><p style=text-align:center><b>https://fettehupe.de</b></p><p style=text-align:left></p> -e jamulus.fischvolk.de -o FetteHupe;Frankfurt;82
jamulus    15106   15107 RR      3 139 CSocketThread   /usr/local/bin/Jamulus -s -n -F -u 25 -w <br><h1 style=text-align:center>FetteHupe</h1><p style=text-align:center>Willkommen auf dem Jamulus Server der Big Band <b>Fette Hupe</b> aus Hannover</p><p style=text-align:center><b>https://fettehupe.de</b></p><p style=text-align:left></p> -e jamulus.fischvolk.de -o FetteHupe;Frankfurt;82
jamulus    15106   15370 RR      3 139 Thread (pooled) /usr/local/bin/Jamulus -s -n -F -u 25 -w <br><h1 style=text-align:center>FetteHupe</h1><p style=text-align:center>Willkommen auf dem Jamulus Server der Big Band <b>Fette Hupe</b> aus Hannover</p><p style=text-align:center><b>https://fettehupe.de</b></p><p style=text-align:left></p> -e jamulus.fischvolk.de -o FetteHupe;Frankfurt;82
corrados commented 4 years ago

I enabled it now and it really keeps the CPU usage down and nicely balanced.

This is really good news. This is what I wanted to achieve with the threading implementation. The rest is fine-tuning (thread priorities, thread synchronization, etc.).

What I still found is that audio quality degrades at a certain number of clients, depending on their use of small network buffers, even though the CPU cores aren't maxed out. But I don't know how much I can trust my tests since I'm running all clients on one machine and internet connection.

Running so many clients on the same machine, I would expect that you get audio issues. A better test would be to have a lot of individual clients connecting to a server which has a fast CPU and a high network bandwidth available. Maybe we should ask @sthenos to run one of his Waiting Rooms at the World Jam in qmake "CONFIG+=multithreading" mode.

corrados commented 4 years ago

I just did a quick test with my laptop, which has a dual-core CPU with hyper-threading (it shows 4 cores in htop). I connected 20 clients from my Windows PC over ethernet. In normal mode, the CPU load of the Jamulus server was about 80-90%. In multithreading mode, the total Jamulus server load was about 180-190%, but the load was equally distributed over the cores. So it seems we still have some threading overhead, although not as much as we saw with the OMP implementation. With a thread overhead of 100% you would not gain much on a dual-core CPU, but if you have 4 or more cores, then I assume the multithreading Jamulus server will be able to handle more clients (which has to be proven by real-world tests, I guess).

WolfganP commented 4 years ago

100% overhead is too much, almost weird. I wonder if the concurrent threads are dividing up the work or just processing the whole array of clients' inputs in parallel unnecessarily...

pljones commented 4 years ago

Each connected client needs the audio from all the connected clients to create its own mix.

So, for N audio inputs, you need to:

* read and decompress N streams
* mix N streams of input to N streams of output
* compress and send N streams

Each of those tasks can be parallelised but there's a hard serialisation point between them where all the data needs exchanging with the next step - for each frame.

  R      M      s
 /-\    /-\    /-\
.---.->.---.->.---.
 \-/    \-/    \-/

You want to limit the number of threads to a count based on the available hardware cores (not virtual cores: a four-core CPU with hyperthreading still means 4 cores) - as otherwise you drive the CPU into overload with task switching. You also need to reserve a thread / core for "other work". Looks like there are three "pools" for threads - read / mix / send, then distribute over that.
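
For reference, a minimal sketch of capping the Qt thread pool along those lines (the function name and the halving heuristic are my own illustration, not Jamulus code; QThread::idealThreadCount() reports logical cores):

```cpp
// Sketch only: cap the global Qt thread pool to the hardware cores,
// reserving one core for "other work" as suggested above.
#include <QThread>
#include <QThreadPool>
#include <algorithm>

void LimitWorkerThreads ( const bool bIgnoreHyperThreading )
{
    int iCores = QThread::idealThreadCount(); // logical cores

    if ( bIgnoreHyperThreading )
    {
        iCores /= 2; // assumption: two logical cores per physical core
    }

    // keep at least one worker and leave one core for the rest of the server
    const int iWorkers = std::max ( 1, iCores - 1 );

    QThreadPool::globalInstance()->setMaxThreadCount ( iWorkers );
}
```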

Now I have to go read the code, I guess 😁 .

WolfganP commented 4 years ago

So, for N audio inputs, you need to:

* read and decompress N streams

* mix N streams of input to N streams of output

* compress and send N streams

Each of those tasks can be parallelised but there's a hard serialisation point between them where all the data needs exchanging with the next step - for each frame.

Completely agree. Due to the nature of the data and flow, there are 3 clear serial steps that will benefit from each being parallelized on its own inside the OnTimer ticks. It also implies that data structures need to be optimized for that parallelization (i.e. global preallocated buffers/arrays accessible by all parallel threads), with one thread being reserved for UI / I/O / ancillary concurrent functions.

pljones commented 4 years ago

A brief look and it seems that the timer calls "read and decompress N streams", then pushes "mix N streams of input to N streams of output" and "compress and send N streams" into a concurrent future, and then waits for that to complete. To me, that looks like a single serial operation. I think something more complex like QtConcurrent::map might be better (three of them, serialised), once the code structure fits.

I like the QtConcurrent use generally, though - it looks like it takes a lot of the pain out of implementing the details (the core balancing, etc).
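
A rough sketch of the three serialised map stages described above, using a hypothetical CChannelWork type rather than the real Jamulus classes (an illustration of the idea, not the current implementation):

```cpp
// Sketch only: each QtConcurrent::blockingMap spreads one stage over the
// thread pool and returns when every channel is done -- that return is the
// hard serialisation point between stages, once per frame.
#include <QtConcurrent/QtConcurrent>
#include <QVector>

struct CChannelWork { /* per-client buffers, codec state, ... */ };

static void DecodeOne  ( CChannelWork& ) { /* read + decompress one input stream */ }
static void MixOne     ( CChannelWork& ) { /* build this client's personal mix   */ }
static void EncodeSend ( CChannelWork& ) { /* compress + send one output stream  */ }

void ProcessFrame ( QVector<CChannelWork>& vecChannels )
{
    QtConcurrent::blockingMap ( vecChannels, DecodeOne );
    QtConcurrent::blockingMap ( vecChannels, MixOne );
    QtConcurrent::blockingMap ( vecChannels, EncodeSend );
}
```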

corrados commented 4 years ago

Audio encoding takes much more time than decoding. What is implemented right now is that the audio decoding of all input streams is done in one thread. Then the mixing and encoding of each individual output stream is done in separate threads. So it is actually not three steps as you depicted but only two. I actually don't see a need to do parallel processing of the first part since this would introduce additional thread overhead.

pljones commented 4 years ago

I don't think it's done in multiple threads - QtConcurrent::run() appears to run the entire workload in a single thread, from the way I read the manual. QtConcurrent::map() could iterate over a list, processing each entry in a separate thread, balancing across available resources.

At the moment, each OnTimer() call reads all the data in then creates a single thread to mix, encode and send that data, waiting for the thread to complete. So there are two threads: the timer thread itself and the mix/encode/send thread.

If each "tick" is kicking off a separate OnTimer() thread, then that might work -- but it's still a serialised process path as it stands, as far as I can tell.


[thread collapse edit] I will say I was wrong on the structure though - it is just the one sync point, not two, so

  R      M  S
 /-\    /----\
*---*->*------*
 \-/    \----/

[thread collapse edit] OK, I see I'd missed the outer for() {} block that contains the ::run(). But I think it would still be better replaced with a ::map() - seems more "in style".

corrados commented 4 years ago

Have any real-world tests been done with the new multithreading implementation? I am curious whether we can serve more clients with the multithreading option enabled on a multi-core machine (>= 4 cores) compared to the Jamulus server running on the same CPU with the multithreading option disabled.

If anybody has already performed this type of test, please share your experiences here.

WolfganP commented 4 years ago

As a kind of makeshift lab, I'm setting up a headless Raspberry Pi to host multiple clients and leave the server on a separate machine, but I'm still struggling with the audio part (no external audio interface other than the integrated bcm2835) to get a stable setup (following your https://github.com/corrados/jamulus/blob/master/distributions/raspijamulus.sh as guidance, btw).

pljones commented 4 years ago

Can the multithreading_testing branch get rebased onto / merged from latest master at some point, too, please? (With the modules, as otherwise changing branches complains...)

corrados commented 4 years ago

done

pljones commented 4 years ago

OK, running with the latest version and 8 clients: https://drive.google.com/file/d/1vYHyFfiD1vztk-QN0ECeVFzZ2ADI-9vy/view?usp=sharing (severe apologies for the image... only had my phone handy as a terminal and this was about the only way to grab the data).

%CPU is 4.0 for CSocketThread and JamulusServer; ~1.8 for the pooled threads and 0.2 for CHighPrecisionTimer -- those are running at SCHED_FIFO priority 60; %CPU for the JamRecorder is 0.7 running SCHED_OTHER, nice 0. The spread across the cores is good: CSocketThread on 0, JamulusServer on 3, CHighPrecisionTimer on 1, pool using 0, 2, 4 and 5, with the JamRecorder on 2.

I didn't get the overall system load figures, unfortunately.

storeilly commented 4 years ago

I ran a test with assistance on 4 instances on the same AWS server tonight (32 GB RAM, 8 vCPUs). 2 of the instances were compiled with multithreading and the others normal. The normal servers had good sound up to approx. 35 users (virtual users and two normal users). The multithreaded servers had a lower limit of about 25 before the sound deteriorated. I was watching htop as we added users. The multithreaded servers seemed to spread the load and use all CPUs until about 15-20 users, then appeared to reach some point where most of the processing jumped onto one CPU. I'm going to check I compiled correctly and re-test in a few days. Hope this is of some help!

WolfganP commented 4 years ago

@storeilly if you can test with similar loads (i.e. 25 people on one single-threaded server and then a similar load on a multi-threaded server), could you please check the total system load (screenshots of htop would be even better)? The total shouldn't be that different (e.g. 80% on one core vs ~20% on each of 4 cores).

The idea is to validate that the system is distributing the load well, and not just replicating the same work on different cores.

storeilly commented 4 years ago

I considered screenshots but the load was jumping across CPUs and a video might be better. It didn't seem balanced though, which leads me to think I've made a mistake. Is it possible to move the control for multithreading into a command line switch instead of a build config switch? Thanks guys for the great work! 👍


corrados commented 4 years ago

The multithreaded servers had a lower limit before the sound deteriorated of about 25.

Thank you for that test. That shows that the current multithreading implementation does not give you any improvement. I'll mark this feature as "experimental" in qmake for now.

I have some more ideas to improve the situation with multithreading. I'll do some more tests.

Is it possible to move the control for multithreading into a command line switch instead of a build config switch?

Yes, I think this should be possible.

corrados commented 4 years ago

@storeilly I have just implemented another test. I now do not create a thread for each client but I work on blocks of clients. With the current test code I create a new thread if the number of clients exceeds 20. So if you have 50 clients, you will get two threads working on 20 clients and one thread working on 10 clients. I hope that this reduces the threading overhead.
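
A rough illustration of that block scheme (hypothetical names, not the actual server.cpp code):

```cpp
// Sketch only: split the connected clients into blocks of at most 20 and
// start one pool task per block, so e.g. 50 clients give two blocks of 20
// plus one block of 10. Waiting on the futures is the per-frame sync point.
#include <QtConcurrent/QtConcurrent>
#include <QFuture>
#include <QVector>

static const int iClientsPerThread = 20;

static void ProcessClientBlock ( const int iStartIdx, const int iEndIdx )
{
    // placeholder: mix/encode the clients in [iStartIdx, iEndIdx)
    Q_UNUSED ( iStartIdx )
    Q_UNUSED ( iEndIdx )
}

void ProcessAllClients ( const int iNumClients )
{
    QVector<QFuture<void>> vecFutures;

    for ( int iStart = 0; iStart < iNumClients; iStart += iClientsPerThread )
    {
        const int iEnd = qMin ( iStart + iClientsPerThread, iNumClients );
        vecFutures.append ( QtConcurrent::run ( ProcessClientBlock, iStart, iEnd ) );
    }

    for ( QFuture<void>& future : vecFutures )
    {
        future.waitForFinished();
    }
}
```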

I put the changes on our multithreading testing branch: https://github.com/corrados/jamulus/tree/multithreading_improvements I also created a new tag for the test code: https://github.com/corrados/jamulus/releases/tag/multithreading_testing2

Unfortunately, I cannot test whether the new code works correctly. So it may be that it does not work correctly or even crashes, although I am pretty confident that it should work just fine. If you are interested in testing this new code like you did before, it would be great to get some feedback.

pljones commented 4 years ago

It's running (and I can, at least, connect) on my server. I'll see how it behaves over the weekend.

By the way, did you explicitly push the tag? I don't see it.

corrados commented 4 years ago

Great. But which server is it? The jamulus.drealm.info? And yes, I pushed the tag. And it is listed on Github: https://github.com/corrados/jamulus/tags

pljones commented 4 years ago

Ah, right - git pull needs --tags, too. I can probably config it on by default...

And yes - I'm not putting anything but release builds on the server lists.

corrados commented 4 years ago

One thing I want to mention when it comes to the maximum supported number of connected clients: the worst case for the server is when it is running in "--fastupdate" mode but most of the clients use 128-sample buffers. If the server also runs on a 128-sample block size, on each timer call it has about 2.6 ms to decode/encode the OPUS packets. But if the server runs on a 64-sample block size and receives a 128-sample block, it stores the big block and only decodes/encodes on every second timer call. In that case the time to finish the decoding/encoding of the 128-sample block is just about 1.3 ms (i.e. half the time). Therefore you may hit the sound deterioration limit earlier.
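
For reference, the arithmetic behind those figures, assuming the 48 kHz sample rate (a sketch, not project code):

```cpp
// Frame budgets at 48 kHz behind the numbers above:
//   128 samples / 48000 Hz = ~2.67 ms  (server running on 128-sample blocks)
//    64 samples / 48000 Hz = ~1.33 ms  (server with --fastupdate, 64-sample blocks)
// A 128-sample client packet on a --fastupdate server is only decoded/encoded
// on every second timer call, so that work has to fit into ~1.33 ms.
constexpr double dSampleRateHz = 48000.0;
constexpr double dBudget64Ms   = 64.0  / dSampleRateHz * 1000.0; // ~1.33 ms
constexpr double dBudget128Ms  = 128.0 / dSampleRateHz * 1000.0; // ~2.67 ms
```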

So it would make sense to do your tests with/without the --fastupdate flag.

dingodoppelt commented 4 years ago

I just ran sysbench on my two servers as well as my PC, and what was audible on the servers was visible in the sysbench numbers (the cheapest one not being able to run on small buffers). Does anybody know of a benchmark that would produce numbers we can compare to predict the performance of the Jamulus server? I don't know which benchmarks to try because I don't know which would simulate or benchmark the features we are interested in. The Phoronix Test Suite offers a multitude of benchmarks to choose from. Has anybody done tests of that kind already?

storeilly commented 4 years ago

I've set up a group of test servers on the same AWS server for another go at this test later this week: 6 in total. 4 with all combinations of fast updates on/off and multithreading on/off, and the last 2 at r3_5_10 with and without -F. Have you any suggestions of what to record other than screenshots of htop?

corrados commented 4 years ago

@storeilly Please use the tag multithreading_testing2 instead of r3_5_10 since r3_5_10 does not include the latest changes to the multithreading code.

storeilly commented 4 years ago

I am using that tag on 4 of the instances; only 2 of them have 3.5.10 for comparison.


storeilly commented 4 years ago

We ran out of time testing last night and only got through 3 instances. Provisional numbers are...

Key: S = above tag built without multithreading, T = above tag with multithreading, F = fast updates enabled on server, # = fast updates disabled on server, f = fast updates enabled on client, ! = fast updates disabled on client.

S# f  23 clients
S# !  36 clients
SF f  21 clients
SF !  40 clients
T# f  60 clients
T# !  72 clients

We hope to hit that last server and monitor htop before the weekend, however it looks great with 72 clients!!


corrados commented 4 years ago

Wow! That looks very promising :-) Thank you for your tests.

BTW: There is a magic number in the multithreading code which might also be adjusted to support more clients: https://github.com/corrados/jamulus/blob/multithreading_improvements/src/server.cpp#L1024. Right now I have set it to 20 clients per CPU core, but maybe it makes sense to increase this number, maybe to 25 or 30.

WolfganP commented 4 years ago

I'm still fighting with alsa's asoundrc to create a virtual soundcard that allows me to launch multiple clients with a wav sound source for "close to real life" testing, and have a way to measure performance under high loads in a "lab" environment (multiple jamulus clients in one machine ingesting a wav audio file, a jamulus server in another machine to allow performance measurement of diff strategies/versions).

So, if anyone has any deep understanding with alsa's plugins (particularly file) any help will be appreciated. Thx!

storeilly commented 4 years ago

We did have an anomaly that has me puzzled though. A couple of times during the client add and removal I could not connect to the server and hear sound. The issue presented itself only on the third server and with both my connections. I'm using W10 with a focusrite 4i4 as primary and a Raspberry pi with a UM2. Both setups worked fine on my other instances before and after, when the server was empty it was also fine. My friend Tord could hear me fine.

storeilly commented 4 years ago

I'm still fighting with alsa's asoundrc to create a virtual soundcard that allows me to launch multiple clients with a wav sound source for "close to real life" testing, and have a way to measure performance under high loads in a "lab" environment (multiple jamulus clients in one machine ingesting a wav audio file, a jamulus server in another machine to allow performance measurement of diff strategies/versions).

So, if anyone has any deep understanding with alsa's plugins (particularly file) any help will be appreciated. Thx!

I've used jaaa to inject a tone from a headless server with jack using a dummy driver

maallyn commented 4 years ago

Folks: I now have this tag running on a server on the U.S. East Coast (Newark NJ; Linode; 4 CPU, 8 GB). It is newark-music.allyn.com.

Here are the steps that I did:

On Wednesday, August 20 at about 11 AM I did the following:

git clone https://github.com/corrados/jamulus.git
cd into jamulus
git fetch --all --tags
git tag (just to confirm that multithreading_testing2 is there)
git checkout multithreading_testing
sudo apt-get install build-essential qt5-qmake qtdeclarative5-dev libjack-jackd2-dev qt5-default
qmake "CONFIG+=nosound multithreading" Jamulus.pro
make clean
make

The server is set for a maximum capacity of 100 users.

Server is Linode Ubuntu 16.04 LTS, Linode 8GB: 4 CPU, 160GB Storage, 8GB RAM

Server name is newark-music.allyn.com

You are welcome to try to stress it and break it.

Unfortunately, I don't know how to create the many virtual clients. I tested with two real clients on the same Windows PC.

I do have a question; is it appropriate for me to ask people to try to break this on our Facebook groups?

Thank you Mark Allyn

maallyn commented 4 years ago

I did have a comment about my setup on newark-music.allyn.com; this from Mats Wessling via Facebook: "testing it from sweden now and there is a very weird effect: the relays lags! if i lower my level it takes it only lowers very slowly and gradually! almost same with mute: mute is also delayed but is abrupt (not gradually) my delay is 111ms ping 98 ms"

This got me curious. I got on from my home location in Bellingham, Washington, which has 141 ms total delay. What I did notice is that if I quickly push the fader all the way down, it seems to take about 1/2 to close to 1 second for the mute light to come on above my slider, and about the same time for me to fade out in volume. The mute button and the mute myself button seem to act faster. The delay in the volume control is noticeably greater than the delay between me clapping into the microphone and hearing my clap coming back.

I am wondering if there is a difference between the way the faders are handled by the server vs the way that either the mute myself button or the mute button on each person's fader in the main panel are being handled.

storeilly commented 4 years ago

We did have an anomaly that has me puzzled though....

I've done some tests on the multithreading build and it appears that only channel 0 (or the lowest connected channel) gets to hear audio. I can only connect a handful of clients tonight (5 max: 1x Pi and 4 from the PC). This might affect results for client count if the server is not processing audio for all clients. I'm looking into how I can route the clients through jack to monitor the output, but in the meantime @corrados could you have a look at the code please to see if there is a reason for this? If you PM me here or on FB, I can give you the IP of my server (it's on a private chain) if you want a look.

corrados commented 4 years ago

We did have an anomaly [...] only channel 0 (or the lowest connected channel) gets to hear audio

Is this something you did not see before, i.e. is it new? Did you also have this when you did the test you reported here: https://github.com/corrados/jamulus/issues/455#issuecomment-677373376

I can only connect a handful of clients tonight (5 max 1x Pi and 4 from the PC)

Is this some temporary effect then?

in the meantime @corrados could you have a look at the code please to see if there is a reason for this?

I am not at home right now. I can do it as soon as I am back (in some days).

dingodoppelt commented 4 years ago

only channel 0 (or the lowest connected channel) gets to hear audio.

I've had that problem, too. @maallyn: I checked your server just now and it has the same problem. Only the first channel receives audio.

maallyn commented 4 years ago

Thank you for letting me know. Tomorrow, I will restore the server back to the tip of the git tree and not use multithreading, as the server is useless as it is now. It's 2 AM here in Bellingham and I am too tired to touch that server without disaster.

storeilly commented 4 years ago

Thanks

Is this something you did not see before, i.e. is it new? Did you also have this when you did the test you reported here: #455 (comment)

Yes, but as there were only two of us testing (using Jamulus and Zoom to communicate), we knew there was a problem but I assumed it was on my end. Only further testing identified the actual issue. We ran out of time and I reported here.

I can only connect a handful of clients tonight (5 max: 1x Pi and 4 from the PC). Is this some temporary effect then?

No, I don't have access to the multiple-client setup to test further.

And no, I've had about seven live clients connect last night to verify the issue; only the leftmost client could hear everyone. I've also tested maallyn's server and it is the same.

Further info... Disconnecting the leftmost client restores incoming sound to the next in line to the right, and so on; but when a client reconnects into a free slot (GetFreeChan) it gets a lower slot and the sound. There is a further and possibly linked issue with the pan and VU meters of the clients without sound, but I did not go too deep into diagnosing this as it is most likely related to this issue.

I am not at home right now. I can do it as soon as I am back (in some days).

That's fine! - A solution that takes time is usually more effective in my book. Thank you sir and enjoy your break!

maallyn commented 4 years ago

Folks: I don't know if this is relevant, but I got comments on Facebook that there is something weird with the behavior of Linode droplets in that there is periodic distortion of the sound. I recompiled my instance at newark-music.allyn.com from the tip (took out the multithreading-2 tag and compile flag). This way, the server is running the production code. I will leave it this way until I hear from this ticket that there is an update to this branch. In the meantime, I am offering to temporarily upgrade that droplet to a dedicated CPU to see if there are audio glitches.

storeilly commented 4 years ago

Hi Mark, For stability you may have more success compiling from a release tag (r3_5_10) if you are not sure of the reliability of the platform. I only use released versions on my servers unless testing a particular feature such as this.


maallyn commented 4 years ago

Storeilly: Thank you. I just re-cloned at the r3_5_10 tag, re-compiled without the multi-threading, and re-started the server. Next week, I will temporarily upgrade to a 2-CPU dedicated server and ask everyone to try to replicate the audio quality problem. If that problem persists, then we know it may be a Linode-specific problem, and my testing of any future patches on the multi-threading work may not be valid unless I move to OVH or Vultr.

corrados commented 4 years ago

@storeilly Too bad, I had a bug in the multithreading Jamulus server which caused the server not to process the connected clients' audio correctly. Therefore your test with, e.g., "T# ! 72 clients" unfortunately did not give any useful results with that previous buggy Jamulus code, since the OPUS encoding for the clients was not done at all.

Anyway, I think I have fixed the bug and I have created a new label now: https://github.com/corrados/jamulus/releases/tag/multithreading_testing3

Sorry for the inconvenience.