jamulussoftware / jamulus

Jamulus enables musicians to perform real-time jam sessions over the internet.
https://jamulus.io

Server performance & optimization #455

Closed WolfganP closed 3 years ago

WolfganP commented 4 years ago

Follows on from https://github.com/corrados/jamulus/issues/339#issuecomment-657076545 to keep the discussion better focused.

So, as the previous issue started to explore multi-threading on the server for better use of resources, I first ran a profiling pass of the app on Debian.

Special build with: qmake "CONFIG+=nosound headless noupcasename debug" "QMAKE_CXXFLAGS+=-pg" "QMAKE_LFLAGS+=-pg" -config debug Jamulus.pro && make clean && make -j

Then I ran it as below and connected a couple of clients for a few seconds: ./jamulus --nogui --server --fastupdate

Once the clients had disconnected, I gracefully killed the server: pkill -sigterm jamulus

And finally ran gprof, with the results posted below: gprof ./jamulus > gprof.txt

https://gist.github.com/WolfganP/46094fd993906321f1336494f8a5faed

It would be interesting if those who observed high CPU usage could also run test sessions and collect profiling information, to detect bottlenecks and potential code optimizations before embarking on multi-threading analysis that may require major rewrites.

storeilly commented 4 years ago

I'll leave the proper report to the expert @softins, as I only dabble, but I will say that there are 4 packets with a frame length of 1518 (CLM_CONN_CLIENTS_LIST with 59 clients and CONN_CLIENTS_LIST with 59 clients) and many fragmented/reassembled packets.

softins commented 4 years ago

Not quite sure what was happening on that test (20200906-201952), but it looks the same as the previous test. Either the server didn't have Volker's latest test changes, or they didn't have the desired effect. The size of the CONN_CLIENTS_LIST grows until all but one of the names have been filled in, at which point the whole packet is 1509 bytes long. This gets acked by the test clients, but not by the Windows client. When the last name is filled in, the list gets fragmented into two IP packets, the first of which is 1514 bytes long. These messages continue to get acked by the test clients, but the communication with the Windows client is stalled.

softins commented 4 years ago

@storeilly it's not a solution, but it might give a useful data point if you could reduce your MTU by 8 bytes and Volker could then repeat the test with the standard server code. This will cause fragmentation to kick in slightly earlier, and might then keep the Windows client going, if indeed it can receive fragmented packets (does the Windows client correctly display the Default Server list? I imagine it must do).

To do this, on the fly:

# ifconfig eth0 mtu 1492 up

(replacing eth0 with the actual interface name, of course)

storeilly commented 4 years ago

I don't have time to do anything much with the server today, but I've amended the MTU (thank you):

ubuntu@ip-172-26-14-115:~$ ifconfig | grep -i MTU
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1492
lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536

and restarted server 10. The binary I was using last night is now in Dropbox: Jamulus-270a5c39

corrados commented 4 years ago

Not quite sure what was happening on that test (20200906-201952), but it looks the same as the previous test.

Sorry, my fault. The code I wrote yesterday was a bit too much of a "quick hack". It obviously does not work as expected, which I found out today when I tried it again. So I need some more time to do it correctly, and then we should repeat the test.

maallyn commented 4 years ago

Can I create my own test branch? Or will that harm the codebase? I do not plan to merge it into main.

corrados commented 4 years ago

Sorry, I forgot to answer your question. Yes, of course, you can do whatever changes you want on your private branch in your fork. It will not harm the codebase.

maallyn commented 4 years ago

Thank you

corrados commented 4 years ago

@storeilly and @softins I made good progress on my split_messages Git branch. The new protocol now seems to work. I have also implemented backward compatibility and added some security checks. Right now there are still a lot of debug messages in the code which have to be removed, and some cleanup and improvement of the code is still needed. But I am pretty sure that by the end of the weekend, at the latest, the new code should be ready for testing.

softins commented 4 years ago

That sounds good. Once I have details of the new protocol, I will update my Wireshark dissector to understand it.

storeilly commented 4 years ago

Excellent, well done!! Can't wait 👍

corrados commented 4 years ago

A first working version of the new protocol split code is finished and can be tested. I created a new tag for it: feature_protosplit (https://github.com/corrados/jamulus/releases/tag/feature_protosplit). I hope this time it works correctly.

corrados commented 4 years ago

@storeilly I have merged the code to the master branch now. It would be good if you could set up a Jamulus server with the new code so I could verify that the problem is solved now.

genesisproject2020 commented 4 years ago

@corrados I just compiled a server with the master branch code. (server name "SWE mt test" under Jazz)

storeilly commented 4 years ago

@storeilly I have merged the code to the master branch now. It would be good if you could set up a Jamulus server with the new code so I could verify that the problem is solved now.

@corrados That's up now on port 22134 with tcpdump monitoring. @softins I was tempted to edit the filter to include the local IP, but wasn't certain of the effect. The objective is to capture only the local port 22134 packets, as I was collecting data before the server had been started. That file is in Dropbox.

softins commented 4 years ago

I can have a look at anything tomorrow (Sun). Not available today, sorry.

corrados commented 4 years ago

That's up now on port 22134 with tcpdump monitoring

Good news: The problem is solved now :-). I just did the test on the "Jam mt 26" and could see the issue as expected since it is the old server version. Then I did exactly the same test on the new "Jam 1a3d2651 34" and the issue was gone. I could see all clients on the mixer board and I could send and receive chat text messages. So the split messages algorithm works fine.

storeilly commented 4 years ago

Excellent!!!! @brynalf

corrados commented 4 years ago

I just compiled a server with the master branch code. (server name "SWE mt test" under Jazz)

Thanks. I tried that out, too, and it seems it has very similar performance to storeilly's server.

WolfganP commented 4 years ago

Excellent work everyone. Scalability has grown solidly since the start of this effort. What will be the area of focus for performance optimization now? (if there is really anything known that may provide a huge jump in terms of performance or stability)

P.S.: Now the attack of the client drones on the server fortress may restart; how many will it withstand this round? May the jamforce be with you :-)

corrados commented 4 years ago

There is still room for improvement. On storeilly's server I could successfully run about 70 clients using stereo (I have not yet tried mono mode). It seems that the protocol management thread stalls first, while at the same time you can still get some useful audio back from the server. Maybe it is an issue of thread priorities or blocking threads. Some more investigation should follow...

genesisproject2020 commented 4 years ago

@corrados I just compiled a server with the master branch code. (server name "SWE mt test" under Jazz)

I have now added some scripts that can be started from a webpage to compile a new version of the code, start and stop tcpdump, and download the cap file directly. If any of you doing testing would like access to the server, just send me a message.

WolfganP commented 4 years ago

There is still room for improvement. On storeilly's server I could successfully run about 70 clients using stereo (I have not yet tried mono mode). It seems that the protocol management thread stalls first, while at the same time you can still get some useful audio back from the server. Maybe it is an issue of thread priorities or blocking threads. Some more investigation should follow...

It may also be an issue of OnTimer overrun (in the sense that all the audio processing (decompress + mix + compress) doesn't fit in the allotted time between timer "interrupts", with the async I/O packet threads also moving data in and out of the buffers/arrays). Maybe it would help to add a check on whether timeout() triggers on time? (as per https://doc.qt.io/qtforpython/PySide2/QtCore/QTimer.html#accuracy-and-timer-resolution)

Beyond the initial process profiling we started with, it may be worthwhile to instrument the critical Qt queues inside the app to measure queue length and job consumption.

kraney commented 4 years ago

Following up on the testing I mentioned in issue #599 here, rather than continuing to hijack that thread...

I set up an experiment where I created two pages of channel data, so that the decode thread can write to one page at the same time that the mix threads read from the other one. With that experiment, I was able to raise my stable client count from 65 to 80 on the same class of machine.

At 80, it seems like the decode thread reaches 90+% CPU usage for the one core it's on, and sound quality falls off rapidly as more clients join. It seems like the next beneficial change would be to break the decode into blocks similar to the way the mix thread is handled. I'll probably try that experiment tomorrow.
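
For readers who want to picture the two-page idea described above, here is a minimal, generic sketch (this is not the actual Jamulus channel code; `ChannelPages`, the frame size, and the channel count are invented for illustration):

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// One decoded frame per client channel (sizes purely illustrative).
struct ChannelFrame {
    std::array<int16_t, 128> samples{};
};

// Two "pages" of per-channel frames: the decode step writes the inactive
// page while the mix threads read the published one, so neither side has
// to hold a lock for the duration of a whole audio cycle.
class ChannelPages {
public:
    explicit ChannelPages(std::size_t numChannels) {
        pages[0].resize(numChannels);
        pages[1].resize(numChannels);
    }

    // Decode side: fill the page the mixers are NOT reading, then publish it.
    template <typename DecodeFn>
    void produce(DecodeFn&& decode) {
        const int writePage = 1 - readPage.load(std::memory_order_acquire);
        decode(pages[writePage]);
        readPage.store(writePage, std::memory_order_release);
    }

    // Mix side: grab the most recently published page.
    const std::vector<ChannelFrame>& consume() const {
        return pages[readPage.load(std::memory_order_acquire)];
    }

private:
    std::vector<ChannelFrame> pages[2];
    std::atomic<int> readPage{0};
};

int main() {
    ChannelPages buf(4); // pretend 4 connected clients

    // One decode/mix cycle: write fake samples, then read them back.
    buf.produce([](std::vector<ChannelFrame>& page) {
        for (auto& frame : page) frame.samples.fill(1);
    });
    std::printf("first sample of channel 0: %d\n", buf.consume()[0].samples[0]);
    return 0;
}
```

The point of the page flip is that the writer only publishes a page once it is completely filled, so the readers never see a half-written frame and neither side waits on the other.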

corrados commented 4 years ago

@brynalf It would be interesting if you could do some testing of the new code provided by kraney in his pull request #653. Just clone from his repo (https://github.com/kraney/jamulus.git) and checkout his concurrency_tweaks branch. Then we would have a direct comparison of the performance gain between the code currently on master and with the tweaks.

brynalf commented 4 years ago

I will do my best to squeeze that activity in today Saturday.

brynalf commented 4 years ago

I have now compared 3.5.11 to the kraney version. Main conclusion: no difference in the upper limit in my regular test environment, i.e. the server on a 16-core processor and emulated users on two 4-core processors, plus a sound quality probe on a Raspberry Pi with an audio interface.

I have verified that I am not comparing the same version to itself.

In my setup both the 3.5.11 and the kraney versions have an audio quality based limit of 85 users for buffer 128, mono-in/stereo-out, high quality, i.e. 657 kbps.

In my setup both the 3.5.11 and the kraney versions have an audio quality based limit of 45 users for buffer 64 and small network buffers enabled, mono-in/stereo-out, high quality, i.e. 900 kbps.

Processor core peaks never go above 70%.

brynalf commented 4 years ago

The emulated users are running 3.5.11 in all of the tests.

corrados commented 4 years ago

Thank you for testing.

Actually, these results were not expected. I knew that doing the decoding in one thread may be a bottleneck if you have a lot of clients connected. This is one thing which was improved by kraney, as far as I understood his code, and he reported a gain in performance. The open question now is: what is the limiting factor on your system with such a large number of CPU cores?

kraney commented 4 years ago

FWIW I also tried testing with 10 cores, and got exactly the same performance as with 8. I haven't yet found what the new bottleneck is.

WolfganP commented 4 years ago

Beyond the tweaking and the tests on scaling and the effects on audio plus observed CPU usage, I still think we're "divining" the root cause of hitting the limits rather than knowing for sure whether it's a timer overrun or some other interaction between the threads under heavy load.

In order to detect whether the timer overruns (i.e. the decode+mix+encode cycle not finishing before the timer triggers again), isn't it possible to log a counter of audio processing cycles that failed to complete? (i.e. at https://github.com/corrados/jamulus/blob/2fd8f8ae3f03ae5018d0fdb660518d08318dbb35/src/server.cpp#L183 if I interpreted the code correctly)
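
For illustration, a self-contained Qt sketch of such a counter (this is not a patch against server.cpp; the interval, frame period, and handler body are placeholders):

```cpp
#include <QCoreApplication>
#include <QDebug>
#include <QElapsedTimer>
#include <QTimer>

int main(int argc, char* argv[])
{
    QCoreApplication app(argc, argv);

    // Roughly one 128-sample frame at 48 kHz; values are placeholders.
    const qint64 framePeriodUs = 2667;
    qint64       overrunCount  = 0;

    QTimer timer;
    timer.setTimerType(Qt::PreciseTimer);
    timer.setInterval(3); // QTimer only accepts whole milliseconds

    QObject::connect(&timer, &QTimer::timeout, [&]() {
        QElapsedTimer cycle;
        cycle.start();

        // ... the real decode + mix + encode pass would run here ...

        if (cycle.nsecsElapsed() / 1000 > framePeriodUs) {
            ++overrunCount;
            qWarning() << "audio cycle overran its period, total overruns:" << overrunCount;
        }
    });

    timer.start();
    return app.exec();
}
```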

kraney commented 4 years ago

I ran perf on the code this weekend when the server was full of clients, and unsurprisingly Mix topped the list. There was no major smoking gun, although spinlock showed up rather high, which suggests a lot of time is lost trying to obtain a mutex; so there might be benefit in trying to make the big lock around decode more granular.
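
As a generic illustration of what "more granular" could mean (this is not the existing Jamulus locking; `Channel` and `decodeChannel` are stand-ins), a mutex per channel lets worker threads run in parallel as long as they touch different channels:

```cpp
#include <array>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

// Per-channel state, each with its own mutex. With one big lock around the
// whole decode pass, every worker contends on the same mutex; with a lock
// per channel, workers only contend when they touch the same channel.
struct Channel {
    std::mutex mtx;
    std::vector<float> decoded;
};

// Placeholder for the real per-channel OPUS decode.
void decodeChannel(Channel& ch) {
    std::lock_guard<std::mutex> guard(ch.mtx);
    ch.decoded.assign(128, 0.0f);
}

int main() {
    std::array<Channel, 4> channels; // pretend 4 connected clients

    // Two workers decode disjoint halves; they never block each other.
    std::thread worker1([&] { decodeChannel(channels[0]); decodeChannel(channels[1]); });
    std::thread worker2([&] { decodeChannel(channels[2]); decodeChannel(channels[3]); });
    worker1.join();
    worker2.join();

    std::printf("decoded %zu samples per channel\n", channels[0].decoded.size());
    return 0;
}
```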

pljones commented 4 years ago

although spinlock showed up rather high, which suggests that there's a lot of time lost trying to obtain a Mutex

Does that mean tasks were waiting on a Mutex doing nothing, or that the call to acquire the Mutex was using computation time heavily? Just checking -- I think you mean the latter.

kraney commented 4 years ago

A spinlock implies a thread is waiting for a lock without going to sleep, just using the CPU to wait. Going to sleep and waking up is expensive, so it's not necessarily a bad thing to do, but having it show up near the top of the profile results is kind of a red flag.

corrados commented 4 years ago

I have now applied some multithreading improvements from kraney (thank you!) to the Git master. Today I did a test where I connected 99 dummy clients using stereo over my LAN (i.e. no internet) from my laptop to my Linux desktop PC (i5 with 4 CPU cores, no hyperthreading), and on that Linux desktop PC I ran a normal Jamulus client, so the total number was 100 clients. I had perfect audio quality on that normal client. Here is a screenshot of that session: (screenshot attached: Bildschirmfoto von 2020-10-05 20-10-07neu)

corrados commented 4 years ago

With that experiment, I was able to raise my stable client count from 65 to 80 on the same class of machine. At 80, it seems like the decode thread reaches 90+% CPU usage for the one core it's on, and sound quality falls off rapidly as more clients join.

I assume you did this test on your 8-core CPU. Now we have to find out why I can successfully run 100 stereo clients on my 4-core CPU whereas this is not possible on your 8-core CPU.

What type of dummy clients are you using? For my test I am using a separate Linux laptop and I use a special client.cpp file where I do not call the OPUS encoding/decoding but only send random encoded OPUS bits to the server. Otherwise my laptop would not be able to run such a large number of dummy clients. Maybe the type of coded data the OPUS decoder at the server gets influences its CPU usage? But since I send random bits, I would imagine that the OPUS decoder must require even more CPU to decode this.

In my experiment the laptop and the PC were connected to a 100 Mbit switch via LAN cable, so I think I did not have any network limitation. Since the ping time is very short, all protocol messages get through in almost no time. Maybe if the ping is larger, the protocol messages take longer and you get more influence of the protocol messages on your audio performance.

Any other ideas?
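
For readers who want to picture the load-generation side, a bare-bones UDP sender that fires fixed-size random payloads at roughly the audio frame rate might look like the sketch below. It does not speak the Jamulus protocol (the approach described above patches the real client.cpp, so the protocol handling stays intact and only the OPUS payload is replaced); the address, port, packet size, and interval are invented for the example.

```cpp
#include <arpa/inet.h>
#include <chrono>
#include <cstdio>
#include <netinet/in.h>
#include <random>
#include <sys/socket.h>
#include <thread>
#include <unistd.h>
#include <vector>

int main() {
    const char*  host       = "127.0.0.1";  // placeholder server address
    const int    port       = 22124;        // placeholder port
    const size_t packetSize = 332;          // arbitrary payload size
    const auto   interval   = std::chrono::microseconds(2667); // ~one 128-sample frame

    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0) { std::perror("socket"); return 1; }

    sockaddr_in dest{};
    dest.sin_family = AF_INET;
    dest.sin_port   = htons(port);
    inet_pton(AF_INET, host, &dest.sin_addr);

    std::mt19937 rng(12345);
    std::vector<unsigned char> payload(packetSize);

    // Send random bytes at frame rate for a while, then stop.
    for (int i = 0; i < 10000; ++i) {
        for (auto& b : payload) b = static_cast<unsigned char>(rng());
        sendto(sock, payload.data(), payload.size(), 0,
               reinterpret_cast<sockaddr*>(&dest), sizeof(dest));
        std::this_thread::sleep_for(interval);
    }
    close(sock);
    return 0;
}
```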

WolfganP commented 4 years ago

What type of dummy clients are you using? For my test I am using a separate Linux laptop and I use a special client.cpp file where I do not call the OPUS encoding/decoding but only send random encoded OPUS bits to the server. Otherwise my laptop would not be able to run such a large number of dummy clients. Maybe the type of coded data the OPUS decoder at the server gets influences its CPU usage? But since I send random bits, I would imagine that the OPUS decoder must require even more CPU to decode this.

In my experiment the laptop and the PC were connected to a 100 Mbit switch via LAN cable, so I think I did not have any network limitation. Since the ping time is very short, all protocol messages get through in almost no time. Maybe if the ping is larger, the protocol messages take longer and you get more influence of the protocol messages on your audio performance.

Interesting... I've never thought about the client side when load testing a server. Your modified client is a smart way to test the server, but as it doesn't inject "real" audio, is it really possible to evaluate the audio quality?

Would it be a good mod, for a client used as a load test tool, to capture/process real audio once and then behave as multiple clients? (i.e. send 100 registrations, encode the "real" audio once and send 100 audio packets as if they were coming from 100 different clients, then process/output the mix back for the 1st client and discard the rest)

kraney commented 4 years ago

I built jamulus and jack into a docker container along with a config file that sets mono in / stereo out. Jack is set to dummy audio. I find I can run about 80 of these per 8-core cloud instance, so I launch a couple of those to act as clients. I run them in the same zone as the server, so the network should similarly not be an issue.

I connect with a normal client to check the audio quality. That one has about 13-20 ms network latency. Since the dummy audio is silent, the quality really only reflects how my own input sounds once it comes back from the server. I've tried to find a way to have jack read a wav file in a loop as input or something, but haven't had much luck. Still, this seems to produce decent results, and I have had the opportunity to compare against the behavior of a server handling about a dozen real clients; it seems to be reasonably similar.

Only speculating here, but random audio would not compress well at all, while silence should compress extremely well. Maybe it's more expensive to decompress when it has been more thoroughly compressed? Silence could be a pathological case; compression algorithms can be weird that way.

Another possibility is that it's due to running as a VM in the cloud compared to running on bare metal. Hypervisor overhead could account for some of the difference. Also this is on Intel Skylake architecture, which is not the newest generation.

Also, I learned over the weekend my "nice" setting was not taking effect. I fixed this and I get an incremental performance boost as a result. With that I get about 76 clients on 4 cores. I can get nearly to 100 with 6 cores. I don't believe I'm getting realtime priority either, but I have not addressed that yet. There might be another incremental gain to be had from that.
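
For reference, a minimal sketch of how a Linux thread can request realtime (SCHED_FIFO) scheduling, independent of whatever Jamulus itself does; the priority value is arbitrary, and the call typically only succeeds with root, CAP_SYS_NICE, or an rtprio entry in /etc/security/limits.conf:

```cpp
#include <cstdio>
#include <pthread.h>
#include <sched.h>

// Ask the kernel for FIFO realtime scheduling for the calling thread.
// The priority (1-99 on Linux) is just an example value.
bool requestRealtime(int priority = 40) {
    sched_param param{};
    param.sched_priority = priority;
    const int rc = pthread_setschedparam(pthread_self(), SCHED_FIFO, &param);
    if (rc != 0) {
        std::fprintf(stderr, "SCHED_FIFO not granted (error %d)\n", rc);
        return false;
    }
    return true;
}

int main() {
    if (requestRealtime())
        std::puts("running with realtime priority");
    return 0;
}
```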

If you want to experiment with how network latency affects the results, it should be possible with your existing setup using tc. https://bencane.com/2012/07/16/tc-adding-simulated-network-latency-to-your-linux-server/

corrados commented 4 years ago

I can get nearly to 100 with 6 cores.

Did you use the code on your branch, or did you perform that test using the latest Git master code (as I did)?

corrados commented 4 years ago

@storeilly I can see that you still have the test servers for multithreading testing running. Could you update one of these servers with the latest Git code on the master branch so that I can do some further testing? Maybe with the latest code we could increase the number of clients on the server hardware you are using.

kraney commented 4 years ago

@corrados I haven't yet had time to try the current master, so the numbers I mention are based on my branch.

corrados commented 4 years ago

Your modified client is a smart way to test the server, but as it doesn't inject "real" audio, is it really possible to evaluate the audio quality?

I think it is enough to check the audio quality of the single "real" client. So I set this client to solo and play my drums for testing. If the audio and latency are ok, I assume that this is the case for all the other dummy clients, too.

@corrados I haven't yet had time to try the current master, so the numbers I mention are based on my branch.

It would be a good comparison, if you have time, to use the code on master on exactly the same hardware to see if we have a loss in the number of clients compared to your branch.

maallyn commented 4 years ago

Will it help you all if I temporarily rent a 4 CPU dedicated server in the Linode Newark data center, which is the same data center as my newark-music.allyn.com, which is a two CPU dedicated server? This way, perhaps I can try to overwhelm it with clients that I automatically generate using the existing Jamulus client and a script to start them (using no graphics), but play music through them using audacity running through jack. If you folks think this is worthwhile, I am willing to pay for a few days' rental.

WolfganP commented 4 years ago

[...] with clients that I automatically generate using the existing Jamulus client and a script to start them (using no graphics), but play music through them using audacity running through jack. If you folks think this is worthwhile, I am willing to pay for a few days' rental.

I'll certainly be interested in the script :-) (I tried to generate clients with audio fed by a player via dummy/virtual audio cards, but could never get it working properly.) TIA @maallyn!

corrados commented 4 years ago

Will it help you all if I temporarily rent a 4 CPU dedicated server in the Linode Newark data center, which is the same data center as my newark-music.allyn.com

As far as I know, storeilly already has a 4 CPU server rented. So if he updates his server with the new code, that should be sufficient for testing. Thanks.

storeilly commented 4 years ago

@storeilly I can see that you still have the test servers for multithreading testing running. Could you update one of these servers with the latest Git code on the master branch so that I can do some further testing? Maybe with the latest code we could increase the number of clients on the server hardware you are using.

You got lucky, I had a few minutes :) Running now on port 22125

corrados commented 4 years ago

That's great, thanks :-). I am currently testing and it looks very good. I'll occupy 98 slots on that server right now. If anyone wants to join, just connect (but make sure you set yourself to solo, since the dummy clients make a lot of noise).

corrados commented 4 years ago

I just had a short session with "Andrew" on the server. We had 99 clients and our audio was good. A little bit higher jitter but still ok to play. So the code from kraney really seems to make a difference. Thanks again :-). @storeilly I was using stereo dummy clients. So I can confirm that on your server hardware 100 clients can connect successfully. Can you please tell us the hardware specs of your server?

trombonepizza commented 4 years ago

That was me

maallyn commented 4 years ago

What I did was to set up jackd, qjackctl, jamulus, and audacity on the machine used for the clients. Then, using a shell script, I started the clients using something like this:

```bash
#!/bin/bash

for i in {1..46}
do
    /home/maallyn/jamulus/Jamulus -j -n --connect 172.104.29.25 &
done
```

Once that is done, I have to go into jack and connect jack's source to each client, and then go into audacity and connect audacity to jack's sink.

You cannot have a user session on Linux with more than 50 or so clients due to the limits of Jack, so I have two user accounts on my server, each one kicking off 50 clients, for a total of 100.