mumble-voip / mumble

Mumble is an open-source, low-latency, high-quality voice chat application.
https://www.mumble.info

Ice performance #2779

Open HarpyWar opened 7 years ago

HarpyWar commented 7 years ago

~2000 user connections and ~1000 virtual Mumble servers are running on this machine with Debian 8, and the first CPU core is mostly at 100% usage while the other cores are not. This leads to a situation where some of the virtual servers lag (very high delay on any action inside a server, and voice interrupted in waves). How can this be fixed?

[screenshot: per-core CPU usage]

mkrautz commented 7 years ago

Hi, what version of Murmur are you running, and which database backend?

For the current versions of Murmur, all virtual server instances share the main thread for TCP/TLS traffic, but voice traffic has its own thread per-vserver. Because of this reality, it might help for you to shard your server into multiple murmurd instances. That way, you'll most likely get better use of the multiple cores.
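Sharding can be as simple as running several murmurd processes, each with its own config file, database, and ports, so each main thread can land on a different core. A minimal sketch; all paths and port numbers below are made up for illustration:

```ini
; /etc/murmur/shard1.ini (hypothetical path)
database=/var/lib/murmur/shard1.sqlite
port=64738
ice="tcp -h 127.0.0.1 -p 6502"

; /etc/murmur/shard2.ini (hypothetical path)
database=/var/lib/murmur/shard2.sqlite
port=64739
ice="tcp -h 127.0.0.1 -p 6503"
```

Each shard is then started with its own config, e.g. `murmurd -ini /etc/murmur/shard1.ini`, and the kernel is free to schedule the independent main threads across cores.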

HarpyWar commented 7 years ago

Murmur 1.2.16 with SQLite, on an SSD. The database is about 150 MB (logging to the database is disabled).

Are Ice connections also handled by the main thread? I have noticed that there is sometimes a high delay when a user switches between channels while an Ice script is working at the same time. Is it possible to give the voice threads higher priority than TCP/TLS?

mkrautz commented 7 years ago

Ice connections are accepted in a separate thread, but the execution of the RPC call is serialized into the main thread via a Qt event. What kind of Ice RPCs are you executing, and how often?
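The serialization described above can be illustrated with a toy single-queue model (plain Python, nothing Murmur- or Qt-specific): once RPC executions are queued into the one consumer that also handles TCP events, an unrelated event queued behind them has to wait out every blocking RPC.

```python
import time
from collections import deque

# Toy model of a single "main thread" event queue, where both TCP
# events (e.g. channel switches) and Ice RPC executions are serialized.
events = deque()
events.append(("tcp", "user switches channel"))
for _ in range(3):
    # each write RPC blocks ~50 ms here (standing in for a synchronous DB write)
    events.append(("rpc_write", "setConf via Ice"))
events.append(("tcp", "another channel switch"))

start = time.monotonic()
latencies = {}
while events:
    kind, desc = events.popleft()
    if kind == "rpc_write":
        time.sleep(0.05)  # simulated blocking work on the main thread
    latencies.setdefault(kind, []).append(time.monotonic() - start)

# The second TCP event waited behind all three RPC writes:
print(round(latencies["tcp"][1], 2))
```

The first channel switch is handled almost instantly, while the second one inherits the full latency of every RPC queued ahead of it, which matches the "switching channels lags while an Ice script runs" symptom.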

Murmur will try to make any voice threads SCHED_FIFO: https://github.com/mumble-voip/mumble/blob/91ebb8b0b5f81543eed5be1c99d9265b1376c633/src/murmur/Server.cpp#L266-L277
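The linked C++ code asks the kernel for the SCHED_FIFO realtime policy for the voice threads. A minimal Python sketch of the same request (this normally requires root or CAP_SYS_NICE, hence the fallback; it is only an illustration of the syscall, not Murmur code):

```python
import os

# Attempt to switch the calling thread to the SCHED_FIFO realtime
# policy, as Murmur does for its per-server voice threads. An
# unprivileged process is normally refused, so fall back gracefully.
try:
    os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(1))
    policy = "SCHED_FIFO"
except OSError:
    policy = "unchanged"  # expected without root/CAP_SYS_NICE

print(policy)
```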

mkrautz commented 7 years ago

BTW, why is the binary called murmurd-public? Is it a self-built Murmur? Or a backport from testing? I ask because it'd be helpful if you could get perf data from your Murmur instance.

Something like:

# perf record -F 99 -a -g -p $(pidof murmurd) -- sleep 30       # (as root)

(adjust sleep 30 to set how many seconds to perform the recording... you know best how to capture a representative sample of the performance problem...)

Obviously, I can only really use the recorded data if our systems have the same debug symbols, which probably won't be the case for a self-built Murmur.

HarpyWar commented 7 years ago

It seems the problem was on my server. After moving the VPS to another node the problem was gone. First CPU core usage is the same as in the first screenshot, but I do not see perceptible lag now. murmurd-public is just a renamed murmurd.

In relation to Ice: all virtual servers lag on any write RPC action (voice does not lag because, as you said, it is handled in different threads, but switching between channels does). For example, updating slots or the MOTD several times in sequence results in this situation. RPC read actions do not affect server performance. BUT if the Ice client runs on the same machine as the server, you will not see any delays! The higher the ping from the Ice client to the server, the longer the delay across the whole server instance on every RPC call (write actions only). The processor is not highly loaded at this time, so it should be an issue inside the Murmur code (or maybe the Ice library).

Kissaki commented 7 years ago

Do you want to close this then? Or is there still something open? Trying to improve your performance?

HarpyWar commented 7 years ago

So, can anyone investigate why a write RPC action makes the server lag only when the Ice client is connected remotely? Can it be improved?

mkrautz commented 7 years ago

@HarpyWar Since only write RPCs are affected, my guess is that fsync()/fdatasync() calls from SQLite might be causing the main thread to block.

mkrautz commented 7 years ago

Perhaps investigate by stracing the murmurd main thread pid?

mkrautz commented 7 years ago

I'll try to do a small patch that allows you to use SQLite with WAL. That should improve things.
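For context on why WAL should help: in SQLite's default rollback-journal mode every commit does journal bookkeeping and syncing on the writer's thread, while WAL appends to a log and defers the expensive work. A quick stdlib-only comparison sketch; the timings are machine-dependent, and the table and pragma choices here are illustrative, not Murmur's actual schema:

```python
import os
import sqlite3
import tempfile
import time

def timed_writes(journal_mode, n=50):
    """Time n small committed writes under a given journal mode."""
    path = os.path.join(tempfile.mkdtemp(), "test.sqlite")
    con = sqlite3.connect(path)
    con.execute("PRAGMA journal_mode=" + journal_mode)
    con.execute("PRAGMA synchronous=FULL")
    con.execute("CREATE TABLE config (k TEXT PRIMARY KEY, v TEXT)")
    start = time.monotonic()
    for i in range(n):
        # stands in for an Ice write RPC such as updating the MOTD
        con.execute("REPLACE INTO config VALUES ('welcometext', ?)", (str(i),))
        con.commit()
    con.close()
    return time.monotonic() - start

delete_t = timed_writes("DELETE")  # classic rollback journal
wal_t = timed_writes("WAL")        # write-ahead log
print(f"rollback journal: {delete_t:.3f}s, WAL: {wal_t:.3f}s")
```

On a spinning disk the rollback-journal run is usually markedly slower, which is exactly the per-commit cost a blocked main thread would be paying.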

HarpyWar commented 7 years ago

The described problem can be reproduced on any remote server, not only mine. Even on a fresh installation with a single virtual server:

  1. Connect from an Ice client to a server instance remotely.
  2. Select the first virtual server inside the Ice script and iterate 1000 times, updating the server slots or MOTD to any fixed value. (1000 times is just for the test; in reality even a single write action causes a short lag which can be noticeable.) You can also run 10-15 instances of the Ice script to emulate more connections, as on a real server.
  3. While the Ice script is running, connect to the server with your Mumble client and start switching quickly between channels. You will see noticeable delays from 1 second up to sometimes 10. Try reconnecting; it will take much longer than before.

I tried this on several different machines. Delays appear even if the ping from the Ice client to the server is minimal (3 ms). With a ping of 50 or 100 ms the delays can be very large. And there are no delays if the Ice client and the server both run on the same local computer.

mkrautz commented 7 years ago

Ah, sorry... Missed the part about local vs remote. Interesting data point, thanks!

mkrautz commented 7 years ago

WIP PR for using SQLite WAL: https://github.com/mumble-voip/mumble/pull/2794/files

...Though it seems your issue is something else. I won't have time to investigate the Ice problem tonight, sorry.

Kissaki commented 7 years ago

Which Ice functions are you talking about specifically?

Callbacks are blocking, so if a user were trying to enter a channel but was denied by the Ice script, the client has to wait for the server's answer as to whether entering was successful. The Mumble server in turn calls the remote Ice script, which responds, and only then does the Mumble server respond.

Are the delays higher than the expected transmission of data calling the remote script, its execution, and answering back? Could it be network or network scheduling delays?

mkrautz commented 7 years ago

Is the issue here that multiple RPC calls block forward progress for other events? I assume it would help if we were smarter about scheduling ExecEvents... For example, perhaps ExecEvents should have a lower priority than normal Qt events, to allow server work to be done in between invocations. (I don't know if that is how priorities in Qt's event handling work.)

I believe I've come to this conclusion before, maybe in another thread from @HarpyWar? :)
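The scheduling idea can be sketched with an ordinary priority queue (a toy model, not Qt's actual event system): if ExecEvents are enqueued at a lower priority, pending server events jump ahead of queued RPC executions even when they arrive later.

```python
import heapq
from itertools import count

# Toy priority event queue: lower number = higher priority.
# Normal server events get priority 0; Ice ExecEvents get 1, so
# pending server work always runs before queued RPC executions.
NORMAL, LOW = 0, 1
seq = count()  # tie-breaker keeps FIFO order within one priority level
queue = []

def post(priority, name):
    heapq.heappush(queue, (priority, next(seq), name))

post(LOW, "rpc-1")
post(LOW, "rpc-2")
post(NORMAL, "channel-switch")  # arrives after the RPCs...
post(LOW, "rpc-3")

order = [heapq.heappop(queue)[2] for _ in range(len(queue))]
print(order)  # → ['channel-switch', 'rpc-1', 'rpc-2', 'rpc-3']
```

The channel switch is served first despite being posted third, which is the interleaving behaviour being proposed.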

mkrautz commented 7 years ago

Here's a proposed fix for that issue: https://github.com/mumble-voip/mumble/pull/2802/files

Untested as of yet, but I believe it should improve things.

HarpyWar commented 7 years ago

@Kissaki you wrote

Callbacks are blocking, so if a user were trying to enter a channel but was denied by the Ice script, the client has to wait for the server's answer as to whether entering was successful. The Mumble server in turn calls the remote Ice script, which responds, and only then does the Mumble server respond.

But there is an opposite info in Murmur.ice:

Please note that all callbacks are done asynchronously; murmur does not wait for the callback to complete before continuing processing.

The issue described above is not related to callbacks, but now I am trying to add a callback (without any heavy code, even with empty callback functions) and it blocks and extremely slows down any user actions (Murmur 1.2.9 and earlier). So, is the information about async callbacks in Murmur.ice wrong? Callbacks are useless if they are synchronous, aren't they?
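The difference between the documented behavior and the observed one can be illustrated with a toy dispatcher (plain Python threads, not Ice): truly asynchronous, fire-and-forget dispatch costs the caller almost nothing, while synchronous dispatch makes the caller pay the full callback latency on every event.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_callback(event):
    time.sleep(0.1)  # simulates a remote callback over a slow link
    return event

pool = ThreadPoolExecutor(max_workers=4)

# Synchronous dispatch: the caller (think: the server's main thread)
# waits for the callback to finish before handling the next event.
start = time.monotonic()
slow_callback("userMoved")
sync_cost = time.monotonic() - start

# Asynchronous dispatch: the caller only pays the cost of queueing
# the call; a worker thread absorbs the latency.
start = time.monotonic()
future = pool.submit(slow_callback, "userMoved")
async_cost = time.monotonic() - start

print(f"sync dispatch: {sync_cost:.3f}s, async dispatch: {async_cost:.6f}s")
pool.shutdown(wait=True)
```

If Murmur's callback dispatch were behaving like the second pattern, as Murmur.ice documents, a slow or remote callback receiver should not stall user actions.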


BTW, I also tried to add a ServerCallback with Murmur 1.3.0~2389~gdde8173~snapshot on Windows, and the server crashes on a user move (or other action) without additional info in the logs:

<W>2017-05-20 13:49:00.246 1 => Added Ice ServerCallback 2436dafb-a608-445d-b537-18ebab9c73b9 -o:tcp -h 169.254.173.224 -p 1436 -t 5000:tcp -h 192.168.1.5 -p 1436 -t 5000
<W>2017-05-20 13:49:42.249 1 => <114:test(-1)> Moved test:114(-1) to Public #5[5:0]

UPDATE: Callbacks with Murmur 1.3.0 do not crash the server on Linux, only on Windows. So it should be a bug somewhere in the Murmur source code.

HarpyWar commented 7 years ago

@mkrautz I just compiled the latest Murmur on Linux with your pull request https://github.com/mumble-voip/mumble/pull/2802 and also without it, to compare results. Unfortunately, it does not change anything; at least I could not find any difference in my tests.

But my bad, the "local vs remote" comparison was incorrect testing. I tested on my local computer, which has a high-performance SSD, and compared with a remote VPS whose SSD has lower IO performance, plus several servers with HDDs. Because of that I thought the issue was due to the remote connection. But actually the issue also depends on disk capability and can be reproduced on a local computer without users (if murmur.sqlite is on an SSD it works faster; on an HDD, slower). It also depends on how many users are connected to the Murmur instance: more users, more lag. The same applies to Ice connections if you do not have a server with many users online for testing (more Ice connections with write operations, more lag). Your patch https://github.com/mumble-voip/mumble/pull/2794 with sqlite_wal=1 increased performance a little for Ice write operations, maybe two or three times, but not by much, so it does not fully resolve the issue :(

I wrote a simple testing utility for the Murmur Ice interface, which updates the slots of the first selected virtual server to a random value from multiple threads. Anyone can test a new Murmur 1.3.0 server with it: MurmurIceTest.zip

    Usage: MurmurIceTest.exe [murmur address] [ice port] [ice secret] {threads}
    Example: MurmurIceTest.exe 127.0.0.1 6502 youricesecret 100


The same issue occurs with MySQL. There are the same freezes as with SQLite, and even worse, although the MySQL disk usage is very small compared to SQLite. With a local MySQL server it is mostly OK, but with a remote MySQL server switching between channels can be extremely slow (even with 1-2 threads in the test utility you can feel these annoying freezes), up to complete server unavailability, when users cannot connect and get the error "Server is not responding to TCP pings".
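A back-of-the-envelope model shows why a remote database hurts so much, assuming (this is an assumption, not a measurement) that each write RPC performs a few database round trips synchronously on the main thread:

```python
# Hypothetical numbers for illustration only: if every write RPC costs
# `round_trips` network round trips to the database, each paid on the
# main thread, the main loop is stalled for roughly
# writes * round_trips * rtt in total.
def stall_seconds(rtt_ms, round_trips, writes):
    return writes * round_trips * rtt_ms / 1000.0

local = stall_seconds(rtt_ms=0.1, round_trips=3, writes=100)   # same-host MySQL
remote = stall_seconds(rtt_ms=50, round_trips=3, writes=100)   # remote MySQL
print(f"local: {local:.2f}s stalled, remote: {remote:.2f}s stalled")
```

With these assumed numbers the local case stalls for 0.03 s but the remote case for 15 s, easily long enough to trip the "Server is not responding to TCP pings" check even though CPU and disk are nearly idle.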

HarpyWar commented 7 years ago

After playing around with the Ice callbacks issue, I found the following problems. I also wrote an additional utility to test Ice callbacks, if someone wants to reproduce this. The program handles all callbacks of the first virtual server and, when an event is raised, displays it in the console: MurmurIceCallbackTest.zip

    Usage: MurmurIceCallbackTest.exe [ice address] [ice callback address] [ice port] [ice secret] [murmur version 1.2.9|1.3.0]


  1. Murmur 1.3.0 running on Windows 10; run the utility on the same computer (any valid Ice client callback address: 0.0.0.0 or 127.0.0.1). Murmur.exe crashes on any user action (connect, join channel, etc.).

    MurmurIceCallbackTest.exe 127.0.0.1 127.0.0.1 6502 myicesecret 1.3.0

    I tried running another Ice client, Mumo, on Linux to connect to the Murmur 1.3.0 running on Windows. But Mumo could not connect and hung, and the Murmur server did not crash.

  2. Murmur 1.2.9 running on Windows 10; run the utility on the same computer, Ice client callback address 127.0.0.1. Sometimes the server freezes for up to 5 seconds on a user action, but normally it starts and works without freezes. If the utility is force-closed, there is always a delay of up to 5 seconds when switching channels.

    MurmurIceCallbackTest.exe 127.0.0.1 127.0.0.1 6502 myicesecret 1.2.9
  3. Murmur 1.2.9 running on Windows 10; run the utility on the same computer, Ice client callback address 0.0.0.0. On any user action the server freezes for up to 5 seconds, or the user is even disconnected from the server with the error "Server is not responding to TCP pings". Only very seldom can you switch channels without freezes. This is what I was telling @Kissaki about in my message above.

    MurmurIceCallbackTest.exe 127.0.0.1 0.0.0.0 6502 myicesecret 1.2.9
  4. Murmur 1.3.0 running on a remote Linux machine, Ice client callback address 0.0.0.0. Works fine, at least on an empty server. If the utility is force-closed, there is always a delay of up to 10 seconds when switching channels.

    MurmurIceCallbackTest.exe x.x.x.x 0.0.0.0 6502 myicesecret 1.3.0
  5. If no timeout -t is specified in the Ice client proxy string for communicator.createObjectAdapterWithEndpoints, the Murmur server can freeze forever with the error "Server is not responding to TCP pings" (in the utility I set a timeout of 1000). I tried this on Linux, and in this case the server process can only be killed with kill -KILL murmurd.

  6. I also found that it is not possible to add an Ice callback when connecting from a machine behind a router with NAT (Ice client callback 0.0.0.0), with an Ice client not only on Windows but also on Linux. I rented a VPS at Hetzner which has no external IP address. There I installed Mumo, added callback_host = 0.0.0.0 to mumo.ini, and ran it to connect to my home computer where Murmur is running. On the first user action the server freezes for up to 5 seconds, and in murmur.log I see that the callback failed:

    <W>2017-05-22 01:01:23.596 1 => Added Ice ServerCallback 2a075bc1-f7e0-45c8-b543-87e2086e8c6e -o -e 1.0:tcp -h 169.254.173.224 -p 1771 -t 1000:tcp -h 192.168.1.5 -p 1771 -t 1000
    <W>2017-05-22 01:01:28.882 1 => <6:tester(-1)> Moved tester:6(-1) to AFK[6:0]
    Ice ServerCallback 2a075bc1-f7e0-45c8-b543-87e2086e8c6e -o -e 1.0:tcp -h 169.254.173.224 -p 1771 -t 1000:tcp -h 192.168.1.5 -p 1771 -t 1000 failed
    Removed Ice ServerCallback 2a075bc1-f7e0-45c8-b543-87e2086e8c6e -o -e 1.0:tcp -h 169.254.173.224 -p 1771 -t 1000:tcp -h 192.168.1.5 -p 1771 -t 1000

Conclusion

Issue 1 is a bug in Murmur on Windows.

Issues 2-5 should be related only to the Windows Ice implementation, or to a buggy .NET slice implementation. But is there any chance that the bottleneck is in Murmur, and that it does not handle calls from .NET code properly? For example, like issue https://github.com/mumble-voip/mumble/issues/1874 (which was successfully fixed). I could not reproduce these lags with Mumo on Linux. Mumo works mostly perfectly, but there are also small lag spikes on a real server with users (tested with one enabled deaftoafk plugin). I checked my .NET code and it does exactly the same things to add callbacks as Mumo does.

Issue 6 relates to any Ice client that uses callbacks. Is it possible to bypass NAT? Maybe a specific Ice client configuration is needed?

HarpyWar commented 4 years ago

The issue with delays when switching channels, and disconnects with the error "Server is not responding to TCP pings", still exists when executing actions through Ice (Debian on a dedicated server with the latest compiled Murmur). Is Ice deprecated, and is it better to use gRPC?

Krzmbrzl commented 4 years ago

I don't have a clue about ice, so there's not much I can say about the problems with it.

Is Ice deprecated and it's better to use gRPC?

In 1.3 gRPC is not enabled by default, as the implementation is still considered experimental. However, #3947 is overhauling the gRPC interface, so my guess is that it'll be included in 1.4.

HarpyWar commented 4 years ago

Does gRPC use the same kind of RPC calls as Ice, blocking the main thread?

Krzmbrzl commented 4 years ago

No clue xD

/cc @McKayJT @Kissaki

McKayJT commented 4 years ago

gRPC attempts not to block the main thread. The current implementation does this by calling the main event loop manually during an action that would need to block, but defers processing of new events until completion. My implementation, which is still in progress, does this by using the Boost fibers library to prevent blocking the thread.

In both implementations the actual work is done in the main thread, except for the network communications. The current design is very similar to Ice in that it uses RPCExec events to force the implementation code to run in the main thread. The fiber-based design injects itself into the main thread event loop using a timer, calling the gRPC-related code after all the current events have been processed.
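The timer-injection approach can be sketched with asyncio (a toy model, not the actual Murmur/gRPC code): RPC work is queued, and a recurring timer callback drains it on the main loop only after already-scheduled events have run, instead of blocking mid-event.

```python
import asyncio

# Toy model of the timer-injection idea: gRPC work sits in a queue
# and a recurring "timer" task drains it on the main event loop,
# after ordinary server events that are already scheduled.
async def main():
    grpc_work = asyncio.Queue()
    handled = []

    async def drain_grpc():
        while True:
            await asyncio.sleep(0.01)  # the recurring timer
            while not grpc_work.empty():
                handled.append(grpc_work.get_nowait())

    drainer = asyncio.create_task(drain_grpc())
    await grpc_work.put("SetConfig")     # hypothetical RPC names
    await grpc_work.put("SendMessage")
    handled.append("voice/tcp event")    # ordinary server work runs first
    await asyncio.sleep(0.05)            # let the timer fire
    drainer.cancel()
    return handled

result = asyncio.run(main())
print(result)  # → ['voice/tcp event', 'SetConfig', 'SendMessage']
```

The ordinary server event is handled before either queued RPC, even though the RPCs were submitted first, which is the non-blocking property both gRPC designs aim for.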