Johni0702 opened 5 years ago
This is a heroic task. I've been idly wondering whether Mumble or WebRTC is the better choice for the future, and here you are gluing them together. This is amazing, hats off to you.
@Johni0702 if I understand this correctly you have started working on this feature in a branch of yours. What's the status of it?
@Krzmbrzl All I have done is listed under the "Proof of concept" section, i.e. a mumble-web version which uses WebRTC instead of UDPTunnel messages and a proxy which converts between WebSocket+WebRTC and TCP+UDPTunnel (assuming it's running on the same machine as the server). IIRC, at the time of writing the issue, both of those were working as well as the normal mumble-web version (though they haven't really had much testing at all).
I have not touched Murmur because I was not (and am still not) particularly familiar with C++, especially given how much network-facing code would be involved.
Okay thanks for the update :+1:
@Krzmbrzl Also interesting, grumble seems to be capable (or may have been capable) of using mumble-web without the proxy: https://github.com/mumble-voip/grumble/issues/33
This is really an interesting project. I planned to set it up for a friend who does not like to fiddle around with installing things, but I never tried it, partly due to the still-missing config support in grumble. See: https://github.com/mumble-voip/grumble/pull/26
Update: I just read that there seems to be a difference between the HTML5 version of mumble-web and a new webrtc branch:
Note: This WebRTC branch is not backwards compatible with the current release, i.e. it expects the server/proxy to support WebRTC which neither websockify nor Grumble do.
WebRTC is interesting because it has a better echo-cancel feature than the old one in Mumble based on Speex. With PulseAudio you can test that feature with
$ pactl load-module module-echo-cancel
which creates a sink and source in PA. After testing it, it was as good as Mumble 1.1.x with 6 channels, output to Center (positional audio disabled) and echo cancellation enabled (at least with ASIO).
We need the echo-cancel feature of WebRTC because it works with more than one channel. By the way, as soon as the output goes to stereo speakers, Mumble's echo cancellation gets worse.
https://forum.freifunk-muensterland.de/t/mumble-script/3695/14
WebRTC is interesting because it has a better echo-cancel feature than the old one in Mumble based on Speex.
Not necessarily true. As it turned out, the echo cancellation in Mumble was broken; it is going to be fixed by #4167
Echo cancellation (as in audio processing) is (in theory) independent of WebRTC (as in network protocol). WebRTC does not provide echo cancellation. Of course a Mumble web client that runs in Chrome can benefit from Chrome's echo cancellation and noise suppression (by the webrtc library), but it shouldn't be the reason to implement WebRTC (the protocol).
The reason to implement WebRTC: there is already a working web client (https://github.com/Johni0702/mumble-web) and it would be nice to have WebRTC support out-of-the-box without a special proxy in-between.
This is an extension to #2131 (WebSocket for the control channel) in an effort to properly support purely browser-based Mumble clients.
Motivation
While WebSocket support alone would already allow for fully functional browser clients by tunneling voice over the control channel, the resulting implementation in the client is a huge mess (including but not limited to compiling the codec libraries to JavaScript, abusing ScriptProcessorNodes and most importantly hoping for the scheduler and GC to not get in the way). While the first of those just results in large JS blobs and bad performance, the latter two can (and will, if the system is under load) cause the audio to be randomly interrupted or delayed by as much as multiple seconds. It mainly boils down to the fact that handling real-time stuff in mostly pure JS on an ordinary web page is not a good idea.
Using WebRTC solves all of the above by pushing all of the real-time data handling to the browser. If all clients use the Opus codec, this can even be done in a backwards compatible way.
Overview of the relevant WebRTC internals
When talking about WebRTC, the part that is relevant to Mumble is mostly the protocols used and less the JavaScript APIs that the term usually refers to. In this particular case those are STUN, ICE (not to be confused with the RPC lib used in Murmur, that's a completely unrelated thing), DTLS, SRTP and RTP (layered in that order).
RTP
https://tools.ietf.org/html/rfc3550
https://tools.ietf.org/html/rfc7587
This is the uppermost layer used in WebRTC when transmitting real-time data (e.g. audio). RTP packets are usually sent over UDP (with some additional layers in between) and are in many aspects similar to the voice packets used by Mumble. Each data source is identified by its SSRC (Synchronization SouRCe), similar to the session id in Mumble. A packet carries an SSRC, a timestamp (unit depends on codec; for Opus it's in samples, i.e. 48k/s), a sequence number, the actual data (e.g. audio) and other, less relevant data.
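As a rough sketch (not the actual wire format; see RFC 3550 for the real header layout), the RTP fields relevant to this comparison could be modelled like this:

```rust
// Sketch of the RTP packet fields mentioned above (illustrative only).
struct RtpPacket {
    /// Synchronization source: identifies the stream, roughly comparable
    /// to Mumble's session id.
    ssrc: u32,
    /// For Opus this counts samples at 48 kHz (RFC 7587).
    timestamp: u32,
    /// Increments by one per packet, wraps at 2^16.
    sequence_number: u16,
    /// Marker bit, e.g. set on the first packet after a discontinuity.
    marker: bool,
    /// The encoded payload, e.g. a single Opus frame.
    payload: Vec<u8>,
}
```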
RTP has a companion protocol named RTCP which is used for transmitting metadata about SSRCs and reporting packet loss among other things but it's mostly irrelevant to Mumble (until Mumble does video).
SRTP and DTLS
https://tools.ietf.org/html/rfc3711
https://tools.ietf.org/html/rfc5764
The SRTP layer provides encryption and authentication for RTP packets. DTLS (TLS for UDP) is only used for the handshake and to establish key material for SRTP. One important conceptual difference between SRTP and Mumble's UDP crypto is that SRTP derives the key used for a particular packet from its SSRC and its sequence number, whereas Mumble uses only the index of the packet, which depends on neither the source of the packet nor the sequence number in that voice transmission. (Small detail: since the sequence number in RTP packets is only 16 bits, the SRTP implementation maintains an internal roll-over counter which is also used in determining the key used.)
A result of that difference is that some cryptographic information (e.g. replay list) needs to be retained for each SSRC for the whole session, so the number of used SSRCs should be kept low. (Another reason for keeping the number of used SSRCs low is that WebRTC mandates that MediaStreams cannot be removed, only set to inactive.)
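For illustration, the per-packet index that SRTP uses for key derivation and replay protection is built from the 16-bit RTP sequence number plus a roll-over counter, and that counter is exactly the per-SSRC state that has to be kept around for the whole session. A minimal sketch of RFC 3711's index construction:

```rust
/// SRTP packet index as defined in RFC 3711: i = 2^16 * ROC + SEQ.
/// The roll-over counter (ROC) is per-SSRC state that has to be tracked
/// by both endpoints for the lifetime of the session.
fn srtp_packet_index(roc: u32, seq: u16) -> u64 {
    ((roc as u64) << 16) | seq as u64
}
```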
ICE and STUN
These are used to establish a connection between two peers through NATs. In the Mumble case, one of the two peers is the server which needs to be publicly reachable anyway, so NATs shouldn't be much of a problem.
Proposed protocol changes
Unsurprisingly a few extensions to the Mumble protocol are required to use WebRTC as the voice transport.
To indicate support for WebRTC, a new field is added to the Authenticate message (sketched below). If the server supports WebRTC and the client indicates its support with that flag, then the server must send initialization data for the WebRTC connection (similar to the CryptSetup message) before completing the connection via a ServerSync and before sending any UserState packets. This allows WebRTC-only clients to recognize old servers which do not support WebRTC and show an error message to the user.
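A minimal sketch of the new Authenticate flag, written as a Rust struct rather than the actual protobuf definition (the field name is illustrative, not necessarily the one used in the POC):

```rust
// Illustrative only: the existing Authenticate message plus a flag with
// which the client announces WebRTC support.
struct Authenticate {
    // ... existing fields (username, password, tokens, ...) ...
    /// Set by the client to request WebRTC-based voice transport
    /// instead of UDP/UDPTunnel.
    webrtc: bool,
}
```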
SDP vs bare minimum
When building an application on top of WebRTC, one usually passes SDP messages between participating peers. From the application's point of view, SDP messages are just blobs of data which are used by WebRTC to negotiate transport, codec and other settings. However, IMO the better approach for Mumble is to only pass the minimally required amount of information (i.e. the fingerprint of the DTLS server and some data for ICE) and let the client construct the SDP itself (if it even needs to). The main reasons I'm against passing whole SDP messages are:
The proposed initialization data referred to in the previous segment would therefore look as follows:
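A rough sketch of what this WebRTC initialization message might contain, with field names that are mine and purely illustrative (the actual definition lives in the POC's .proto files):

```rust
// Hypothetical shape of the server's WebRTC initialization data:
// just enough for the client to build its own SDP if it needs to.
struct WebRtc {
    /// ICE username fragment and password of the server's ICE agent.
    ice_ufrag: String,
    ice_pwd: String,
    /// Hash algorithm and fingerprint of the server's DTLS certificate,
    /// e.g. ("sha-256", <32 bytes>), used to authenticate the DTLS handshake.
    dtls_fingerprint_algorithm: String,
    dtls_fingerprint: Vec<u8>,
}
```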
Additionally (and this is the case with both approaches), ICE candidates need to be exchanged between client and server (these contain addresses and ports for the client and server to find each other at and to use for RTP passing):
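The candidate exchange could then be a simple message wrapping a candidate string, sent in either direction (again, the naming is purely illustrative):

```rust
// Hypothetical ICE candidate message; the content is a candidate line as
// defined by ICE, e.g. "candidate:0 1 UDP 2122252543 198.51.100.7 46532 typ host".
struct IceCandidate {
    content: String,
}
```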
Mapping SSRCs to users
As opposed to session ids, the total number of SSRCs used should be kept low. As such, a new field should be added to the UserState message which contains the SSRC used for the user.
Alternatively, the requirements on the session id could be changed to conform to the requirements on SSRC values. I'm not sure whether there are any other requirements which would conflict with the SSRC ones, so I've kept them separate for now. (I've also kept them separate for a practical reason: it doesn't require you to do session id re-mapping when building a proxy.)
SSRC 0 should be reserved for the client to be used when sending audio to the server (server loopback would then return on the SSRC indicated in the client's own UserState message). Note: the proof-of-concept implementation currently uses a random SSRC for server-bound audio, which works as well until it randomly chooses a low SSRC and collides with one of the other users' SSRCs (it was just easier to implement).
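As a rough illustration of this scheme (not part of the proposal itself), server-side SSRC bookkeeping could look like the following: every session is assigned a unique non-zero SSRC that is announced via its UserState, while SSRC 0 stays reserved for client-to-server audio.

```rust
use std::collections::HashMap;

/// SSRC reserved for audio the client sends to the server.
const CLIENT_TO_SERVER_SSRC: u32 = 0;

struct SsrcAllocator {
    by_session: HashMap<u32, u32>, // session id -> client-bound SSRC
    next: u32,
}

impl SsrcAllocator {
    fn new() -> Self {
        Self { by_session: HashMap::new(), next: 1 }
    }

    /// SSRC on which other clients will receive this session's audio,
    /// allocating a new one the first time a session is seen.
    fn ssrc_for(&mut self, session: u32) -> u32 {
        if let Some(&ssrc) = self.by_session.get(&session) {
            return ssrc;
        }
        let ssrc = self.next;
        self.next += 1;
        self.by_session.insert(session, ssrc);
        ssrc
    }
}
```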
Indicating talking state
Mumble voice packets contain a target ID field, i.e. for client-bound packets: normal, whisper, shout, loopback; and for server-bound packets: normal, VoiceTarget, loopback. There is no equivalent in RTP short of allocating multiple SSRCs for each user. Additionally, RTP has no equivalent for the last packet marker and there's no good way to determine whether a user is currently talking via the JavaScript API. The solution here is to move talking state indication into the control protocol (at least for WebRTC clients) via a new TalkingState message (sketched below).
When sent by the server, these are purely for display in the UI and should not influence audio processing in any way. When sent by the client, these indicate its intent to start/stop talking and the server should subsequently start/stop handling the RTP packets from the client (RTP packets might even be sent when the user isn't talking, though the client can make sure those contain silence only). A delay of these messages might result in some packets missing at the beginning of a user's own voice transmission; however, this shouldn't be much of a problem as those packets would probably have been lost anyway if Mumble's Voice over UDP was used instead.
Note that this requires the server to track the current talking state of each user, which it currently doesn't do (afaik). Doing so shouldn't require much processing power though and is required anyway (see the following point about Mumble-to-RTP translation).
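A sketch of the talking-state bookkeeping this implies on the server side; the types and names are illustrative and not taken from the proposal's .proto files:

```rust
// Target variants mirror the target IDs of Mumble voice packets.
enum TalkingTarget {
    Normal,
    Whisper,         // client-bound: whisper/shout to the receiving user
    VoiceTarget(u8), // server-bound: a previously configured VoiceTarget
    Loopback,
}

// Per-user state the server tracks for WebRTC clients.
struct TalkingState {
    /// None while the user is silent; Some(target) while they are talking.
    /// Updated from the client's start/stop messages and echoed to other
    /// clients purely for UI purposes.
    target: Option<TalkingTarget>,
    /// RTP sequence number at which the current transmission started,
    /// needed when translating the stream back into Mumble voice packets.
    start_rtp_seq: u16,
}
```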
Some further details
Translating Mumble Voice transmissions to RTP streams
An RTP stream is more persistent than a voice transmission in Mumble in the sense that a single transmission lasts from pressing the PTT key to releasing it, whereas an RTP stream will last from the initial connection of the user until they disconnect. This is required as there is no quick way to add new RTP streams on demand whenever a user starts talking without using the control channel and introducing delays or loss of packets.
As such, multiple consecutive Mumble voice transmissions by the same user need to be stitched one after the other into the same RTP stream. The only thing to watch out for is that no huge jumps in RTP sequence number occur as that can cause the crypto to get out of sync. Other than that, this is rather easy to implement by just keeping an RTP sequence number offset and adding the Mumble sequence number on top. This will also transparently pass on any jitter in the packets. Note that Mumble's sequence numbers do not have to start at 0 though, so an additional offset needs to be kept to prevent huge jumps in resulting RTP sequence numbers.
Since the RTP timestamp for Opus is just the amount of samples passed, it can simply be calculated as 480 * rtp_seq_num. If the marker bit in the RTP header is set for the first RTP packet in each transmission, the client will deal alright with the discontinuity.
For a POC implementation in Rust, see here.
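A condensed sketch of this stitching, assuming one Opus frame per Mumble packet; this is not the linked POC code, just the scheme described above with illustrative names:

```rust
// Stitches consecutive Mumble voice transmissions into one outgoing RTP stream.
struct OutgoingRtpState {
    /// Where in the RTP sequence space the current transmission starts.
    rtp_seq_base: u16,
    /// Mumble sequence number of the first packet of the current transmission.
    mumble_seq_base: u64,
    /// Set until the first packet of the transmission has been sent.
    pending_marker: bool,
}

impl OutgoingRtpState {
    /// Called when a new Mumble transmission begins; continues the RTP
    /// stream right after the previous one to avoid big sequence jumps.
    fn begin_transmission(&mut self, last_rtp_seq: u16, first_mumble_seq: u64) {
        self.rtp_seq_base = last_rtp_seq.wrapping_add(1);
        self.mumble_seq_base = first_mumble_seq;
        self.pending_marker = true;
    }

    /// Maps a Mumble voice packet to (RTP sequence number, RTP timestamp, marker).
    fn map_packet(&mut self, mumble_seq: u64) -> (u16, u32, bool) {
        let rtp_seq = self
            .rtp_seq_base
            .wrapping_add((mumble_seq - self.mumble_seq_base) as u16);
        // RTP timestamp for Opus counts samples at 48 kHz; as described above
        // it is derived directly from the RTP sequence number.
        let rtp_timestamp = 480u32 * rtp_seq as u32;
        let marker = self.pending_marker;
        self.pending_marker = false;
        (rtp_seq, rtp_timestamp, marker)
    }
}
```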
Translating RTP streams to Mumble Voice transmissions
This is far simpler than the other way around. The server merely has to store the current talking state and the RTP offset when the user started talking (TalkingState message) and can then convert from RTP to Mumble as one would expect.
For a POC implementation in Rust, see here.
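Sketched, and again not the linked POC code: once a TalkingState message has marked the start of a transmission, Mumble's per-transmission sequence number is just the RTP sequence number relative to that starting point.

```rust
/// Maps an incoming RTP sequence number to a Mumble voice-packet sequence
/// number, given the RTP sequence number stored when the user started talking.
fn mumble_seq_for(rtp_seq: u16, start_rtp_seq: u16) -> u64 {
    rtp_seq.wrapping_sub(start_rtp_seq) as u64
}
```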
Positional audio
As far as I am aware, WebRTC does not support positional audio. While RTP provides support for extending its header, WebRTC only supports a specific set of header extensions and none of them provides anything like positional audio. So for now, positional audio will not be supported.
Multiple voice streams
While multiple outgoing streams for one client are technically supported by the Mumble protocol, I see no use case aside from bots which whisper different things to different people, and those probably shouldn't be put in the browser. Since such bots continue to exist, this must be kept in mind when implementing the transmission tracking on the server (it should be tracked per target+user, not just per user). RTP does not support multiple streams for one SSRC, and as such only one stream at a time can be received per user and only one at a time can be sent by the client (I believe this matches the behavior of the native Mumble client, not entirely sure though).
Other codecs
It might be possible to support other codecs like CELT and Speex if the browser has support for them (RTP can support different codecs). I haven't yet looked into that though.
The POC only supports Opus and always assigns it the RTP payload type 97. If multiple codecs were to be supported, the proper way to do so would probably involve indicating support of and assigning a payload type for each codec in the WebRTC message, or using the current mechanism of indicating codec support and defining fixed payload types for each codec (as is done with Mumble's UDP voice protocol).
Proof of concept
Mumble/TLS/TCP to WebRTC/WebSocket/TLS proxy: https://github.com/Johni0702/mumble-web-proxy
WebRTC support in mumble-web: https://github.com/Johni0702/mumble-web/tree/webrtc
Lib used by mumble-web (the WebRTC happens here): https://github.com/Johni0702/mumble-client/tree/webrtc
Demo: https://voice.johni0702.de/webrtc/?address=voice.johni0702.de&port=443/demo
Also be aware that the comments in the .proto files (especially the ones about SDP) used in the POC might be out of date.