Closed gkrol-katmai closed 7 months ago
Do you know how to check the byte size of the BIO_s_mem?
BtW thanks for the report. Don't hesitate to write that PR and we will investigate, or perhaps we'll just revert the offending PR if we have no idea of how it works.
I tested with Valgrind / Massif, and I could see that it was OpenSSL memory that was growing. I did not manage to get a sensible number of the mem BIO size from the worker itself, but I hacked a printf into the OpenSSL code directly and I could see the buffer growing and growing.
This is what I put in bss_mem.c:mem_write, after blen = bbm->readp->length;
if (blen > 0) printf("[mem_write] %p blen: %d + %d\n", b, blen, inl);
I then verified that the pointer I was seeing corresponded to sslBioToNetwork
The Mediasoup worker build process is really nice btw, and also easy to use the built worker.
I apologise! I got the openssl documentation wrong. @ibc let me know if you want me to do a PR reverting to the old behaviour.
@pnts-se don't worry, I'm doing it. Thanks a lot.
PR https://github.com/versatica/mediasoup/pull/1342 merged. Thanks a lot.
Thanks for responding to this so quickly!
@pnts-se @gkrol-katmai, I'm investigating the DTLS MTU issue and I have some questions or doubts (not sure). Could you please participate if you wish? I'm commenting in the original issue:
https://github.com/versatica/mediasoup/issues/1100#issuecomment-1959002237
Wow, I have a working PR but I cannot beliete it!
@pnts-se @gkrol-katmai @ggarber can you please take a look to it? https://github.com/versatica/mediasoup/pull/1343
but I hacked a printf into the OpenSSL code directly and I could see the buffer growing and growing.
This is what I put in bss_mem.c:mem_write, after
blen = bbm->readp->length;
if (blen > 0) printf("[mem_write] %p blen: %d + %d\n", b, blen, inl);
I then verified that the pointer I was seeing corresponded to sslBioToNetwork
I've done this (in the reverted PR that caused the leak) and I only gets those prints at the beginning during the DTLS handshake , then no more despite I send and receive DataChannel messages. Do I miss something?
I don't know anything about OpenSSL, but I could get the problem to go away by placing (void)BIO_reset(this->sslBioToNetwork); at the end of DtlsTransport::SendDtlsData. However, I do have some doubts on what is going on with the callback, as the documentation says it's called twice for every operation, and perhaps we're sending the data twice now?
Here is very clear and I've verified that, during a write operation, the callback is first called with oper == BIO_CB_WRITE
(there we call dtlsTransport->SendDtlsData()
, and after it the same callback is called with oper == (BIO_CB_RETURN | BIO_CB_WRITE)
so we don't enter the condition and don't send any duplicate data. This was ok.
@ibc if you remove the if
do you get the messages with a zero blen for the sslBioToNetwork BIO?
If the problem doesn't occur for you, I'd be very curious where in this case that buffer is emptied. I could not find any location in either the OpenSSL or the Mediasoup code that would do that. I believe you either need to do a real read or you need to reset.
Good to know that this wasn't sending data twice!
if you remove the
if
do you get the messages with a zero blen for the sslBioToNetwork BIO?
Nope.
This is "full" logs with some extra logs I've added in DtlsTransport
. Scenario is that I run mediasoup (immediately some of your logs are shown), then a client joins a room, so there are more of your logs, then I send a DataChannel message (@bot laskdjklajdkas
) via the mediasoup-demo chat input, so it's sent to the server and then sent back to same client. Your logs don't show up in that case:
And those are same logs if I call BIO_reset(this->sslBioToNetwork)
at the end of DtlsTransport::SendDtlsData()
method:
In next comment I will only show your [mem_write]
logs affecting sslBioToNetwork
.
mem_write logs with point this->sslBioToNetwork
without calling BIO_reset(this->sslBioToNetwork)
:
**** [mem_write] 0x60000035c240 blen: 0 + 1349 +0ms
**** [mem_write] 0x60000035c240 blen: 1349 + 1349 +0ms
**** [mem_write] 0x60000035c240 blen: 2698 + 1349 +0ms
**** [mem_write] 0x60000035c240 blen: 4047 + 1349 +0ms
**** [mem_write] 0x60000035c240 blen: 5396 + 1349 +0ms
**** [mem_write] 0x60000035c240 blen: 6745 + 242 +0ms
mem_write logs with point this->sslBioToNetwork
calling BIO_reset(this->sslBioToNetwork)
:
...
And I couldn't get those logs because... Honestly I don't know what I'm doing. I don't see those mem_write
logs always. In fact they do NOT happen in most of cases. Here for example I've restarted the browser tab running mediasoup-demo so previous mediasoup Router is closed and a new one is created, new WebRtcTransport, etc. And I do NOT see those mem_write
logs:
RTC::DtlsTransport::DtlsTransport() | ------- this->sslBioToNetwork address: 0x600000330240
mediasoup:Channel request succeeded [method:ROUTER_CREATE_WEBRTCTRANSPORT_WITH_SERVER, id:7] +0ms
mediasoup:Transport constructor() +65ms
mediasoup:WebRtcTransport constructor() +0ms
mediasoup:Transport enableTraceEvent() +0ms
mediasoup:Channel request() [method:TRANSPORT_ENABLE_TRACE_EVENT] +3ms
mediasoup:Channel request succeeded [method:TRANSPORT_ENABLE_TRACE_EVENT, id:8] +0ms
mediasoup:Transport consumeData() +4ms
mediasoup:Channel request() [method:TRANSPORT_CONSUME_DATA] +4ms
mediasoup:Channel request succeeded [method:TRANSPORT_CONSUME_DATA, id:9] +1ms
mediasoup:DataConsumer constructor() +0ms
mediasoup:WebRtcTransport connect() +16ms
mediasoup:Channel request() [method:WEBRTCTRANSPORT_CONNECT] +11ms
mediasoup:Channel request succeeded [method:WEBRTCTRANSPORT_CONNECT, id:10] +0ms
mediasoup:Channel [pid:45028] RTC::IceServer::HandleTuple() | transition from state 'new' to 'connected' [hasUseCandidate:false, hasNomination:false, nomination:0] +4ms
mediasoup:Channel [pid:45028] RTC::WebRtcTransport::OnIceServerSelectedTuple() | ICE selected tuple +1ms
mediasoup:Channel [pid:45028] RTC::WebRtcTransport::OnIceServerConnected() | ICE connected +0ms
RTC::DtlsTransport::onSslBioOut() | ----- operationType == BIO_CB_WRITE
mediasoup:Channel [pid:45028] RTC::DtlsTransport::SendDtlsData() | 1349 bytes of DTLS data ready to be sent +1ms
RTC::DtlsTransport::onSslBioOut() | ----- operationType == (BIO_CB_RETURN | BIO_CB_WRITE)
RTC::DtlsTransport::onSslBioOut() | ----- operationType == BIO_CB_WRITE
mediasoup:Channel [pid:45028] RTC::DtlsTransport::SendDtlsData() | 1349 bytes of DTLS data ready to be sent +0ms
RTC::DtlsTransport::onSslBioOut() | ----- operationType == (BIO_CB_RETURN | BIO_CB_WRITE)
RTC::DtlsTransport::onSslBioOut() | ----- operationType == BIO_CB_WRITE
mediasoup:Channel [pid:45028] RTC::DtlsTransport::SendDtlsData() | 1349 bytes of DTLS data ready to be sent +0ms
RTC::DtlsTransport::onSslBioOut() | ----- operationType == (BIO_CB_RETURN | BIO_CB_WRITE)
cmd> RTC::DtlsTransport::onSslBioOut() | ----- operationType == BIO_CB_WRITE
mediasoup:Channel [pid:45028] RTC::DtlsTransport::SendDtlsData() | 1349 bytes of DTLS data ready to be sent +235ms
RTC::DtlsTransport::onSslBioOut() | ----- operationType == (BIO_CB_RETURN | BIO_CB_WRITE)
RTC::DtlsTransport::onSslBioOut() | ----- operationType == BIO_CB_WRITE
mediasoup:Channel [pid:45028] RTC::DtlsTransport::SendDtlsData() | 1349 bytes of DTLS data ready to be sent +0ms
RTC::DtlsTransport::onSslBioOut() | ----- operationType == (BIO_CB_RETURN | BIO_CB_WRITE)
RTC::DtlsTransport::onSslBioOut() | ----- operationType == BIO_CB_WRITE
mediasoup:Channel [pid:45028] RTC::DtlsTransport::SendDtlsData() | 242 bytes of DTLS data ready to be sent +0ms
RTC::DtlsTransport::onSslBioOut() | ----- operationType == (BIO_CB_RETURN | BIO_CB_WRITE)
mediasoup:Channel [pid:45028] RTC::DtlsTransport::SetTimeout() | DTLS timer set in 99ms +0ms
mediasoup:Channel [pid:45028] RTC::IceServer::HandleTuple() | transition from state 'connected' to 'completed' [hasUseCandidate:true, hasNomination:false, nomination:0] +0ms
mediasoup:Channel [pid:45028] RTC::WebRtcTransport::OnIceServerCompleted() | ICE completed +1ms
mediasoup:Channel [pid:45028] RTC::DtlsTransport::SetTimeout() | DTLS timer set in 99ms +0ms
mediasoup:Channel [pid:45028] RTC::DtlsTransport::SetTimeout() | DTLS timer set in 99ms +0ms
RTC::DtlsTransport::onSslBioOut() | ----- operationType == BIO_CB_WRITE
mediasoup:Channel [pid:45028] RTC::DtlsTransport::SendDtlsData() | 67 bytes of DTLS data ready to be sent +2ms
RTC::DtlsTransport::onSslBioOut() | ----- operationType == (BIO_CB_RETURN | BIO_CB_WRITE)
RTC::DtlsTransport::onSslBioOut() | ----- operationType == BIO_CB_WRITE
mediasoup:Channel [pid:45028] RTC::DtlsTransport::SendDtlsData() | 129 bytes of DTLS data ready to be sent +0ms
RTC::DtlsTransport::onSslBioOut() | ----- operationType == (BIO_CB_RETURN | BIO_CB_WRITE)
mediasoup:Channel [pid:45028] RTC::DtlsTransport::SetTimeout() | timeout is 0, calling OnTimer() callback directly +1ms
mediasoup:Channel [pid:45028] RTC::DtlsTransport::OnTimer() | handshake is done so return +0ms
mediasoup:Channel [pid:45028] RTC::DtlsTransport::SetTimeout() | timeout is 0, calling OnTimer() callback directly +0ms
mediasoup:Channel [pid:45028] RTC::DtlsTransport::OnTimer() | handshake is done so return +0ms
RTC::DtlsTransport::onSslBioOut() | ----- operationType == BIO_CB_WRITE
mediasoup:Channel [pid:45028] RTC::DtlsTransport::SendDtlsData() | 385 bytes of DTLS data ready to be sent +1ms
RTC::DtlsTransport::onSslBioOut() | ----- operationType == (BIO_CB_RETURN | BIO_CB_WRITE)
mediasoup:Channel [pid:45028] RTC::DtlsTransport::SetTimeout() | timeout is 0, calling OnTimer() callback directly +0ms
mediasoup:Channel [pid:45028] RTC::DtlsTransport::OnTimer() | handshake is done so return +0ms
RTC::DtlsTransport::onSslBioOut() | ----- operationType == BIO_CB_WRITE
mediasoup:Channel [pid:45028] RTC::DtlsTransport::SendDtlsData() | 85 bytes of DTLS data ready to be sent +1ms
RTC::DtlsTransport::onSslBioOut() | ----- operationType == (BIO_CB_RETURN | BIO_CB_WRITE)
mediasoup:Channel [pid:45028] RTC::DtlsTransport::SetTimeout() | timeout is 0, calling OnTimer() callback directly +0ms
mediasoup:Channel [pid:45028] RTC::DtlsTransport::OnTimer() | handshake is done so return +0ms
RTC::DtlsTransport::onSslBioOut() | ----- operationType == BIO_CB_WRITE
mediasoup:Channel [pid:45028] RTC::DtlsTransport::SendDtlsData() | 45 bytes of DTLS data ready to be sent +0ms
RTC::DtlsTransport::onSslBioOut() | ----- operationType == (BIO_CB_RETURN | BIO_CB_WRITE)
mediasoup:Channel [pid:45028] RTC::DtlsTransport::SetTimeout() | timeout is 0, calling OnTimer() callback directly +0ms
mediasoup:Channel [pid:45028] RTC::DtlsTransport::OnTimer() | handshake is done so return +0ms
I've also added this at the top of DtlsTransport::SendDtlsData()
:
long bioRead;
char* bioData{ nullptr };
bioRead = BIO_get_mem_data(this->sslBioToNetwork, &bioData);
MS_DUMP("---- BIO_get_mem_data(this->sslBioToNetwork): %ld", bioRead);
And it absolutely always returns 1:
RTC::DtlsTransport::SendDtlsData() | ---- BIO_get_mem_data(this->sslBioToNetwork): 1
So honestly I don't understand absolutely anything.
Docs:
`BIO_get_mem_data() returns the total number of bytes available on success, 0 if b is NULL, or a negative value in case of other errors.
If the problem doesn't occur for you, I'd be very curious where in this case that buffer is emptied
To be very clear about this, it's not that the issue doesn't occur for me. It's that I have no idea how to know if it is happening because the log you added in OpenSSL prints at completely unexpected times or doesn't not print at all (not even if 0 because I removed the if condition).
Fascinating problem!
If you don't see the writes after removing the if
when sending normal messages, it's possible that this is a completely different code path. Could it be that control messages, like the certificate and ACK packets are sent over a different channel from the actual data?
For me, BIO_get_mem_data(this->sslBioToNetwork)
has also always returned 1, confusingly enough. And that also didn't make sense to me, so that's why I added that printf into the openssl code. That's the only place where I could actually see the length of the buffer increase.
Any plans to switch to a different SSL library by any chance? I'm not very impressed by the OpenSSL API and the code quality. And they've got a bad reputation with a lot of security problems.
I don't think I've got time today, but I may look into this some more next week. I'm very curious about what's going on!
If you don't see the writes after removing the
if
when sending normal messages, it's possible that this is a completely different code path. Could it be that control messages, like the certificate and ACK packets are sent over a different channel from the actual data?
Maybe.
Any plans to switch to a different SSL library by any chance? I'm not very impressed by the OpenSSL API and the code quality. And they've got a bad reputation with a lot of security problems.
Honestly not in the short term. In the past I did some research and couldn't find anything that could be used for proper DTLS+WebRTC (missing features such as SRTP keys negotiation over DTLS or DTLS timeout capabilities), so I gave up.
I don't think I've got time today, but I may look into this some more next week. I'm very curious about what's going on!
I'd appreciate if we could have a proper way to verify the leak so, having that, we can validate possible fixes. Right now I have a local branch which implements the PR that was reverted plus the BIO_reset(this->sslBioToNetwork)
at the end of SendDtlsData()
, but if we don't really know whether that fixes the problem or not then it's not good.
@gkrol-katmai, add fflush(stdout);
after your printf
in bss_mem.c
and it will always print when expected.
Ok, so I can confirm that by calling BIO_reset(this->sslBioToNetwork)
at the end of DtlsTransport::SendDtlsData()
(in the code of this PR https://github.com/versatica/mediasoup/pull/1156), the leak goes away.
However I'm not confident. Are we completely sure that we must clear/free/release all the content in this->sslBioToNetwork
BIO mem after DtlsTransport::SendDtlsData()
has been called triggered by the OpenSSL onSslBioOut
callback with BIO_CB_WRITE
operation?
I mean, imagine that OpenSSL wants to send 2 DTLS messages (or 2 DTLS message fragments), first one is 1500 bytes long, second one is 500 bytes longs. Somehow OpenSSL writes both together in the BIO mem, however it invokes the onSslBioOut
with BIO_CB_WRITE
twice sequentially, one with len: 1500
and another one with len: 500
.
If we follow the logic above and call BIO_reset(this->sslBioToNetwork)
at the end of DtlsTransport::SendDtlsData()
, then we are releasing 1500+500 bytes from the BIO mem rather than just the 1500 bytes that we have already sent to the endpoint. How can we know if that can happen or not?
I mean, imagine that OpenSSL wants to send 2 DTLS messages (or 2 DTLS message fragments), first one is 1500 bytes long, second one is 500 bytes longs. Somehow OpenSSL writes both together in the BIO mem, however it invokes the
onSslBioOut
withBIO_CB_WRITE
twice sequentially, one withlen: 1500
and another one withlen: 500
.If we follow the logic above and call
BIO_reset(this->sslBioToNetwork)
at the end ofDtlsTransport::SendDtlsData()
, then we are releasing 1500+500 bytes from the BIO mem rather than just the 1500 bytes that we have already sent to the endpoint. How can we know if that can happen or not?
Ok, I'm adding proper logs and it looks safe. Indeed OpenSSL invokes the onSslBioOut
with ALL the bytes that it has written into the BIO mem, so it's safe to call BIO_reset(this->sslBioToNetwork)
after sending them to the endpoint:
RTC::DtlsTransport::DtlsTransport() | ------- this->sslBioToNetwork address: 0x6000028a0240 +350ms
RTC::DtlsTransport::SendDtlsData() | >>> 1349 bytes of DTLS data ready to be sent +0ms
**** [mem_write] 0x6000028a0240 blen: 0 + 1349 +0ms
RTC::DtlsTransport::SendDtlsData() | >>> 1349 bytes of DTLS data ready to be sent +0ms
**** [mem_write] 0x6000028a0240 blen: 0 + 1349 +0ms
RTC::DtlsTransport::SendDtlsData() | >>> 1349 bytes of DTLS data ready to be sent +0ms
**** [mem_write] 0x6000028a0240 blen: 0 + 1349 +0ms
RTC::DtlsTransport::SendDtlsData() | >>> 1349 bytes of DTLS data ready to be sent +0ms
**** [mem_write] 0x6000028a0240 blen: 0 + 1349 +0ms
RTC::DtlsTransport::SendDtlsData() | >>> 1349 bytes of DTLS data ready to be sent +0ms
**** [mem_write] 0x6000028a0240 blen: 0 + 1349 +0ms
RTC::DtlsTransport::SendDtlsData() | >>> 242 bytes of DTLS data ready to be sent +0ms
**** [mem_write] 0x6000028a0240 blen: 0 + 242 +0ms
RTC::DtlsTransport::SendDtlsData() | >>> 67 bytes of DTLS data ready to be sent +0ms
**** [mem_write] 0x6000028a0240 blen: 0 + 67 +0ms
RTC::DtlsTransport::SendDtlsData() | >>> 129 bytes of DTLS data ready to be sent +0ms
**** [mem_write] 0x6000028a0240 blen: 0 + 129 +0ms
RTC::DtlsTransport::SendDtlsData() | >>> 385 bytes of DTLS data ready to be sent +0ms
**** [mem_write] 0x6000028a0240 blen: 0 + 385 +0ms
RTC::DtlsTransport::SendDtlsData() | >>> 45 bytes of DTLS data ready to be sent +0ms
**** [mem_write] 0x6000028a0240 blen: 0 + 45 +0ms
RTC::DtlsTransport::SendDtlsData() | >>> 57 bytes of DTLS data ready to be sent +1ms
**** [mem_write] 0x6000028a0240 blen: 0 + 57 +0ms
RTC::DtlsTransport::SendDtlsData() | >>> 85 bytes of DTLS data ready to be sent +848ms
**** [mem_write] 0x6000028a0240 blen: 0 + 85 +1ms
So here we are: PR https://github.com/versatica/mediasoup/pull/1345
@gkrol-katmai @pnts-se would you like to take a look to it, please?
Bug Report
We discovered a problem with the memory usage of Mediasoup Workers growing over time. The memory would go down only when clients disconnected. We reproduced the problem by using clients using msc-node/werift that create only a data producer and then send a lot of messages (3000/s). The problem also occurs when sending at a slower rate. Note that you do not need any consumers for the problem to occur.
Cause
I've tracked the problem down to the following. When data is being sent over the WebRTC channel from the client to the server the server will send ACK messages back. These are sent to OpenSSL to be encrypted. Recently, in 3.13.0, changes were introduced (in the ‘flatbuffers’ change), that include changes to the way the OpenSSL data is handled. These changes cause a buffer (DtlsTransport.sslBioToNetwork, aa BIO_s_mem) for the data coming out of OpenSSL to keep growing.
It looks like this bug was introduced due to a misunderstanding on how BIO_set_callback_ex operates. As far as I can see from the documentation this "can be used for debugging purposes to trace operations on a BIO or to modify its operation." The way it is currently used in Mediasoup is to spy on the 'write' operations OpenSSL does to the outgoing buffer (DtlsTransport.sslBioToNetwork). The data is then sent out over the network. However, it does not remove the data from the buffer, resulting in the leak.
Workaround
A workaround is to to go back to version 3.12.16, which does not experience the problem. We'll be doing this for now.
Fix?
I don't know anything about OpenSSL, but I could get the problem to go away by placing
(void)BIO_reset(this->sslBioToNetwork);
at the end ofDtlsTransport::SendDtlsData
. However, I do have some doubts on what is going on with the callback, as the documentation says it's called twice for every operation, and perhaps we're sending the data twice now? That would have to be investigated.If there's no OpenSSL experts around, I can try doing a PR. However, it will have to be reviewed and tested well as I'd be mostly programming in the dark. And that's dangerous when encryption and networking is involved.
More Info
The PR that introduced the problem (3.13.0): Worker: Make DTLS fragment stay within MTU size range (https://github.com/versatica/mediasoup/pull/1156, based on https://github.com/versatica/mediasoup/pull/1143 by @vpnts-se).
Note that this is not a 'true' memory leak, as the memory is freed when the client disconnects.