Worker panic in Rust in WebRtcMessage::new

PaulOlteanu commented 9 months ago

Bug Report

Issue description

I'm seeing occasional panics in the call to WebRtcMessage::new because of that unwrap. I never call that function myself, so I don't know where the invalid byte(s) would be coming from

Here's a log from the panic.

ibc commented 9 months ago

Also, unrelated but the docs for mediasoup 0.13 failed to build. I can file that as a separate issue if you'd like

Yes, please. With STR.

ibc commented 9 months ago

CC @nazar-pc

PaulOlteanu commented 9 months ago

Yes, please. With STR.

Not sure what STR is, but I filed an issue for that.

Thanks guys :+1:

ibc commented 9 months ago

"Steps To Reproduce" ☺️

jmillan commented 9 months ago

The provided error line is saying that the string is not a valid UTF8.

error: Utf8Error { valid_up_to: 66119, error_len: Some(1) } }.

Make sure you are providing a valid UTF8. At the end of the privided string there is an invalid character in the indicated possition (after the last bear):

🧸🧸🧸🧸🧸🧸🧸🧸🧸🧸🧸🧸🧸🧸🧸🧸🧸🧸🧸🧸🧸🧸🧸🧸🧸🧸🧸🧸🧸🧸🧸🧸🧸🧸🧸🧸🧸🧸🧸🧸🧸🧸🧸🧸🧸🧸🧸🧸🧸🧸�{"id":"657eb592-6fd1-4877-afb4-908c7c9072e8","chat":"qegrwewegw","media":[]}

jmillan commented 9 months ago

Reopening as we should not panic in this case.

PaulOlteanu commented 9 months ago

Yeah, sorry for not really explaining. I am sending user-generated text over a data channel. I have filtering to not allow for 40,000 bear emojis now but in general I would expect some kind of error rather than a panic.

nazar-pc commented 9 months ago

Hm, Rust API should either work with Strings (which are guaranteed to be valid UTF-8 unless we incorrectly use unchecked unsafe methods) or raw bytes that can be anything. Do we mix them somewhere?

@PaulOlteanu are you sending strings or bytes via available Rust API?

PaulOlteanu commented 9 months ago

My flow is I receive data on the server via a direct data consumer's on_message, validate it, then forward it via direct data producer. This would've been created via WebRtcMessage::String(message) on my end.

I believe I saw this panic happen once where I would not have forwarded it via data producer, which leads me to believe this happens when receiving data on the server from the client.

nazar-pc commented 9 months ago

So if it was a string when you sent it and you didn't use any unsafe to create a string from raw bytes, it must be a memory corruption or mishandling somewhere on mediasoup's end. That is a useful information, thanks!

If you can manage to create a reproducible app (maybe by modifying one of the examples) that'd help a lot with debugging and fixing.

PaulOlteanu commented 9 months ago

I tried modifying the rust echo example to reproduce this but I can't get it to happen just from the echo example client. I'm trying to similarly send 40,000 teddy bear emojis, but it seems to handle it fine. I tried with both mediasoup 0.12 which I was using at the time, and the master branch. I also tried with both a direct consumer on the server and a consumer over a webrtc transport

The client that caused this panic originally just sends the data with dataProducer.send(JSON.stringify(message)), (which makes sense because the server is trying to interpret the message as a WebRtcMessage::String), so I'm not really sure where the invalid bytes are coming from.

ibc commented 9 months ago

Is there something that can be done here?

Is it "expected" that mediasoup Rust crashes (due to unhandled error) if a DataChannel message of type string with invalid UTF-8 data is received? I mean, that would be terrible since it means that any remote user could crash it.

AFAIS that's what is happening here. In data_structures.rs:

impl<'a> WebRtcMessage<'a> {
    // +------------------------------------+-----------+
    // | Value                              | SCTP PPID |
    // +------------------------------------+-----------+
    // | WebRTC String                      | 51        |
    // | WebRTC Binary Partial (Deprecated) | 52        |
    // | WebRTC Binary                      | 53        |
    // | WebRTC String Partial (Deprecated) | 54        |
    // | WebRTC String Empty                | 56        |
    // | WebRTC Binary Empty                | 57        |
    // +------------------------------------+-----------+

    pub(crate) fn new(ppid: u32, payload: Cow<'a, [u8]>) -> Result<Self, u32> {
        match ppid {
            51 => Ok(WebRtcMessage::String(
                String::from_utf8(payload.to_vec()).unwrap(),
            )),
            53 => Ok(WebRtcMessage::Binary(payload)),
            56 => Ok(WebRtcMessage::EmptyString),
            57 => Ok(WebRtcMessage::EmptyBinary),
            ppid => Err(ppid),
        }
    }

If ppid is 51 (WebRTC string, AKA UTF-8 string), the data is given to String::from_utf8() which will just throw if no valid UTF-8 is given. We should be ready to catch that error.

nazar-pc commented 9 months ago

Assuming worker and library work correctly, it will never panic. But something somewhere was mishandled. There is either logic error in worker or memory safety issue or something else somewhere that causes this. Handling error is not too bad of an idea, but the root cause still needs to be fixed.

ibc commented 9 months ago

@nazar-pc I don't fully understand some points here, so I'll write specific questions taking into account the following fact:

The remote endpoint perfectly could send invalid UTF8 data into a ppid 51 (AKA WebRTC UTF8 string) in a DataChannel message. So questions:

If that happens and String::from_utf8(data) throws, will the Rust process crash or generate an unhandled exception that would make the Rust app exit?
If answer to 1. is yes, then we MUST try/catch/except/whatever in the above code to prevent that, right? And if so, can you do it please? I promise I spent lot of time yesterday trying to apply a try/catch Rust thing and I couldn't figure out how to do it.
You say that there is a memory safety issue in the code. Do you mean in Rust code, right? If so, how to diagnose it? I mean, it's clear that if invalid UTF8 data is passed to String::from_utf8() it will throw, how could that produce a panic? And how/where could be a memory safety issue in Rust?

nazar-pc commented 9 months ago

The remote endpoint perfectly could send invalid UTF8 data into a ppid 51 (AKA WebRTC UTF8 string) in a DataChannel message

Technically yes, but it shouldn't really be possible if application is written fully in Rust and reliable data channels are used. But it is possible with unreliable data channels indeed.

If that happens and String::from_utf8(data) throws, will the Rust process crash or generate an unhandled exception that would make the Rust app exit?

Panic will be generated. Panic does crash an app by default, but it is possible to catch it too (though I'm not sure if that'll work well across FFI boundary).

I promise I spent lot of time yesterday trying to apply a try/catch Rust thing and I couldn't figure out how to do it.

For panics https://doc.rust-lang.org/std/panic/fn.catch_unwind.html can be used if you can wrap the caller or https://doc.rust-lang.org/std/panic/fn.set_hook.html if you can't, but still want to catch it (I would strongly not recommend that).

Now that I thought about some more, I think it might be related to unreliable data channels, in which case perfectly valid UTF-8 can be corrupted and become invalid. @PaulOlteanu can you confirm this theory?

If that is the case, I think what we should do is to introduce a wrapper type that will be either a string if it is valid or just bytes when it is not possible to construct a valid UTF-8 string.

Something like this:

enum MaybeString {
    String(String),
    Bytes(Vec<u8>),
}

Then WebRtcMessage::String would actually hold MaybeString and user will have to decide what to do with those bytes if they are not a valid UTF-8 string. Alternatively we could just always use WebRtcMessage::String(Vec<u8>) and force user to convert them into string if necessary and deal with this edge-case there.

You say that there is a memory safety issue in the code. Do you mean in Rust code, right?

I do not see safety issues in Rust code right now, so that would be in C++ land, but unreliable data channels seem like the most likely reason here and only application can decide what to do in that case (could be to ignore the message, recover valid bytes, etc.).

ibc commented 9 months ago

The remote endpoint perfectly could send invalid UTF8 data into a ppid 51 (AKA WebRTC UTF8 string) in a DataChannel message

Technically yes, but it shouldn't really be possible if application is written fully in Rust and reliable data channels are used.

What application? You mean the remote endpoint? It doesn't matter. A remote endpoint should not be able to crash a server no matter in which language it's written.

But it is possible with unreliable data channels indeed.

See OnUsrSctpReceiveSctpData() in SctpAssociation.cpp. Unreliable datachannel messages can arrive in wrong order or some of them could be lost. So if a big UTF8 string is sent into N SCTP chunks, and some of those chunks are lost, the resulting string could indeed be invalid UTF8.

If that happens and String::from_utf8(data) throws, will the Rust process crash or generate an unhandled exception that would make the Rust app exit?

Panic will be generated. Panic does crash an app by default, but it is possible to catch it too (though I'm not sure if that'll work well across FFI boundary).

What do you mean with "FFI boundary"? Perhaps I don't understand the scenario correctly but AFAIU this is:

Server mediasoup Rust app.
Remote client sends a DataChannel string message with invalid UTF8 bytes.
mediasoup worker receives it and passes it to Rust layer.
Rust layer tries to generate a UTF8 string and it fails.
It should not crash/terminate the Rust app but instead should catch the decoding error, show a warning log and ignore the received data.

I promise I spent lot of time yesterday trying to apply a try/catch Rust thing and I couldn't figure out how to do it.

For panics https://doc.rust-lang.org/std/panic/fn.catch_unwind.html can be used if you can wrap the caller or https://doc.rust-lang.org/std/panic/fn.set_hook.html if you can't, but still want to catch it (I would strongly not recommend that).

Ok, I'll try to add the former.

If that is the case, I think what we should do is to introduce a wrapper type that will be either a string if it is valid or just bytes when it is not possible to construct a valid UTF-8 string.

Nope. In WebRTC, DataChannel messages can be "WebRTC UTF8 String" (SCTP PPID 51) or "WebRTC Binary" (SCTP PPID 53). Well, there are more types:

+-------------------------------+----------+
| Value                         | SCTP     |
|                               | PPID     |
+-------------------------------+----------+
| WebRTC String                 | 51       |
| WebRTC Binary Partial         | 52       |
| (Deprecated)                  |          |
| WebRTC Binary                 | 53       |
| WebRTC String Partial         | 54       |
| (Deprecated)                  |          |
| WebRTC String Empty           | 56       |
| WebRTC Binary Empty           | 57       |
+-------------------------------+----------+

The receiver should not do magic if data is invalid. The sender should specify PPID 53 if the content of the message is not UTF8 string. In receiver side, it should ignore the received message if is invalid, which can only happen if it's WebRTC string with invalid UTF8 (due to whatever).

You say that there is a memory safety issue in the code. Do you mean in Rust code, right?

I do not see safety issues in Rust code right now, so that would be in C++ land

Any evidence that there could be a problem in C++ side? The log in the issue clearly points to Rust. I may be wrong but AFAIU this is just about the mediasoup Rust code not being ready to catch errors when decoding invalid UTF8.

nazar-pc commented 9 months ago

What application? You mean the remote endpoint? It doesn't matter. A remote endpoint should not be able to crash a server no matter in which language it's written.

Yeah, you're right actually.

What do you mean with "FFI boundary"?

C++ code calls Rust function, if Rust function panics (like in this case) C++ code will have no idea what to do with that panic, it will not be able to handle it gracefully because it is a different language with different convention. Just like it is not easy to catch C++ exception in Rust.

Ok, I'll try to add the former.

You don't need to though if it is related to unreliable data channel.

Nope. In WebRTC, DataChannel messages can be "WebRTC UTF8 String" (SCTP PPID 51) or "WebRTC Binary" (SCTP PPID 53). Well, there are more types:

Well, then it is a bug in the worker that it sends strings, which are not strings, back to Rust?

In receiver side, it should ignore the received message if is invalid, which can only happen if it's WebRTC string with invalid UTF8 (due to whatever).

Are you sure ignoring is the right thing though? It might be very hard to debug if some messages just disappear. We should at least have a warning for this.

Any evidence that there could be a problem in C++ side? The log in the issue clearly points to Rust. I may be wrong but AFAIU this is just about the mediasoup Rust code not being ready to catch errors when decoding invalid UTF8.

Rust side gets incorrect bytes, they come from C++ and I see no bugs in very limited Rust code that would corrupt in-memory bytes, so it is the most likely that something in the worker is not slicing bytes properly or corrupts memory somewhere. Most of Rust code is safe where such things are not possible.

ibc commented 9 months ago

What do you mean with "FFI boundary"?

C++ code calls Rust function, if Rust function panics (like in this case) C++ code will have no idea what to do with that panic, it will not be able to handle it gracefully because it is a different language with different convention. Just like it is not easy to catch C++ exception in Rust.

Yes, clear. That's why I always assumed this must be done when the error happens.

Ok, I'll try to add the former.

You don't need to though if it is related to unreliable data channel.

Why not? We don't want the server app to crash.

Nope. In WebRTC, DataChannel messages can be "WebRTC UTF8 String" (SCTP PPID 51) or "WebRTC Binary" (SCTP PPID 53). Well, there are more types:

Well, then it is a bug in the worker that it sends strings, which are not strings, back to Rust?

Here the thing, in worker we don't validate the UTF8 content of ppid 51 received or sent messages. In fact, in Node we do not even emit a string but a Buffer. The Node app can convert it to UTF8 string later:

https://mediasoup.org/documentation/v3/mediasoup/api/#dataConsumer-on-message

I'm not sure what's the right way to go is here. Should we validate UTF8 messages in worker side?

In receiver side, it should ignore the received message if is invalid, which can only happen if it's WebRTC string with invalid UTF8 (due to whatever).

Are you sure ignoring is the right thing though? It might be very hard to debug if some messages just disappear. We should at least have a warning for this.

Yes, I also said somewhere else that we should log a warning. But we should not expose invalid UTF8 strings to the user app (despite we do in Node). As said above this deserves a proper decision.

Any evidence that there could be a problem in C++ side? The log in the issue clearly points to Rust. I may be wrong but AFAIU this is just about the mediasoup Rust code not being ready to catch errors when decoding invalid UTF8.

Rust side gets incorrect bytes, they come from C++ and I see no bugs in very limited Rust code that would corrupt in-memory bytes, so it is the most likely that something in the worker is not slicing bytes properly or corrupts memory somewhere. Most of Rust code is safe where such things are not possible.

User sends a very long valid UTF8 string over unreliable datachannel.
So long that it requires 3 SCTP separate chunks.
One of those 3 chunks is lost or they arrive in wrong order to mediasoup.
Here we may get an invalid UTF8 string.

Yes, like it or not but there is no way to know if a chunk has been lost in the API exposed by usrsctp. See OnUsrSctpReceiveSctpData() in SctpAssociation.cpp. There is no "chunk sequence number" at all.

nazar-pc commented 9 months ago

The Node app can convert it to UTF8 string later I'm not sure what's the right way to go is here. Should we validate UTF8 messages in worker side?

I think doing the same in Rust makes sense to me. Yes, it is a bit awkward and we can document that in case of reliable data channels it will always contain valid UTF-8 string, but move the responsibility to actually convert it and handle errors onto the user of the library.

If we can't guarantee it is a valid UTF-8, then I think the correct API is to expose raw bytes.

ibc commented 9 months ago

we can document that in case of reliable data channels it will always contain valid UTF-8 string

This is not true. A malicious user could send a DataChannel message with ppip 51 and invalid UTF-8 bytes.

If we can't guarantee it is a valid UTF-8, then I think the correct API is to expose raw bytes.

I see tons of usages of WebRtcMessage enum, for receiving and for sending. Not sure how it should be done.

BTW why the underscore here in _payload?

impl DirectDataConsumer {
    /// Sends direct messages from the Rust process.
    pub async fn send(&self, message: WebRtcMessage<'_>) -> Result<(), RequestError> {
        let (ppid, _payload) = message.into_ppid_and_payload();

        self.inner
            .channel
            .request(
                self.inner.id,
                DataConsumerSendRequest {
                    ppid,
                    payload: _payload.into_owned(),
                },
            )
            .await
    }
}

nazar-pc commented 9 months ago

This is not true. A malicious user could send a DataChannel message with ppip 51 and invalid UTF-8 bytes.

Well, then we totally should have just bytes.

BTW why the underscore here in _payload?

It is a typo, shouldn't have _ prefix.

ibc commented 9 months ago

What should I do with this?

/// Container used for sending/receiving messages using `DirectTransport` data producers and data
/// consumers.
#[derive(Debug, Clone)]
pub enum WebRtcMessage<'a> {
    /// String
    String(String),
    /// Binary
    Binary(Cow<'a, [u8]>),
    /// EmptyString
    EmptyString,
    /// EmptyBinary
    EmptyBinary,
}

Do we still need this WebRtcMessage enum? Should we turn it into some type / alias? Or just replace it with Binary(Cow<'a, [u8]>) everywhere?

nazar-pc commented 9 months ago

No, we still want to know that it was supposed to be a string, so we just change String(String) to String(Cow<'a, [u8]>).

ibc commented 9 months ago

No, we still want to know that it was supposed to be a string, so we just change String(String) to String(Cow<'a, [u8]>).

I don't know what that means. A string without encoding? The user must encode it into UTF-8?

nazar-pc commented 9 months ago

I mean when message is sent or received we want to know which PPID it corresponds to, that is what enum variants represent.

ibc commented 9 months ago

Does this look good? https://github.com/versatica/mediasoup/pull/1289

PaulOlteanu commented 9 months ago

I like the solution of letting us parse the bytes :+1: thanks guys.

The client that was sending these messages does set maxRetransmits to 1, which I believe results in an unreliable data channel, so I guess that would mean the data could be getting corrupted that way?

Out of curiosity, is it also possible the ppid could get invalidated when using an unreliable channel, and thus the "wrong" enum variant would be created? (not really a problem as that can be handled, just curious)

ibc commented 9 months ago

The client that was sending these messages does set maxRetransmits to 1, which I believe results in an unreliable data channel, so I guess that would mean the data could be getting corrupted that way?

Yes. Unreliable data channels should never be used to send big content since content is split into chunks, each chunk is carried in a separate SCTP packet and some SCTP packets may get lost or arrive in wrong order.

Out of curiosity, is it also possible the ppid could get invalidated when using an unreliable channel, and thus the "wrong" enum variant would be created? (not really a problem as that can be handled, just curious)

No, the ppid cannot be lost, it's part of the SCTP message/chunk. It arrives or doesn't arrive.

versatica / mediasoup

Worker panic in Rust in WebRtcMessage::new #1269

Bug Report

Issue description