versatica / mediasoup

Cutting Edge WebRTC Video Conferencing
https://mediasoup.org
ISC License
6.27k stars 1.13k forks source link

WorkerSettings: Add disableLiburing option (enable_liburing in Rust) #1442

Closed ibc closed 3 months ago

ibc commented 3 months ago

Details

ibc commented 3 months ago

OMG Rust tests fail because of this:

mediasoup-worker::mediasoup_worker_run() | settings error: unknown long option given as argument

I hate that getopt.h ancient thing...

And BTW it fails due to this obviously wrong code that I'm fixing now, but it doesn't justify the problem:

if enable_liburing {
    spawn_args.push("--disable_liburing".to_string());
}
ibc commented 3 months ago

Ok, this is done.

ibc commented 3 months ago

Why is this happening in CI????

https://github.com/versatica/mediasoup/actions/runs/10323061408/job/28579657794?pr=1442

cargo clippy works in my machine and we force Rust version!!!!

ibc commented 3 months ago

OMG I don't want to spend all the weekend with this.

ibc commented 3 months ago

Ok, cargo clippy --all-targets -- -D warnings fails in local too.

nazar-pc commented 3 months ago

I suspect you might be forcing a different version locally. It wants you to write this idiomatic Rust instead:

let settings = WorkerSettings {
    enable_liburing: false,
    ..WorkerSettings::default()
};
ibc commented 3 months ago

Why does it complain about this?

error: field assignment outside of initializer for an instance created with Default::default()
  --> rust/src/router/tests.rs:21:13
   |
21 |             settings.enable_liburing = false;
   |             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   |
note: consider initializing the variable with `worker::WorkerSettings { enable_liburing: false, ..Default::default() }` and removing relevant reassignments
  --> rust/src/router/tests.rs:20:13
   |
20 |             let mut settings = WorkerSettings::default();
   |             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This is the code:

worker_manager
        .create_worker({
            let mut settings = WorkerSettings::default();
            settings.enable_liburing = false;

            settings
        })
        .await
        .expect("Failed to create worker")

We do THE SAME in many other files such as here:

let worker = worker_manager
            .create_worker({
                let mut settings = WorkerSettings::default();

                settings.log_level = WorkerLogLevel::Debug;
                settings.log_tags = vec![WorkerLogTag::Info];
                settings.rtc_port_range = 0..=9999;
                settings.dtls_files = Some(WorkerDtlsFiles {
                    certificate: "tests/integration/data/dtls-cert.pem".into(),
                    private_key: "tests/integration/data/dtls-key.pem".into(),
                });
                settings.libwebrtc_field_trials =
                    Some("WebRTC-Bwe-AlrLimitedBackoff/Disabled/".to_string());
                settings.app_data = AppData::new(CustomAppData { bar: 456 });

                settings
            })
            .await
            .expect("Failed to create worker with custom settings");
nazar-pc commented 3 months ago

I suspect it has a heuristic about number of fields or just didn't learn to handle more complex cases, not sure which one of those two

ibc commented 3 months ago

Pushed, should be green now.

nazar-pc commented 3 months ago

Generally what it suggests does look a bit nicer and even shorter, but not always

ibc commented 3 months ago

OMG now tests failing...

ibc commented 3 months ago

WTH???

https://github.com/versatica/mediasoup/actions/runs/10323816619/job/28581996102?pr=1442

  process didn't exit successfully: `/home/runner/work/mediasoup/mediasoup/target/debug/deps/mediasoup-8f5b5278ff2f31a1` (signal: 11, SIGSEGV: invalid memory reference)
nazar-pc commented 3 months ago

Memory issues... again :confused:

ibc commented 3 months ago

Memory corruption... again 😕

When did this happen in a test recently? I don't remember.

ibc commented 3 months ago

It fails consistently, this is depressing, no changes have been done that should affect this AT ALL:

ibc commented 3 months ago

No idea why the issue happens consistently but indeed it's the already know issue in Rust due to the separate usrsctp thread. I worked on this some months ago and there is an unfinished and very complex PR for which honestly I don't have time to revive now. I think I will disable these failing tests because we do know that the underlying code can fail. Well, all this assuming that the failing test is using two workers with pipe transports. If not, ignore all I said here.

ibc commented 3 months ago

Those are the ways Rust tests systematically fail. Since they run in parallel and taking into account that memory issues could make tests fail at any time once

running 27 tests
test data_structures::tests::dtls_fingerprint ... ok
test ortc::tests::generate_router_rtp_capabilities_succeeds ... ok
test ortc::tests::generate_router_rtp_capabilities_too_many_codecs ... ok
test ortc::tests::generate_router_rtp_capabilities_unsupported ... ok
test ortc::tests::get_producer_rtp_parameters_mapping_get_consumable_rtp_parameters_get_consumer_rtp_parameters_get_pipe_consumer_rtp_parameters_succeeds ... ok
test ortc::tests::get_producer_rtp_parameters_mapping_unsupported ... ok
test router::audio_level_observer::tests::router_close_event ... ok
test router::active_speaker_observer::tests::router_close_event ... ok
test router::consumer::tests::transport_close_event ... ok
test router::consumer::tests::producer_close_event ... ok
test router::direct_transport::tests::router_close_event ... ok
test router::data_producer::tests::transport_close_event ... ok
test router::data_consumer::tests::transport_close_event ... ok
test router::data_consumer::tests::data_producer_close_event ... ok
test router::pipe_transport::tests::data_producer_close_is_transmitted_to_pipe_data_consumer ... ok
test router::plain_transport::tests::router_close_event ... ok
test router::producer::tests::transport_close_event ... ok
error: test failed, to rerun pass `-p mediasoup --lib`

Caused by:
  process didn't exit successfully: `/home/runner/work/mediasoup/mediasoup/target/debug/deps/mediasoup-8f5b5278ff2f31a1` (signal: 11, SIGSEGV: invalid memory reference)
Error: Process completed with exit code 101.
##[debug]Finishing: cargo test
running 27 tests
test data_structures::tests::dtls_fingerprint ... ok
test ortc::tests::generate_router_rtp_capabilities_succeeds ... ok
test ortc::tests::generate_router_rtp_capabilities_too_many_codecs ... ok
test ortc::tests::generate_router_rtp_capabilities_unsupported ... ok
test ortc::tests::get_producer_rtp_parameters_mapping_unsupported ... ok
test ortc::tests::get_producer_rtp_parameters_mapping_get_consumable_rtp_parameters_get_consumer_rtp_parameters_get_pipe_consumer_rtp_parameters_succeeds ... ok
test router::audio_level_observer::tests::router_close_event ... ok
test router::active_speaker_observer::tests::router_close_event ... ok
test router::consumer::tests::producer_close_event ... ok
test router::data_consumer::tests::data_producer_close_event ... ok
test router::data_consumer::tests::transport_close_event ... ok
test router::consumer::tests::transport_close_event ... ok
test router::data_producer::tests::transport_close_event ... ok
test router::direct_transport::tests::router_close_event ... ok
test router::plain_transport::tests::router_close_event ... ok
test router::pipe_transport::tests::data_producer_close_is_transmitted_to_pipe_data_consumer ... ok
test router::producer::tests::transport_close_event ... ok
error: test failed, to rerun pass `-p mediasoup --lib`

Caused by:
  process didn't exit successfully: `/home/runner/work/mediasoup/mediasoup/target/debug/deps/mediasoup-8f5b5278ff2f31a1` (signal: 11, SIGSEGV: invalid memory reference)
Error: Process completed with exit code 101.
running 27 tests
test data_structures::tests::dtls_fingerprint ... ok
test ortc::tests::generate_router_rtp_capabilities_succeeds ... ok
test ortc::tests::generate_router_rtp_capabilities_too_many_codecs ... ok
test ortc::tests::generate_router_rtp_capabilities_unsupported ... ok
test ortc::tests::get_producer_rtp_parameters_mapping_unsupported ... ok
test ortc::tests::get_producer_rtp_parameters_mapping_get_consumable_rtp_parameters_get_consumer_rtp_parameters_get_pipe_consumer_rtp_parameters_succeeds ... ok
test router::active_speaker_observer::tests::router_close_event ... ok
test router::audio_level_observer::tests::router_close_event ... ok
test router::consumer::tests::transport_close_event ... ok
test router::consumer::tests::producer_close_event ... ok
test router::data_consumer::tests::transport_close_event ... ok
test router::direct_transport::tests::router_close_event ... ok
test router::data_producer::tests::transport_close_event ... ok
test router::data_consumer::tests::data_producer_close_event ... ok
test router::plain_transport::tests::router_close_event ... ok
error: test failed, to rerun pass `-p mediasoup --lib`

Caused by:
  process didn't exit successfully: `/home/runner/work/mediasoup/mediasoup/target/debug/deps/mediasoup-8f5b5278ff2f31a1` (signal: 11, SIGSEGV: invalid memory reference)
Error: Process completed with exit code 101.

I would like to say that this is pipe transport related but the third test attempt doesn't even execute any "pipe" test.

ibc commented 3 months ago

This is completely amazing. Rust tests in v3 branch don't fail. But Rust tests in this PR branch fial consistently. Changes in this PR cannot be the culprit at all, they do not affect anything in Rust.

ibc commented 3 months ago

@nazar-pc help please? I doubt that Rust tests are even failing due to this well know issue:

https://github.com/versatica/mediasoup/issues/1352

In a comment above you can see that sometimes pipe transports do not even run and tests still fail. Maybe I'm wrong but AFAIK we only run more than 1 workers in Rust tests in those pipe transport related tests, so I'm completely lost.

nazar-pc commented 3 months ago

Changes in this PR cannot be the culprit at all, they do not affect anything in Rust.

Either changes in this PR are the culprit (however unlikely that seems) or they trigger an issue that existed before as well, but was hard to reproduce.

Maybe I'm wrong but AFAIK we only run more than 1 workers in Rust tests in those pipe transport related tests, so I'm completely lost.

There are as many tests running concurrently as CPU cores on the machine by default, so we have many workers with potentially different settings running at the same time and order is not deterministic.

ibc commented 3 months ago

Maybe I'm wrong but AFAIK we only run more than 1 workers in Rust tests in those pipe transport related tests, so I'm completely lost.

There are as many tests running concurrently as CPU cores on the machine by default, so we have many workers with potentially different settings running at the same time and order is not deterministic.

Do you mean that different tests run in parallel in the same Rust process? I mean, of course! Ok, I assume there is no way in Rust to create a separate process for each test file, right?

nazar-pc commented 3 months ago

Do you mean that different tests run in parallel in the same Rust process? I mean, of course! Ok, I assume there is no way in Rust to create a separate process for each test file, right?

Yes, it does run in the same process by default. You could probably make it run in separate processes somehow, but we're dealing with memory safe language, so why, right? That is not a solution to the problem we have here anyway :slightly_smiling_face:

ibc commented 3 months ago

You could probably make it run in separate processes somehow

No, trust me, I couldn't.

but we're dealing with memory safe language, so why, right?

Because there is also C code running its own separate thread: https://github.com/versatica/mediasoup/issues/1352

ibc commented 3 months ago

Ok, some news:

So it seems that we are sending some command line arg without value. I will print things to debug.

ibc commented 3 months ago

Ok, some news:

  • I can reproduce the Rust test failure in a real Linux host, no need to wait for CI.
  • I've made command line argument mandatory here: 669ddca
  • Now Rust tests fail even in macOS with this error:
    mediasoup-worker::mediasoup_worker_run() | settings error: missing value in command line 
    argument: (null)
    test router::data_producer::tests::transport_close_event ... ok
    test router::tests::worker_close_event ... FAILED

So it seems that we are sending some command line arg without value. I will print things to debug.

Ignore, I forgot to add =true in Rust. Now things should work.

ibc commented 3 months ago

This is strongly depressing. Now WHAT????

running 27 tests
test data_structures::tests::dtls_fingerprint ... ok
test ortc::tests::generate_router_rtp_capabilities_succeeds ... ok
test ortc::tests::generate_router_rtp_capabilities_too_many_codecs ... ok
test ortc::tests::generate_router_rtp_capabilities_unsupported ... ok
test ortc::tests::get_producer_rtp_parameters_mapping_get_consumable_rtp_parameters_get_consumer_rtp_parameters_get_pipe_consumer_rtp_parameters_succeeds ... ok
test ortc::tests::get_producer_rtp_parameters_mapping_unsupported ... ok
test router::audio_level_observer::tests::router_close_event ... ok
test router::active_speaker_observer::tests::router_close_event ... ok
test router::data_consumer::tests::transport_close_event ... ok
test router::consumer::tests::producer_close_event ... ok
test router::consumer::tests::transport_close_event ... ok
test router::data_consumer::tests::data_producer_close_event ... ok
test router::data_producer::tests::transport_close_event ... ok
error: test failed, to rerun pass `-p mediasoup --lib`

Caused by:
  process didn't exit successfully: `/home/deploy/src/mediasoup-v3-deploy/target/debug/deps/mediasoup-8f5b5278ff2f31a1` (signal: 11, SIGSEGV: invalid memory reference)
ibc commented 3 months ago

I am 96% sure that the memory issue is here:

https://github.com/versatica/mediasoup/pull/1442/files#diff-5a9b1b4b01d5764dd21b1d1b7533d5795430dead60f84d450656db89da4e3930R18

ibc commented 3 months ago

Ohhh, so we are considering that liburing is enabled or disabled for all workers, but Rust tests run in parallel in same process and in just one of those tests we were disabling liburing. In the other ones liburing is enabled. So here the problem:

void DepLibUring::ClassInit()
{
    const auto mayor = io_uring_major_version();
    const auto minor = io_uring_minor_version();

    MS_DEBUG_TAG(info, "liburing version: \"%i.%i\"", mayor, minor);

    if (Settings::configuration.liburingDisabled)
    {
        MS_DEBUG_TAG(info, "liburing disabled by user settings");

        return;
    }

    // This must be called first.
    DepLibUring::CheckRuntimeSupport();

    if (DepLibUring::IsEnabled())
    {
        DepLibUring::liburing = new LibUring();

        MS_DEBUG_TAG(info, "liburing enabled");
    }
    else
    {
        MS_DEBUG_TAG(info, "liburing not enabled");
    }
}
ibc commented 3 months ago

Finally green. Merging!