SMP issue - Githubissues

jschwe commented 4 years ago

I don't think we have an issue for this yet, so I'm opening one now to track it. Currently there are problems running rusty-demo with multiple cores: sometimes the demo takes a long time to complete.

Jobs failed due to SMP timeout (May 20 - June 06):

Possibly related Job logs:

thread 'main' panicked at 'Trying to get the scheduler for core 1, but it isn't available', src/scheduler/mod.rs:530:9 for https://github.com/hermitcore/rusty-hermit/commit/64ec710f8a683eeb0a5a6d3baa591fad794e4fe5

stlankes commented 4 years ago

I found the race that shows up in your log and fixed it with 29452bb, but there is still a race in our code.

jschwe commented 3 years ago

I'm bumping this issue with an update. When running rusty-demo via `qemu-system-x86_64 -cpu qemu64,apic,fsgsbase,rdtscp,xsave,fxsr -display none -smp 1 -m 1G -serial stdio -kernel loader/target/x86_64-unknown-hermit-loader/debug/rusty-loader -initrd target/x86_64-unknown-hermit/debug/rusty_demo > log.log` with `-smp` values 1, 2, 3 and 4, and `HERMIT_LOG_LEVEL_FILTER=Debug`, the log files grow steadily with the core count. The main culprit is the laplace test (which uses rayon internally):

| # Cores | # lines | laplace time |
|---|---|---|
| 1 | 15k | 14.2 s |
| 2 | 37k | 18.3 s |
| 3 | 86k | 31.5 s |
| 4 | 135k | 51.0 s |

It looks like the tasks on the additional cpus get blocked very often. ~~I'm not exactly sure in which part of rusty-demo this is~~, but this looks like it could be some sort of a synchronization problem. log_single_core.log log_2_core.log log_3_core.log log_4_core.log
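The measurement above could be repeated with a small script along these lines. This is only a sketch: the function names `qemu_command`, `count_lines` and `run_sweep` and the per-run log file names are hypothetical, the paths are copied from the command above, and the script assumes the log level has already been configured as described in the comment (it does not handle `HERMIT_LOG_LEVEL_FILTER` itself).

```python
import subprocess

# Paths copied from the command in the comment above.
LOADER = "loader/target/x86_64-unknown-hermit-loader/debug/rusty-loader"
DEMO = "target/x86_64-unknown-hermit/debug/rusty_demo"


def qemu_command(smp):
    """Build the QEMU argument list from the comment for a given core count."""
    return [
        "qemu-system-x86_64",
        "-cpu", "qemu64,apic,fsgsbase,rdtscp,xsave,fxsr",
        "-display", "none",
        "-smp", str(smp),
        "-m", "1G",
        "-serial", "stdio",
        "-kernel", LOADER,
        "-initrd", DEMO,
    ]


def count_lines(path):
    """Count the lines of a log file without loading it into memory at once."""
    with open(path, "rb") as log:
        return sum(1 for _ in log)


def run_sweep(core_counts=(1, 2, 3, 4)):
    """Run the demo once per core count and return {cores: log line count}."""
    results = {}
    for smp in core_counts:
        log_path = f"log_smp_{smp}.log"  # hypothetical file name
        with open(log_path, "wb") as log:
            # check=False: a failing run should still leave its log for inspection.
            subprocess.run(qemu_command(smp), stdout=log, check=False)
        results[smp] = count_lines(log_path)
    return results
```

Calling `run_sweep()` from the repository root would write one log file per core count and return the line counts, which can then be compared with the table.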

Edit 1: Formatted the table and added the laplace time, since the output in the log files mostly originates from the laplace test.

Edit 2: Regarding the performance of the laplace test: on Linux in a VM it runs in 0.24 s (4 cores) to 0.84 s (1 core), so the test itself is okay. (Tested with `taskset` and `cargo run --target=x86_64-unknown-linux-gnu` in the `examples/demo` folder.)
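To make the contrast explicit, the timings can be normalized to the single-core run. The sketch below uses only the numbers quoted in this comment; the helper name `relative_runtime` is made up for illustration.

```python
# Laplace times in seconds, copied from the table and from Edit 2.
hermit_qemu = {1: 14.2, 2: 18.3, 3: 31.5, 4: 51.0}
linux_vm = {1: 0.84, 4: 0.24}


def relative_runtime(times, baseline_cores=1):
    """Return each runtime divided by the runtime at `baseline_cores`."""
    base = times[baseline_cores]
    return {cores: t / base for cores, t in times.items()}


hermit_rel = relative_runtime(hermit_qemu)
linux_rel = relative_runtime(linux_vm)

for cores, factor in hermit_rel.items():
    print(f"rusty-demo in QEMU, {cores} core(s): {factor:.2f}x the 1-core time")
for cores, factor in linux_rel.items():
    print(f"Linux VM, {cores} core(s): {factor:.2f}x the 1-core time")
```

With these numbers, four cores take about 3.59 times as long as one core under QEMU, while on Linux four cores need about 0.29 times the single-core time (a 3.5x speedup).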

jschwe commented 3 years ago

I just stumbled upon cross, which is maintained by the Tools team of the rust-embedded WG. In the Supported targets section they mention the following:

> Also, testing is very slow. cross test runs units tests sequentially because QEMU gets upset when you spawn multiple threads. This means that, if one of your unit tests spawns threads, then it's more likely to fail or, worst, never terminate.

Has anyone seen SMP-related issues when using uhyve (or running on bare metal)? If not, it might be worth investigating whether our issue is caused by QEMU.

stlankes commented 3 years ago

I have never seen the issue when we use KVM... Should we disable the SMP tests on GitHub? The GitLab pipeline is nearly working; it is based on KVM and supports SMP tests. Bors is able to trigger these tests.

stlankes commented 3 years ago

I moved all SMP tests to the GitLab pipeline. This pipeline also tests the SMP support with QEMU, but it uses KVM to accelerate the tests. We will see whether we still have an issue on this platform. The pipeline only runs if we use bors to test our kernel.

@jschwe Do you have an idea why your integration tests aren't working on this pipeline?
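The pipeline's actual QEMU invocation is not shown in this thread. As a rough sketch of the idea, a runner could add QEMU's `-enable-kvm` flag only when the KVM device is usable and otherwise fall back to plain emulation. The helper name `accel_args` is hypothetical.

```python
import os


def accel_args(kvm_device="/dev/kvm"):
    """Return extra QEMU flags: enable KVM only if the device is usable."""
    # QEMU needs read and write access to the KVM device node.
    if os.access(kvm_device, os.R_OK | os.W_OK):
        return ["-enable-kvm"]
    # No usable KVM device: run without hardware acceleration.
    return []
```

The returned list would be appended to the rest of the QEMU argument list, so the same runner works on hosts with and without KVM.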

jschwe commented 3 years ago

@stlankes The output from the log seems really strange to me. In hermit_test_runner.py, one of the first actions is to print the passed executable argument directly after parsing the args. This does not happen for the second test, and I see no reason why this should be the case, unless there is some bug in Python's argparse (which seems unlikely). The output worked fine for the first test (unit-tests), which was skipped as expected.

Another thing I noticed is that the total duration according to GitLab is 2h, while the single steps only add up to 1h, which I find strange. Could you maybe rerun the tests? We should probably move this into a separate issue though, since it's barely related.
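For readers without the script at hand, the behaviour described here is roughly the following. This is a hypothetical sketch, not the actual contents of hermit_test_runner.py; the function names and the printed message are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; `executable` is the path of the test binary."""
    parser = argparse.ArgumentParser(description="Hypothetical test runner sketch")
    parser.add_argument("executable", help="path to the test executable")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # Print the argument right after parsing, so every run leaves a trace in the log.
    print(f"Executable: {args.executable}", flush=True)
    return args.executable
```

`flush=True` keeps the line from sitting in a buffer when stdout is a pipe, as is typical in CI. Whether buffering played any role in the log discussed here is not known.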

stlankes commented 3 years ago

I think that we fixed this issue. Should we close it?

jschwe commented 3 years ago

Maybe we should add a note to the README before closing, saying that there are issues with multiple threads and QEMU.

hermit-os / kernel

SMP issue #78
