Closed TApplencourt closed 1 year ago
Hi @TApplencourt,
Have you tried using a newer DPC++ compiler version or a newer IGC/NEO runtime?
I'm using fresher setup of a recent clang++
(clang version 16.0.0 (https://github.com/intel/llvm.git 5d5e9f4e4096de617d29532767c7805dbded2ecd)) and in combination with NEO 22.35.24055
(L0 1.3.24055
) I cannot reproduce the crash even in 1000 runs.
First, thanks for having a look :). Yes, I still see the segfault.
$ icpx -fsycl -fno-finite-math-only test.cpp catch_amalgamated.cpp
$ ./a.out
Randomness seeded to: 3626097925
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
a.out is a Catch2 v3.2.0 host application.
Run with -? for options
-------------------------------------------------------------------------------
Test1
-------------------------------------------------------------------------------
test.cpp:15
...............................................................................
test.cpp:15: FAILED:
due to a fatal error condition:
SIGSEGV - Segmentation violation signal
===============================================================================
test cases: 2 | 1 passed | 1 failed
assertions: 2 | 1 passed | 1 failed
Segmentation fault (core dumped)
So I think it's a problem only with icpx
and not with intel/llvm
clang++. Can you try with icpx
?
So the issue is here is that the application is using a signal handler and is using malloc_shared allocations. however, L0 GPU driver, uses a signal handler to handle implicit migrations of shared allocation. This is what is called UMD migration, and is used to implement implicit migrations of shared-allocation on HW w/o page-fault support.
Please see this documented in https://www.intel.com/content/www/us/en/developer/articles/release-notes/gdb-release-notes.html
Applications that use unified shared memory (USM) may appear as raising a SIGSEGV when a USM-allocated memory is being accessed. This is a mechanism used by the runtime to trigger memory migration. In such cases, sending the signal back to the application resumes the program. For this, use GDB's signal SIGSEGV command.
So this is expected behavior at the moment.
Ok as crazy as I sound, I confirm your finding! The following code only segfault with malloc_shared
and not when using malloc_host
or malloc_device
.
#include "catch_amalgamated.hpp"
#include <sycl/sycl.hpp>
int foo() {
sycl::queue Q;
#if defined(HOST)
auto *b = sycl::malloc_host<int>(1, Q);
#elif defined(DEVICE)
auto *b = sycl::malloc_device<int>(1, Q);
#else
auto *b = sycl::malloc_shared<int>(1, Q);
#endif
Q.single_task([=]() { b[0] = 1; }).wait();
# if defined(DEVICE)
int b2;
Q.copy<int>(b, &b2, 1).wait();
return b2;
#else
return b[0];
#endif
}
TEST_CASE("Test0") {
CHECK(foo() == 1);
}
TEST_CASE("Test1") {
CHECK(foo() == 1);
}
Thanks a lot! I have so many follow-up questions... Why didn't we see the segfault when using clang++ and not icpx? Pure luck? :)
But most importantly, catch2 is the de facto standard testing framework for C++, and many people use USM with the L0 backend on Intel® Iris® Xe Graphics. I think the current behavior will make a lot of people hum uncomfortable.
It looks like it's not a specific SYCL issue, so I think we can close this ticket. I'll talk with you offline about possible follow-up.
Thanks a lot!
Why didn't we see the segfault when using clang++ and not icpx? Pure luck?
I was seeing this issue using clang++ built from this project.
If you want to use malloc_shared
with Catch2 on a SYCL backend that uses segfaults to trigger memory migrations, you can disable Catch2's signal handling by defining CATCH_CONFIG_NO_POSIX_SIGNALS
(either the CMake option or the preprocessor macro depending on which build system you are using). See the Catch2 docs page on configuration.
Perfect timing! Thanks for the update and the workaround ! :) Indeed looking at my testing the bug have been fixed with some newer toolchain 🎉
Describe the bug
Where using
Catch2
and multiple tests, a segfault occurs.To Reproduce
Environment (please complete the following information):
Additional context
Running the test independently work
This is the gdb backtrace
Commenting out the kernel submission, or the
sycl::malloc,
makes the segfault disappear.