intel / llvm

Intel staging area for llvm.org contribution. Home for Intel LLVM-based projects.
Other
1.23k stars 734 forks source link

Catch2 segfault when running multiple tests #6720

Closed TApplencourt closed 1 year ago

TApplencourt commented 2 years ago

Describe the bug

Where using Catch2 and multiple tests, a segfault occurs.

To Reproduce

$ wget https://raw.githubusercontent.com/catchorg/Catch2/devel/extras/catch_amalgamated.{h,c}pp
$ cat test.cpp
#include "catch_amalgamated.hpp"
#include <sycl/sycl.hpp>

int foo() {
  sycl::queue Q;
  auto *b = sycl::malloc_shared<int>(1, Q);
  Q.single_task([=]() { b[0] = 1; }).wait();
  return b[0];
}

TEST_CASE("Test0") {
  CHECK(foo() == 1);
}

TEST_CASE("Test1") {
  CHECK(foo() == 1);
}
# `-fno-finite-math-only` is not important, just to avoid some warning
$ clang++  -fsycl -fno-finite-math-only test.cpp catch_amalgamated.cpp
$./a.out
Randomness seeded to: 2960337635

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
a.out is a Catch2 v3.1.0 host application.
Run with -? for options

-------------------------------------------------------------------------------
Test1
-------------------------------------------------------------------------------
test.cpp:32
...............................................................................

test.cpp:32: FAILED:
due to a fatal error condition:
  SIGSEGV - Segmentation violation signal

===============================================================================
test cases: 2 | 1 passed | 1 failed
assertions: 2 | 1 passed | 1 failed

Segmentation fault

Environment (please complete the following information):

Additional context

Running the test independently work

$ /a.out test0
Filters: test0
Randomness seeded to: 220773856
===============================================================================
All tests passed (1 assertion in 1 test case)

$./a.out test1
Filters: test1
Randomness seeded to: 2220531901
===============================================================================
All tests passed (1 assertion in 1 test case)

This is the gdb backtrace

#0  0x00000000004076a1 in foo() ()
#1  0x00000000004078c8 in CATCH2_INTERNAL_TEST_0() ()
#2  0x0000000000435950 in Catch::RunContext::invokeActiveTestCase() ()
#3  0x00000000004331dd in Catch::RunContext::runCurrentTest(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) ()
#4  0x0000000000431981 in Catch::RunContext::runTest(Catch::TestCaseHandle const&) ()
#5  0x000000000042039f in Catch::(anonymous namespace)::TestGroup::execute() ()
#6  0x000000000041d0f0 in Catch::Session::runInternal() ()
#7  0x000000000041c8a1 in Catch::Session::run() ()
#8  0x000000000042e341 in main ()

Commenting out the kernel submission, or the sycl::malloc, makes the segfault disappear.

AlexeySachkov commented 1 year ago

Hi @TApplencourt,

Have you tried using a newer DPC++ compiler version or a newer IGC/NEO runtime?

I'm using fresher setup of a recent clang++ (clang version 16.0.0 (https://github.com/intel/llvm.git 5d5e9f4e4096de617d29532767c7805dbded2ecd)) and in combination with NEO 22.35.24055 (L0 1.3.24055) I cannot reproduce the crash even in 1000 runs.

TApplencourt commented 1 year ago

First, thanks for having a look :). Yes, I still see the segfault.

$ icpx -fsycl -fno-finite-math-only test.cpp catch_amalgamated.cpp
$ ./a.out
Randomness seeded to: 3626097925

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
a.out is a Catch2 v3.2.0 host application.
Run with -? for options

-------------------------------------------------------------------------------
Test1
-------------------------------------------------------------------------------
test.cpp:15
...............................................................................

test.cpp:15: FAILED:
due to a fatal error condition:
  SIGSEGV - Segmentation violation signal

===============================================================================
test cases: 2 | 1 passed | 1 failed
assertions: 2 | 1 passed | 1 failed

Segmentation fault (core dumped)

So I think it's a problem only with icpx and not with intel/llvm clang++. Can you try with icpx?

jandres742 commented 1 year ago

So the issue is here is that the application is using a signal handler and is using malloc_shared allocations. however, L0 GPU driver, uses a signal handler to handle implicit migrations of shared allocation. This is what is called UMD migration, and is used to implement implicit migrations of shared-allocation on HW w/o page-fault support.

Please see this documented in https://www.intel.com/content/www/us/en/developer/articles/release-notes/gdb-release-notes.html

Applications that use unified shared memory (USM) may appear as raising a SIGSEGV when a USM-allocated memory is being accessed. This is a mechanism used by the runtime to trigger memory migration. In such cases, sending the signal back to the application resumes the program. For this, use GDB's signal SIGSEGV command.

So this is expected behavior at the moment.

TApplencourt commented 1 year ago

Ok as crazy as I sound, I confirm your finding! The following code only segfault with malloc_shared and not when using malloc_host or malloc_device.

#include "catch_amalgamated.hpp"
#include <sycl/sycl.hpp>

int foo() {
  sycl::queue Q;
#if defined(HOST)
  auto *b = sycl::malloc_host<int>(1, Q);
#elif defined(DEVICE)
  auto *b = sycl::malloc_device<int>(1, Q);
#else
  auto *b = sycl::malloc_shared<int>(1, Q);
#endif
  Q.single_task([=]() { b[0] = 1; }).wait();

# if defined(DEVICE)
  int b2;
  Q.copy<int>(b, &b2, 1).wait();
  return b2;
#else
  return b[0];
#endif
}

TEST_CASE("Test0") {
  CHECK(foo() == 1);
}

TEST_CASE("Test1") {
  CHECK(foo() == 1);
}

Thanks a lot! I have so many follow-up questions... Why didn't we see the segfault when using clang++ and not icpx? Pure luck? :)

But most importantly, catch2 is the de facto standard testing framework for C++, and many people use USM with the L0 backend on  Intel® Iris® Xe Graphics. I think the current behavior will make a lot of people hum uncomfortable.

It looks like it's not a specific SYCL issue, so I think we can close this ticket. I'll talk with you offline about possible follow-up.

Thanks a lot!

aelovikov-intel commented 1 year ago

Why didn't we see the segfault when using clang++ and not icpx? Pure luck?

I was seeing this issue using clang++ built from this project.

mirenradia commented 9 months ago

If you want to use malloc_shared with Catch2 on a SYCL backend that uses segfaults to trigger memory migrations, you can disable Catch2's signal handling by defining CATCH_CONFIG_NO_POSIX_SIGNALS (either the CMake option or the preprocessor macro depending on which build system you are using). See the Catch2 docs page on configuration.

TApplencourt commented 9 months ago

Perfect timing! Thanks for the update and the workaround ! :) Indeed looking at my testing the bug have been fixed with some newer toolchain 🎉