kokkos / kokkos-resilience

Resilience Extensions for Kokkos

CUDA support issues on Power9+V100 systems #10

Open nphtan opened 3 years ago

nphtan commented 3 years ago

I'm running into issues building with CUDA support on Power9. The platform is a dual socket Power9 node with 32 cores and 2 V100 GPUs per node. Building with CUDA support has 2 issues I've seen so far. The first is a simple mistake in ResCudaSpace.hpp(273) that generates a bunch of syntax errors.

...
[ 25%] Building CXX object CMakeFiles/resilience.dir/src/resilience/cuda/ResCuda.cpp.o
nvcc_wrapper has been given GNU extension standard flag -std=gnu++14 - reverting flag to -std=c++14
/home/ntan1/KokkosResilience/kokkos-resilience/src/resilience/cuda/ResCudaSpace.hpp(273): error: enable_if is not a template

The fix is to qualify both the enable_if and is_same type traits with std:: on line 273.
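To illustrate the kind of change described above, here is a minimal sketch of a partial specialization guarded by std::enable_if and std::is_same. It is not the code from ResCudaSpace.hpp: the names CanAccess, HostSpaceTag and ResCudaSpaceTag are hypothetical stand-ins, and the real line 273 may look different.

```cpp
#include <type_traits>

// Hypothetical stand-ins for memory spaces; these are not the real Kokkos types.
struct HostSpaceTag {};
struct ResCudaSpaceTag {};

// Primary template: access is denied by default.
template <class ExecSpace, class MemSpace, class Enable = void>
struct CanAccess {
  static constexpr bool value = false;
};

// Partial specialization, enabled only when both spaces are the same type.
// Both traits are spelled with std::. An unqualified enable_if with no
// using-declaration in scope is not found by name lookup, which is the
// sort of mistake that produces "enable_if is not a template".
template <class ExecSpace, class MemSpace>
struct CanAccess<ExecSpace, MemSpace,
                 typename std::enable_if<
                     std::is_same<ExecSpace, MemSpace>::value>::type> {
  static constexpr bool value = true;
};

// Compile-time checks of the sketch itself.
static_assert(CanAccess<ResCudaSpaceTag, ResCudaSpaceTag>::value,
              "same space is accessible");
static_assert(!CanAccess<ResCudaSpaceTag, HostSpaceTag>::value,
              "different space is rejected");
```

The block compiles as C++14, the standard that nvcc_wrapper falls back to in the log above.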

The second error comes further along when building the tests.

[ 50%] Building CXX object tests/CMakeFiles/resilience_tests.dir/TestResilience.cpp.o
/home/ntan1/KokkosResilience/kokkos/build/install/include/impl/Kokkos_Profiling_Interface.hpp(79): error: incomplete type is not allowed
detected during:
instantiation of "uint32_t Kokkos::Profiling::Experimental::device_id(const ExecutionSpace &) [with ExecutionSpace=KokkosResilience::ResCuda]"
/home/ntan1/KokkosResilience/kokkos/build/install/include/Kokkos_Parallel.hpp(171): here
instantiation of "void Kokkos::parallel_for(const ExecPolicy &, const FunctorType &, const std::cxx11::string &, std::enable_if<Kokkos::is_execution_policy::value, void>::type *) [with ExecPolicy=Kokkos::RangePolicy, FunctorType=lambda ->void]"
/home/ntan1/KokkosResilience/kokkos-resilience/tests/TestResilience.cpp(93): here
instantiation of "void TestResilientRange<ExecSpace, ScheduleType, DataType>::test_for() [with ExecSpace=Kokkos::Serial, ScheduleType=Kokkos::Schedule, DataType=int]"
/home/ntan1/KokkosResilience/kokkos-resilience/tests/TestResilience.cpp(117): here
instantiation of "void TestResilience_range_Test::TestBody() [with gtestTypeParam=Kokkos::Serial]"
/home/ntan1/KokkosResilience/kokkos-resilience/build/_deps/googletest-src/googletest/include/gtest/internal/gtest-internal.h(470): here
implicit generation of "TestResilience_range_Test::~TestResilience_range_Test() [with gtestTypeParam=Kokkos::Serial]"
/home/ntan1/KokkosResilience/kokkos-resilience/build/_deps/googletest-src/googletest/include/gtest/internal/gtest-internal.h(470): here
[ 4 instantiation contexts not shown ]
implicit generation of "testing::internal::TestFactoryImpl::~TestFactoryImpl() [with TestClass=TestResilience_range_Test]"
/home/ntan1/KokkosResilience/kokkos-resilience/build/_deps/googletest-src/googletest/include/gtest/internal/gtest-internal.h(728): here
instantiation of class "testing::internal::TestFactoryImpl [with TestClass=TestResilience_range_Test]"
/home/ntan1/KokkosResilience/kokkos-resilience/build/_deps/googletest-src/googletest/include/gtest/internal/gtest-internal.h(728): here
implicit generation of "testing::internal::TestFactoryImpl::TestFactoryImpl() [with TestClass=TestResilience_range_Test]"
/home/ntan1/KokkosResilience/kokkos-resilience/build/_deps/googletest-src/googletest/include/gtest/internal/gtest-internal.h(728): here
instantiation of class "testing::internal::TestFactoryImpl [with TestClass=TestResilience_range_Test]"
/home/ntan1/KokkosResilience/kokkos-resilience/build/_deps/googletest-src/googletest/include/gtest/internal/gtest-internal.h(728): here
instantiation of "nv_bool testing::internal::TypeParameterizedTest<Fixture, TestSel, Types>::Register(const char , const testing::internal::CodeLocation &, const char , const char *, int, const std::vector<std::cxx11::string, std::allocator<std::cxx11::string>> &) [with Fixture=TestResilience, TestSel=testing::internal::TemplateSel, Types=gtest_type_paramsTestResilience]"
/home/ntan1/KokkosResilience/kokkos-resilience/tests/TestResilience.cpp(110): here

1 error detected in the compilation of "/tmp/tmpxft_00002401_00000000-6_TestResilience.cpp1.ii".
make[2]: [tests/CMakeFiles/resilience_tests.dir/TestResilience.cpp.o] Error 1
make[1]: [tests/CMakeFiles/resilience_tests.dir/all] Error 2
make: *** [all] Error 2

I'm not sure how to fix this.

nmm0 commented 3 years ago

TestResilience.cpp should currently be disabled, since it relies on code that is not implemented. Are there additional tests giving problems?

nphtan commented 3 years ago

There's a syntax bug in ResCudaSpace.hpp

diff --git a/src/resilience/cuda/ResCudaSpace.hpp b/src/resilience/cuda/ResCudaSpace.hpp
index 970151e..8fc3209 100644
--- a/src/resilience/cuda/ResCudaSpace.hpp
+++ b/src/resilience/cuda/ResCudaSpace.hpp
@@ -270,7 +270,7 @@ struct VerifyExecutionCanAccessMemorySpace< KokkosResilience::ResCudaSpace , Kok
 /* Running in CudaSpace attempting to access an unknown space: error */
 template< class OtherSpace >
 struct VerifyExecutionCanAccessMemorySpace<

With TestResilience.cpp removed and the syntax error fixed, the build fails while compiling TestVelocMemoryBackend.cpp with the following errors.

/home/ntan1/KokkosResilience/kokkos-resilience/src/resilience/util/Trace.hpp(288): error: expression must have class type
detected during:
instantiation of "auto KokkosResilience::Util::begin_trace<TraceType,Context,Args...>(Context &, Args &&...) [with TraceType=KokkosResilience::Util::TimingTrace, Context=const char [9], Args=<>]"
/home/ntan1/KokkosResilience/kokkos-resilience/src/resilience/AutomaticCheckpoint.hpp(132): here
instantiation of "void KokkosResilience::checkpoint(Context &, const std::cxx11::string &, int, F &&) [with Context=KokkosResilience::MPIContext, F=lambda []()->void]"
/home/ntan1/KokkosResilience/kokkos-resilience/tests/TestVelocMemoryBackend.cpp(55): here
instantiation of "void TestVelocMemoryBackend::test_layout<Layout,Context>(Context &, std::size_t, std::size_t) [with ExecSpace=Kokkos::Serial, Layout=Kokkos::LayoutRight, Context=KokkosResilience::MPIContext]"
/home/ntan1/KokkosResilience/kokkos-resilience/tests/TestVelocMemoryBackend.cpp(112): here
instantiation of "void TestVelocMemoryBackend_veloc_mem_Test::TestBody() [with gtestTypeParam=Kokkos::Serial]"
/home/ntan1/KokkosResilience/kokkos-resilience/build/_deps/googletest-src/googletest/include/gtest/internal/gtest-internal.h(470): here
implicit generation of "TestVelocMemoryBackend_veloc_mem_Test::~TestVelocMemoryBackend_veloc_mem_Test() [with gtestTypeParam=Kokkos::Serial]"
/home/ntan1/KokkosResilience/kokkos-resilience/build/_deps/googletest-src/googletest/include/gtest/internal/gtest-internal.h(470): here
[ 4 instantiation contexts not shown ]
implicit generation of "testing::internal::TestFactoryImpl::~TestFactoryImpl() [with TestClass=TestVelocMemoryBackend_veloc_mem_Test]"
/home/ntan1/KokkosResilience/kokkos-resilience/build/_deps/googletest-src/googletest/include/gtest/internal/gtest-internal.h(728): here
instantiation of class "testing::internal::TestFactoryImpl [with TestClass=TestVelocMemoryBackend_veloc_mem_Test]"
/home/ntan1/KokkosResilience/kokkos-resilience/build/_deps/googletest-src/googletest/include/gtest/internal/gtest-internal.h(728): here
implicit generation of "testing::internal::TestFactoryImpl::TestFactoryImpl() [with TestClass=TestVelocMemoryBackend_veloc_mem_Test]"
/home/ntan1/KokkosResilience/kokkos-resilience/build/_deps/googletest-src/googletest/include/gtest/internal/gtest-internal.h(728): here
instantiation of class "testing::internal::TestFactoryImpl [with TestClass=TestVelocMemoryBackend_veloc_mem_Test]"
/home/ntan1/KokkosResilience/kokkos-resilience/build/_deps/googletest-src/googletest/include/gtest/internal/gtest-internal.h(728): here
instantiation of "nv_bool testing::internal::TypeParameterizedTest<Fixture, TestSel, Types>::Register(const char , const testing::internal::CodeLocation &, const char , const char *, int, const std::vector<std::cxx11::string, std::allocator<std::cxx11::string>> &) [with Fixture=TestVelocMemoryBackend, TestSel=testing::internal::TemplateSel, Types=gtest_type_paramsTestVelocMemoryBackend]"
/home/ntan1/KokkosResilience/kokkos-resilience/tests/TestVelocMemoryBackend.cpp(97): here

/home/ntan1/KokkosResilience/kokkos-resilience/src/resilience/util/Trace.hpp(290): error: expression must have class type
detected during:
instantiation of "auto KokkosResilience::Util::begin_trace<TraceType,Context,Args...>(Context &, Args &&...) [with TraceType=KokkosResilience::Util::TimingTrace, Context=const char [9], Args=<>]"
/home/ntan1/KokkosResilience/kokkos-resilience/src/resilience/AutomaticCheckpoint.hpp(132): here
instantiation of "void KokkosResilience::checkpoint(Context &, const std::cxx11::string &, int, F &&) [with Context=KokkosResilience::MPIContext, F=lambda []()->void]"
/home/ntan1/KokkosResilience/kokkos-resilience/tests/TestVelocMemoryBackend.cpp(55): here
instantiation of "void TestVelocMemoryBackend::test_layout<Layout,Context>(Context &, std::size_t, std::size_t) [with ExecSpace=Kokkos::Serial, Layout=Kokkos::LayoutRight, Context=KokkosResilience::MPIContext]"
/home/ntan1/KokkosResilience/kokkos-resilience/tests/TestVelocMemoryBackend.cpp(112): here
instantiation of "void TestVelocMemoryBackend_veloc_mem_Test::TestBody() [with gtestTypeParam=Kokkos::Serial]"
/home/ntan1/KokkosResilience/kokkos-resilience/build/_deps/googletest-src/googletest/include/gtest/internal/gtest-internal.h(470): here
implicit generation of "TestVelocMemoryBackend_veloc_mem_Test::~TestVelocMemoryBackend_veloc_mem_Test() [with gtestTypeParam=Kokkos::Serial]"
/home/ntan1/KokkosResilience/kokkos-resilience/build/_deps/googletest-src/googletest/include/gtest/internal/gtest-internal.h(470): here
[ 4 instantiation contexts not shown ]
implicit generation of "testing::internal::TestFactoryImpl::~TestFactoryImpl() [with TestClass=TestVelocMemoryBackend_veloc_mem_Test]"
/home/ntan1/KokkosResilience/kokkos-resilience/build/_deps/googletest-src/googletest/include/gtest/internal/gtest-internal.h(728): here
instantiation of class "testing::internal::TestFactoryImpl [with TestClass=TestVelocMemoryBackend_veloc_mem_Test]"
/home/ntan1/KokkosResilience/kokkos-resilience/build/_deps/googletest-src/googletest/include/gtest/internal/gtest-internal.h(728): here
implicit generation of "testing::internal::TestFactoryImpl::TestFactoryImpl() [with TestClass=TestVelocMemoryBackend_veloc_mem_Test]"
/home/ntan1/KokkosResilience/kokkos-resilience/build/_deps/googletest-src/googletest/include/gtest/internal/gtest-internal.h(728): here
instantiation of class "testing::internal::TestFactoryImpl [with TestClass=TestVelocMemoryBackend_veloc_mem_Test]"
/home/ntan1/KokkosResilience/kokkos-resilience/build/_deps/googletest-src/googletest/include/gtest/internal/gtest-internal.h(728): here
instantiation of "nv_bool testing::internal::TypeParameterizedTest<Fixture, TestSel, Types>::Register(const char , const testing::internal::CodeLocation &, const char , const char *, int, const std::vector<std::cxx11::string, std::allocator<std::cxx11::string>> &) [with Fixture=TestVelocMemoryBackend, TestSel=testing::internal::TemplateSel, Types=gtest_type_paramsTestVelocMemoryBackend]"
/home/ntan1/KokkosResilience/kokkos-resilience/tests/TestVelocMemoryBackend.cpp(97): here

2 errors detected in the compilation of "/tmp/tmpxft_00010e8b_00000000-6_TestVelocMemoryBackend.cpp1.ii".
make[2]: [tests/CMakeFiles/resilience_tests.dir/TestVelocMemoryBackend.cpp.o] Error 1
make[1]: [tests/CMakeFiles/resilience_tests.dir/all] Error 2
make: *** [all] Error 2
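One possible reading of the Trace.hpp errors, based only on the log: both instantiations report Context=const char [9] with an empty Args pack. That is what template argument deduction yields when a string literal lands in the leading Context & parameter, and member access on an array type would then fail with "expression must have class type". The sketch below only demonstrates that deduction. context_is_class, FakeContext and the literal "abcdefgh" are hypothetical, and the actual call at AutomaticCheckpoint.hpp(132) is not shown in this thread.

```cpp
#include <type_traits>

// Hypothetical probe with the same parameter shape as the begin_trace
// signature quoted in the log: a leading Context& followed by a pack.
template <class Context, class... Args>
constexpr bool context_is_class(Context&, Args&&...) {
  return std::is_class<Context>::value;
}

// Hypothetical stand-in for a real context object such as MPIContext.
struct FakeContext {};
constexpr FakeContext fake_ctx{};

// An 8-character string literal is an lvalue of type const char[9]
// (8 characters plus the terminating NUL).
static_assert(std::is_same<decltype("abcdefgh"), const char (&)[9]>::value,
              "literal has type const char[9]");

// Passing the literal first deduces Context = const char[9], which is not
// a class type, so an expression like ctx.member() could not compile.
static_assert(!context_is_class("abcdefgh"),
              "array type is not a class");

// Passing a context object first deduces a class type as expected.
static_assert(context_is_class(fake_ctx, "abcdefgh"),
              "context object is a class");
```

If this reading is right, checking whether the call passes the context object ahead of the trace name would be a reasonable first step. This is an assumption and not a confirmed diagnosis.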