Build Kokkos https://github.com/nmm0/kokkos/tree/accessor-hooks with GCC 7.5. (ALL: please include build instructions when reporting an error!)
[knteran@klogin3 BUILD_KOKKOS]$ git clone git@github.com:nmm0/kokkos.git
[knteran@klogin3 BUILD_KOKKOS]$ cd kokkos
[knteran@klogin3 kokkos]$ git checkout accessor-hooks
[knteran@klogin3 kokkos]$ cd ..
knteran@klogin3 BUILD_KOKKOS]$ cmake ../kokkos/ -DKokkos_ENABLE_OPENMP=ON -DKokkos_ARCH_HSW=ON -DCMAKE_INSTALL_PREFIX=/home/knteran/Kokkos_Haswell_75
-- Setting default Kokkos CXX standard to 11
-- Setting policy CMP0074 to use <Package>_ROOT variables
-- The project name is: Kokkos
-- Using -std=gnu++11 for C++11 extensions as feature
-- Execution Spaces:
-- Device Parallel: NONE
-- Host Parallel: OPENMP
-- Host Serial: NONE
--
-- Architectures:
-- Configuring done
-- Generating done
-- Build files have been written to: /home/knteran/ASC/BUILD_KOKKOS
[knteran@klogin3 BUILD_KOKKOS]$ make -j 8
:
:
[ 90%] Built target kokkoscore
[100%] Built target kokkoscontainers
[knteran@klogin3 BUILD_KOKKOS]$ make install
Build the heat-dist code (non-resilient):
Add set(CMAKE_PREFIX_PATH /home/knteran/Kokkos_Haswell_75) to veloc_heat_dist/CMakeLists.txt.
Comment out the heat_dist_resil parts of CMakeLists.txt.
I successfully built the non-resilient version and it runs!
However, I am failing to build kokkos-resilience (resilient-execution-space branch) with Nic's Kokkos. A large fraction of Jeff's source is outdated.
I modified the CMakeLists.txt files under the src/resilience, tests, and examples directories to disable all of Jeff's manual checkpointing code, tests, and examples.
@ElisabethGiem, can I update the CMakeLists.txt files to disable all of Jeff's stuff?
[knteran@klogin3 ]$ cd kokkos-resilience
[knteran@klogin3 ]$ mkdir BUILD
[knteran@klogin3 ]$ cd BUILD
[knteran@klogin3 BUILD]$ cmake -DCMAKE_BUILD_TYPE=Release -DKokkos_ROOT=/home/knteran/Kokkos_Haswell_75/ -DCMAKE_INSTALL_PREFIX=/home/knteran/resilience_Haswell_75 -DKR_ENABLE_MPI_BACKENDS=OFF -DKR_ENABLE_STDIO=OFF ..
Building Liz's resilient execution space.
Changing the CMakeLists.txt of veloc_heat_dist:
cmake_minimum_required(VERSION 3.19)
project(heatdis)
set(CMAKE_MODULE_PATH "${CMAKE_CURRENT_SOURCE_DIR}/cmake/Modules")
set(CMAKE_PREFIX_PATH "/home/knteran/Kokkos_Haswell_75;/home/knteran/resilience_Haswell_75")
add_executable(heatdis)
add_executable(heatdis_resil)
add_subdirectory(src)
find_package(Kokkos REQUIRED)
find_package(resilience REQUIRED)
target_link_libraries(heatdis PRIVATE Kokkos::kokkos)
target_link_libraries(heatdis_resil PRIVATE Kokkos::resilience Kokkos::kokkos)
#target_compile_definitions(heatdis_resil PRIVATE USE_RESILIENT_EXEC)
add_subdirectory(tpl)
# Install rules
include(GNUInstallDirs)
install(TARGETS heatdis heatdis_resil)
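The commented-out target_compile_definitions line above hints at a compile-time switch. As a rough sketch (my guess at how such a switch could be wired up, not the actual heat-dist source; the include paths and type names follow the example program later in this thread), heatdis_resil could pick the execution space based on USE_RESILIENT_EXEC like this:

// Sketch only: selecting the execution space at compile time via the
// USE_RESILIENT_EXEC definition set from CMake.
#include <Kokkos_Core.hpp>
#include <iostream>

#ifdef USE_RESILIENT_EXEC
  #include <resilience/Resilience.hpp>
  #include <resilience/openMP/ResHostSpace.hpp>
  #include <resilience/openMP/ResOpenMP.hpp>
  using ExecSpace = KokkosResilience::ResOpenMP;   // resilient OpenMP space
#else
  using ExecSpace = Kokkos::OpenMP;                // plain OpenMP space
#endif

// Kernels would then be written once against this policy alias.
using range_policy = Kokkos::RangePolicy<ExecSpace>;

int main(int argc, char** argv) {
  Kokkos::initialize(argc, argv);
#ifdef USE_RESILIENT_EXEC
  std::cout << "Built with the resilient OpenMP execution space\n";
#else
  std::cout << "Built with the plain OpenMP execution space\n";
#endif
  Kokkos::finalize();
  return 0;
}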
The gtest program in the kokkos-resilience repo passes! However, there are no example programs. I strongly recommend adding one (I can do that). I am currently working to add the resilient execution space to heat-dist. @ElisabethGiem, do you have your version of heat-dist with the resilient execution space?
Now I can confirm that the program hangs.
[knteran@klogin2 BUILD]$ ./heatdis
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
For unit testing set OMP_PROC_BIND=false
Local data size is 2560 x 2563 = 100.000000 MB (100).
Target precision : 0.000010
Maximum number of iterations : 600
i: 0 --- v: 0
Step : 0, error = 1.000000
Step : 50, error = 0.484743
Step : 100, error = 0.242139
Step : 150, error = 0.161172
Step : 200, error = 0.121036
Step : 250, error = 0.096793
Step : 300, error = 0.080644
Step : 350, error = 0.069129
Step : 400, error = 0.060499
Step : 450, error = 0.053781
Step : 500, error = 0.048396
Step : 550, error = 0.043974
Execution finished in 12.992989 seconds.
[knteran@klogin2 BUILD]$ ./heatdis_resil
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
For unit testing set OMP_PROC_BIND=false
Local data size is 2560 x 2563 = 100.000000 MB (100).
Target precision : 0.000010
Maximum number of iterations : 600
i: 0 --- v: 0
Step : 0, error = 1.000000
--- wait, it is making progress, but extremely slowly... Something weird is happening.
In the heat-dist program, I modified the code to use the regular range policy everywhere except for the parallel_for that copies the data, h(i) = g(i);. Interestingly, this part becomes extremely slow. I do not know why this is happening...
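For reference, here is a minimal sketch of the change described above (my own reconstruction, not the actual heat-dist source; the view names, sizes, and kernel labels are made up): the compute kernel uses the regular OpenMP range policy, while only the copy loop h(i) = g(i) stays on the resilient policy.

#include <Kokkos_Core.hpp>
#include <resilience/Resilience.hpp>
#include <resilience/openMP/ResHostSpace.hpp>
#include <resilience/openMP/ResOpenMP.hpp>

using DupView = Kokkos::View<double*, KokkosResilience::ResHostSpace,
    Kokkos::Experimental::SubscribableViewHooks<
        KokkosResilience::ResilientDuplicatesSubscriber>>;
using RegularPolicy   = Kokkos::RangePolicy<Kokkos::OpenMP>;
using ResilientPolicy = Kokkos::RangePolicy<KokkosResilience::ResOpenMP>;

int main(int argc, char** argv) {
  Kokkos::initialize(argc, argv);
  {
    const int n = 1 << 20;
    DupView g("g", n), h("h", n);

    // Compute loop moved back to the regular OpenMP range policy.
    Kokkos::parallel_for("stencil", RegularPolicy(1, n - 1),
        KOKKOS_LAMBDA(const int i) { g(i) = 0.5 * (h(i - 1) + h(i + 1)); });

    // Only the data-copy loop keeps the resilient policy; this is the part
    // reported as extremely slow.
    Kokkos::parallel_for("copy", ResilientPolicy(0, n),
        KOKKOS_LAMBDA(const int i) { h(i) = g(i); });
  }
  Kokkos::finalize();
  return 0;
}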
@ElisabethGiem, do you expect the resilient parallel_for (for OpenMP) to call exec_range() in the original Kokkos instead of the one defined in the resilient parallel_for?
I observed the resilient parallel_for execute the non-resilient parallel_for 3 times at the beginning when the program is linked outside the kokkos-resilience repository. @nmm0, I need your help. Is something messed up with a namespace?
When the heat_dist code sees the resilient parallel_for, it goes to OpenMPResParallel.hpp. Interestingly, the creation of m_functor_0, m_functor_1, and m_functor_2 triggers the ParallelFor of the non-resilient execution space. I have no idea why!! This does not happen when executed inside gtest. (@ElisabethGiem, I see why you are very confident in the test.)
KokkosResilience::ResilientDuplicatesSubscriber::in_resilient_parallel_loop = true;
auto m_functor_0 = m_functor;
auto m_functor_1 = m_functor;
auto m_functor_2 = m_functor;
KokkosResilience::ResilientDuplicatesSubscriber::in_resilient_parallel_loop = false;
Why does the copy constructor trigger an execution of ParallelFor?
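To illustrate what I think is happening (a hedged sketch of the mechanism, not library code; CopyFunctor and the view size are made up): each of the three functor copies above copy-constructs every captured View, and for a View declared with SubscribableViewHooks<ResilientDuplicatesSubscriber> that copy runs the subscriber while in_resilient_parallel_loop is set, so the underlying data gets duplicated as well.

#include <Kokkos_Core.hpp>
#include <resilience/Resilience.hpp>
#include <resilience/openMP/ResHostSpace.hpp>
#include <resilience/openMP/ResOpenMP.hpp>

// Hypothetical functor standing in for the one the resilient ParallelFor copies.
using DupView = Kokkos::View<double*, KokkosResilience::ResHostSpace,
    Kokkos::Experimental::SubscribableViewHooks<
        KokkosResilience::ResilientDuplicatesSubscriber>>;

struct CopyFunctor {
  DupView view;
  KOKKOS_FUNCTION void operator()(const int i) const { view(i) = i; }
};

int main(int argc, char** argv) {
  Kokkos::initialize(argc, argv);
  {
    CopyFunctor f{DupView("data", 100000)};

    // Mimic what OpenMPResParallel.hpp does around the functor copies: with the
    // flag set, each View copy below should invoke the duplicating subscriber,
    // and copying 100000 doubles is itself a (possibly parallel) data copy.
    KokkosResilience::ResilientDuplicatesSubscriber::in_resilient_parallel_loop = true;
    CopyFunctor f0 = f;  // duplicate #1
    CopyFunctor f1 = f;  // duplicate #2
    CopyFunctor f2 = f;  // duplicate #3
    KokkosResilience::ResilientDuplicatesSubscriber::in_resilient_parallel_loop = false;

    (void)f0; (void)f1; (void)f2;
  }
  Kokkos::finalize();
  return 0;
}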
I added an example program under the example directory.
#include <Kokkos_Core.hpp>
#include <iostream>
#include <resilience/Resilience.hpp>
#include <resilience/openMP/ResHostSpace.hpp>
#include <resilience/openMP/ResOpenMP.hpp>

#define MemSpace KokkosResilience::ResHostSpace
#define ExecSpace KokkosResilience::ResOpenMP

int main( int argc, char **argv )
{
  Kokkos::initialize( argc, argv );
  {
    // range policy with the resilient execution space
    using range_policy = Kokkos::RangePolicy<ExecSpace>;

    // test vector type with the duplicating subscriber
    using subscriber_vector_double_type = Kokkos::View< double*, MemSpace,
        Kokkos::Experimental::SubscribableViewHooks<
            KokkosResilience::ResilientDuplicatesSubscriber > >;

    int dim0 = 100;
    subscriber_vector_double_type view( "test_view", dim0 );

    Kokkos::parallel_for( range_policy( 0, dim0 ), KOKKOS_LAMBDA( const int i ) {
      view( i ) = i;
    });

    // Data is in host space, so it is OK to access it with regular loops.
    for ( int i = 0; i < dim0; i++ ) {
      std::cout << "view(" << i << ") = " << view( i ) << std::endl;
    }
  }
  Kokkos::finalize();
  return 0;
}
This is a line I added to the CMakeLists.txt in the same directory to create an executable under my build/example directory:
add_example(simple_res_openmp SOURCES SimpleResOpenMP.cpp)
Very strange... I do not see any strange behavior... I suspect something weird is happening at install time of kokkos-resilience or at build time of heat-dist. Let me write heat-dist inside the kokkos-resilience repo.
I think I found a bug. It is related to the duplication of the views. I noticed that the resilient parallel_for calls the non-resilient parallel_for when copying the functor. It seems the size of the views triggers this problem. For small views, it calls the non-resilient parallel_for once to verify the computation at the end. However, the program gets messed up with large views. @ElisabethGiem, is this how you duplicate large views? If so, that's OK. However, I found it slows down the program dramatically...
Here is my code (range_policy is set to ResOpenMP):
for ( int dim0 = 10000; dim0 < 12000; dim0++ ) {
  std::cout << "view_size " << dim0 << std::endl;
  subscriber_vector_double_type view( "test_view", dim0 );
  Kokkos::parallel_for( range_policy( 0, dim0 ), KOKKOS_LAMBDA( const int i ) {
    view( i ) = i + 8;
  });
}
The output looks like this (I put a print statement at the beginning of execute() in the parallel_for):
view_size 10238
PARALLEL FOR with Normal Range Policy
view_size 10239
PARALLEL FOR with Normal Range Policy
view_size 10240
PARALLEL FOR with Normal Range Policy
PARALLEL FOR with Normal Range Policy
PARALLEL FOR with Normal Range Policy
PARALLEL FOR with Normal Range Policy
It seems this is how Kokkos executes deep_copy for OpenMP. For data smaller than 10240 elements it does a sequential copy. However, it executes a parallel_for for larger views...
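To check whether the threshold really comes from the plain Kokkos deep_copy path (and not from the resilience layer), something like the following could count kernel launches around a deep_copy of ordinary HostSpace views. This is only a sketch: it assumes the Kokkos::Tools::Experimental::set_begin_parallel_for_callback hook is available in this Kokkos version, and the sizes are chosen around the 10240-element boundary observed above.

#include <Kokkos_Core.hpp>
#include <cstdint>
#include <iostream>

// Count every parallel_for that Kokkos reports to the tools interface.
static int g_parallel_for_count = 0;
static void count_begin(const char* /*label*/, const uint32_t /*devid*/,
                        uint64_t* /*kernid*/) {
  ++g_parallel_for_count;
}

int main(int argc, char** argv) {
  Kokkos::initialize(argc, argv);
  {
    for (int n : {10239, 10240, 20480}) {
      Kokkos::View<double*, Kokkos::HostSpace> src("src", n), dst("dst", n);

      // Install the counter only around the deep_copy, so that view
      // initialization kernels are not counted.
      g_parallel_for_count = 0;
      Kokkos::Tools::Experimental::set_begin_parallel_for_callback(count_begin);
      Kokkos::deep_copy(dst, src);
      Kokkos::Tools::Experimental::set_begin_parallel_for_callback(nullptr);

      std::cout << "n = " << n << ": " << g_parallel_for_count
                << " parallel_for launch(es) during deep_copy\n";
    }
  }
  Kokkos::finalize();
  return 0;
}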
I switched all the compilers to GCC 10.2.0 and rebuilt all the sources (including Boost) to get the program running. I will change the title to "performance and potential bugs in data duplication." I suggest applying the resilient execution space selectively, because triplicating non-compute loops does not make sense.
Here is the result with 16 MB of data, 600 iterations, and 28 threads: heatdis takes 2.21 seconds, heatdis_resil takes 3712 seconds. The output is correct (I checked the 16 MB and 1 MB cases). However, that is roughly 1700x slower (3712 / 2.21). We need to investigate the slowdown, as I expect 3-5x at worst. (I will submit an issue.)
I switched to the latest updated version. Then the code fails with heat_dist. See my report: https://github.com/kokkos/kokkos-resilience/pull/14
Closing this issue; I will open another issue to discuss all the tasks associated with the resilient execution space.
The issue is that the RangePolicy resilient parallel_for with single-dimensional resilient views and no MPI fails in the heat distribution test on kahuna. It appears either to not run at all or to enter an infinite loop (the program times out), although the precise moment of failure is yet to be determined.
Modules loaded: cmake 3.19.1 gcc7-support gcc 7.5.0
Notes:
0) Branch of Resilient Kokkos: resilient-execution-space
1) The no-MPI version of the heat distribution test works with non-resilient Kokkos
2) heatdist test code: https://github.com/nmm0/veloc-heat-test/commit/ef7a94bb2bf065817c78ed867e1eecd1825ce0d5