kokkos / kokkos-resilience

Resilience Extensions for Kokkos
Other
4 stars 2 forks source link

update to Kokkos 4.0, C++17, and latest VeloC #41

Closed ElisabethGiem closed 1 year ago

ElisabethGiem commented 2 years ago

Solution to Issue 40, update code to work with newer versions of kokkos develop.

nmm0 commented 2 years ago

@nphtan did this fix your issue?

nphtan commented 2 years ago

It builds and runs the tests fine but using the VeloC backend (automatic checkpointing) results in errors due to empty memory region selection. It's not capturing and registering the Views with VeloC.

nmm0 commented 2 years ago

It builds and runs the tests fine but using the VeloC backend (automatic checkpointing) results in errors due to empty memory region selection. It's not capturing and registering the Views with VeloC.

Perfect -- check out https://github.com/kokkos/kokkos-resilience/blob/main/tests/TestLambdaCapture.cpp#L77

We had to change the API; you have to add the view subscriber to your view types to enable the capture.

ElisabethGiem commented 2 years ago

The most recent commit on this branch, if merged into main, brings it in line with the current Kokkos develop branch. We still have a few headers to purge in line with Kokkos PR#4865, but it's functional.

ElisabethGiem commented 1 year ago

I just want to make sure I didn't miss a significant change on my pass through the code changes; would you say the following was a correct summary of the changes I should find in compare?

Questions:

It appears KR_ENABLE_VELOC was removed: was some other provision made for those downstream dependencies or were they removed from the code totally? I don't work with VeloC so I don't have any knowledge here: I only see that the relevant VeloC cmake, etc., was removed as well, but did it break whatever code was relying on it?

Why was the status flag removed from the try-catch in the StdFileBackend checkpoint and restart in src/resilience/stdfile/StdFileBackend.cpp?

Matthew-Whitlock commented 1 year ago

These changes seem to also update to work with the version of VeloC in your VeloC PR 43, was that intentional? I'm reviewing with the assumption that it was, though it might be good to wait until that is merged in.

  1. Major: cmake/resilienceConfig.cmake.in should be updated to the new syntax for veloc (find_dependency(veloc) instead of find_dependency(VeloC))
  2. Minor: Since this requires using the newest version of VeloC, and VeloC barebones is not being maintained, we should remove references to it in CMakeLists, CMake config, cmake presets, and readme. The readme mentions spack options for compiling VeloC to barebones - does the spack stuff also need to be updated?

The test framework is failing in two ways, but they seem unrelated to these changes. Just listing here for reference so no one else needs to track down the sources when testing this:

Other than these, I think the code looks good.

Matthew-Whitlock commented 1 year ago

https://github.com/kokkos/kokkos-resilience/blob/main/src/resilience/AutomaticCheckpoint.hpp#L82

Should be updated to use KOKKOS_IF_ON_HOST

ElisabethGiem commented 1 year ago
  • I think we could use omp_get/set_num_threads to make this more reliable.

The last advice I received regarding this (I also attempted to import a thread randomizer into this version of googletest--the limiting factor--and many other options, before settling for the comment text warning for now, which I agree is a temporary solution at best) was that many unfortunate problems could be caused by changing the thread pool in the execution space after Kokkos initialization and using the OpenMP execution space (resilient OpenMP is directly based off it).

A more robust error injection would be ideal, but the test served its purpose of proving that the subscriber can detect problematic data.

nmm0 commented 1 year ago

Well the already bloated PR got a bit more bloated. I also changed some usages of boost::optional and boost::variant to std::optional and std::variant

ElisabethGiem commented 1 year ago

Matthew's major concern from the previous review (cmake/resilienceConfig.cmake.in should be updated to the new syntax for veloc ) was not addressed in the files that came with this branch for me. Other than that, the new commits seem fine to me.

nmm0 commented 1 year ago

Matthew's major concern from the previous review (cmake/resilienceConfig.cmake.in should be updated to the new syntax for veloc ) was not addressed in the files that came with this branch for me. Other than that, the new commits seem fine to me.

Woops, forgot that, fixed now!