Closed ElisabethGiem closed 1 year ago
@nphtan did this fix your issue?
It builds and runs the tests fine but using the VeloC backend (automatic checkpointing) results in errors due to empty memory region selection. It's not capturing and registering the Views with VeloC.
It builds and runs the tests fine but using the VeloC backend (automatic checkpointing) results in errors due to empty memory region selection. It's not capturing and registering the Views with VeloC.
Perfect -- check out https://github.com/kokkos/kokkos-resilience/blob/main/tests/TestLambdaCapture.cpp#L77
We had to change the API; you have to add the view subscriber to your view types to enable the capture.
The most recent commit on this branch, if merged into main, brings it in line with the current Kokkos develop branch. We still have a few headers to purge in line with Kokkos PR#4865, but it's functional.
I just want to make sure I didn't miss a significant change on my pass through the code changes; would you say the following was a correct summary of the changes I should find in compare?
Questions:
It appears KR_ENABLE_VELOC
was removed: was some other provision made for those downstream dependencies or were they removed from the code totally? I don't work with VeloC so I don't have any knowledge here: I only see that the relevant VeloC cmake, etc., was removed as well, but did it break whatever code was relying on it?
Why was the status
flag removed from the try-catch
in the StdFileBackend
checkpoint
and restart
in src/resilience/stdfile/StdFileBackend.cpp
?
These changes seem to also update to work with the version of VeloC in your VeloC PR 43, was that intentional? I'm reviewing with the assumption that it was, though it might be good to wait until that is merged in.
find_dependency(veloc)
instead of find_dependency(VeloC)
)The test framework is failing in two ways, but they seem unrelated to these changes. Just listing here for reference so no one else needs to track down the sources when testing this:
TestResOpenMP.TestResilientForInsertError
caused by a multiple of three number of OpenMP threads. I think we could use omp_get/set_num_threads to make this more reliable.Other than these, I think the code looks good.
https://github.com/kokkos/kokkos-resilience/blob/main/src/resilience/AutomaticCheckpoint.hpp#L82
Should be updated to use KOKKOS_IF_ON_HOST
- I think we could use omp_get/set_num_threads to make this more reliable.
The last advice I received regarding this (I also attempted to import a thread randomizer into this version of googletest--the limiting factor--and many other options, before settling for the comment text warning for now, which I agree is a temporary solution at best) was that many unfortunate problems could be caused by changing the thread pool in the execution space after Kokkos initialization and using the OpenMP execution space (resilient OpenMP is directly based off it).
A more robust error injection would be ideal, but the test served its purpose of proving that the subscriber can detect problematic data.
Well the already bloated PR got a bit more bloated. I also changed some usages of boost::optional
and boost::variant
to std::optional
and std::variant
Matthew's major concern from the previous review (cmake/resilienceConfig.cmake.in should be updated to the new syntax for veloc ) was not addressed in the files that came with this branch for me. Other than that, the new commits seem fine to me.
Matthew's major concern from the previous review (cmake/resilienceConfig.cmake.in should be updated to the new syntax for veloc ) was not addressed in the files that came with this branch for me. Other than that, the new commits seem fine to me.
Woops, forgot that, fixed now!
Solution to Issue 40, update code to work with newer versions of kokkos develop.