bartlettroscoe opened this issue 6 years ago (status: Open)
The creation of this issue came out of a conversation I had with @nmhamster last week about the issues with threaded testing of Trilinos and its impact on ATDM. Testing the new ctest property PROCESSOR_AFFINITY is fairly urgent because it is already in the current CMake git repo 'master' branch, which means it will go out in CMake 3.12 as-is, and if we don't find and fix any problems with it now, it may be hard to change after that. Also, if we are going to enable OpenMP in the CI or the auto PR build, we need to make sure we can run tests in parallel so that we are not stuck running with ctest -j1
due to the thread-binding issue mentioned above and described by @crtrott in https://github.com/trilinos/Trilinos/issues/2398#issuecomment-374379614. So we need to get on this testing ASAP.
Below is some detailed info from @rppawlo about how to reproduce the binding of threads in multiple MPI processes to the same core.
@rppawlo,
Can you attach your complete do-configure script for this build? Otherwise, hopefully this is as simple as using the standard SEMS CI build with:
$ cmake \
[standard CI options] \
-D Trilinos_ENABLE_OpenMP=ON \
-D MPI_EXEC_POST_NUMPROCS_FLAGS="-bind-to;core;-map-by;core" \
-D Trilinos_ENABLE_Panzer=ON -DTrilinos_ENABLE_TESTS=ON \
<trilinosDir>
$ make -j16
$ export OMP_NUM_THREADS=2
$ ctest -E ConvTest [-j16]
Below are the Panzer test results for the corresponding configure file that I sent earlier. I ran with -E ConvTest to turn off the costly convergence tests. HWLOC was enabled for an OpenMP Kokkos build and MPI was configured with:
-D MPI_EXEC_POST_NUMPROCS_FLAGS="-bind-to;core;-map-by;core" \
I exported OMP_NUM_THREADS=2 for all tests. This is a 36-core Xeon node (72 with hyperthreads).
Without specifying the -j flag, the tests finished in 131 seconds, running one at a time. Running the same tests with -j16 took 1119 seconds. The output test timings for each test are below so you can compare.
[rppawlo@gge BUILD]$ hwloc-info
depth 0: 1 Machine (type #1)
depth 1: 2 NUMANode (type #2)
depth 2: 2 Package (type #3)
depth 3: 2 L3Cache (type #4)
depth 4: 36 L2Cache (type #4)
depth 5: 36 L1dCache (type #4)
depth 6: 36 L1iCache (type #4)
depth 7: 36 Core (type #5)
depth 8: 72 PU (type #6)
Special depth -3: 5 Bridge (type #9)
Special depth -4: 7 PCI Device (type #10)
Special depth -5: 4 OS Device (type #11)
Running with ctest -j1:
Running with ctest -j16:
It would be valuable to have @dsunder in this conversation.
Here's the configure:
Shoot, it looks like the problem of multiple MPI jobs running at the same time and slowing each other down may also be an issue with CUDA on GPUs, as described in https://github.com/trilinos/Trilinos/issues/2446#issuecomment-375804502.
@nmhamster, we really need to figure out how to manage running multiple MPI jobs on the same nodes at the same time without having them step on each other.
CC: @rppawlo, @ambrad, @nmhamster
As described in https://github.com/trilinos/Trilinos/issues/2446#issuecomment-375840693, it seems that indeed the Trilinos test suite using Kokkos on the GPU does not allow the tests to be run in parallel either. I think this increases the importance of this story to get this fixed once and for all.
@bartlettroscoe Do we not run the MPS server on the test machines, to let multiple MPI processes share the GPU?
Do we not run the MPS server on the test machines, to let multiple MPI processes share the GPU?
@mhoemmen, not that I know of. There is no mention of an MPS server in any of the documentation that I can find in the files:
hansen:/opt/HANSEN_INTRO
white:/opt/WHITE_INTRO
I think this is really a question for the test beds team.
@nmhamster, do you know if the Test Bed team has any plans to set up an MPS server to manage this issue on any of the Test Bed machines?
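For reference, here is a minimal sketch of what starting the CUDA MPS control daemon on a node typically looks like (the directory locations are placeholders, and whether/where this is appropriate on the Test Bed machines is exactly the question above):
# Run once per node before launching the MPI jobs
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps-pipe
export CUDA_MPS_LOG_DIRECTORY=/tmp/mps-log
nvidia-cuda-mps-control -d              # start the MPS daemon in the background
# ... run the tests; MPI ranks now share the GPU through the MPS server ...
echo quit | nvidia-cuda-mps-control     # shut the daemon down afterwards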
It looks like even non-threaded tests can't run in parallel with each other without slowing each other down, as was demonstrated for the ATDM gnu-opt-serial build in https://github.com/trilinos/Trilinos/issues/2455#issuecomment-376237596. In that experiment, the test Anasazi_Epetra_BlockDavidson_auxtest_MPI_4 completed in 119 seconds when run alone but took 760 seconds to complete when run with ctest -j8 on 'hansen'.
We really need to start experimenting with the updated ctest program in 'master' that has the process affinity property.
@bartlettroscoe is it possible to get a detailed description of what this new process affinity feature in CMake does?
is it possible to get a detailed description of what this new process affinity feature in CMake does?
@ibaned,
We will need to talk with Brad King at Kitware. Otherwise, you can get more info by looking at:
(if you don't have access yet let me know and I can get you access).
FYI: As pointed out by @etphipp in https://github.com/trilinos/Trilinos/issues/2628#issuecomment-384347891, setting:
-D MPI_EXEC_PRE_NUMPROCS_FLAGS="--bind-to;none"
seems to fix the problem of OpenMP threads all binding to the same core on a RHEL6 machine.
Could this be a short-term solution to the problem of setting up automated builds of Trilinos with OpenMP enabled?
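If so, a minimal sketch of what such an automated OpenMP build might look like, reusing the standard CI recipe from the top of this issue but with the --bind-to none pre-flag (the -j values here are only illustrative):
cmake \
  [standard CI options] \
  -D Trilinos_ENABLE_OpenMP=ON \
  -D MPI_EXEC_PRE_NUMPROCS_FLAGS="--bind-to;none" \
  <trilinosDir>
make -j16
export OMP_NUM_THREADS=2
ctest -j8   # roughly (num physical cores) / OMP_NUM_THREADS; see the next comment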
@bartlettroscoe yes, that's a step in the right direction. Threads will at least be able to use all the cores, although they will move around, and threads from different jobs will compete if using ctest -j. Still, you should get semi-decent results from this. I recommend dividing the argument to ctest -j by the number of threads per process. In fact I think --bind-to none is the best way to go until we have direct support in CMake for binding with ctest -j.
I recommend dividing the argument to ctest -j by the number of threads per process. In fact I think --bind-to none is the best way to go until we have direct support in CMake for binding with ctest -j.
@ibaned, that is basically what we have been doing up till now in the ATDM Trilinos builds, and that is consistent with how we have set the CTest PROCESSORS property. The full scope of this current Issue is to tell ctest about the total number of threads to be used and to use the updated version of CMake/CTest that can set process affinity correctly.
When I get some free time on my local RHEL6 machine, I will try enabling OpenMP and setting -D MPI_EXEC_PRE_NUMPROCS_FLAGS="--bind-to;none" and then running the entire test suite for PT packages in Trilinos for the GCC 4.8.4 and Intel 17.0.1 builds and see what that looks like.
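For concreteness, here is a small sketch of picking the ctest -j value by dividing the core count by OMP_NUM_THREADS (nproc typically reports logical CPUs, so halving it to get physical cores is an assumption about the machine):
export OMP_NUM_THREADS=2
NUM_PHYS_CORES=$(( $(nproc) / 2 ))               # e.g. 72 logical CPUs / 2 = 36 physical cores
CTEST_J=$(( NUM_PHYS_CORES / OMP_NUM_THREADS ))  # 36 / 2 = 18 concurrent tests
ctest -E ConvTest -j ${CTEST_J}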
@prwolfe We spoke today about updating the arguments for the GCC PR testing builds. When we do, and add OpenMP to one of them, we should use the argument described above.
Hmm, had not started OpenMP yet, but that would be good.
FYI: I updated the "Possible Solutions" section above for the realization that we will need to extend CTest to be more architecture aware and to be able to pass information to the MPI jobs when they start up as described in the new Kitware backlog item:
Without that, I don't think we have a shot at having multiple tests run with ctest spread out over the hardware effectively and robustly.
FYI: As @etphipp postulates in https://github.com/trilinos/Trilinos/issues/3256#issuecomment-411542910, having the OS migrate threads and processes can be as much as a 100x hit in runtime. That is worse than if the tests were just pinned to one set of cores. Therefore, getting ctest and MPI to work together to allocate jobs to nodes/sockets/cores carefully, and making sure they don't move, could have a very large impact on reducing test runtimes.
@ibaned knows things too: binding threads in an MPI process, selecting GPUs. Wondering if we're basically building a scheduler here.
We discussed this at a meeting at the TUG today.
@etphipp knows some things about getting the MPI jobs and processes to bind to the cores that are desired.
@prwolfe says that the capviz staff know some things about this area. We can ask them.
@npe9 also knows a good bit about this. Setting the env var OMPI_MCA_mpi_paffinity_alone=0 may help fix the problems.
@ibaned says that @dsunder has been working on getting MPI jobs and processes with Kokkos to bind to the correct set of cores and devices.
FYI, the OpenMPI command-line option to bind MPI ranks to a specific set of processor IDs is --slot-list <slot list>. If CTest could keep track of which processors each test is using, this flag, specifying a disjoint set of processors for each running test in conjunction with --bind-to none, would probably work pretty well to keep tests from oversubscribing cores. If we need to control which processor ID each rank binds to, it appears the only way to do that is to use a host/rank file.
FYI, the OpenMPI command-line option to bind MPI ranks to a specific set of processor IDs is --slot-list <slot list>. If CTest could keep track of which processors each test is using, this flag, specifying a disjoint set of processors for each running test in conjunction with --bind-to none, would probably work pretty well to keep tests from oversubscribing cores. If we need to control which processor ID each rank binds to, it appears the only way to do that is to use a host/rank file.
@etphipp, can we select a subset of the more expensive/problematic Trilinos tests (e.g. Tempus, Panzer) and write up a quick bash shell script that simulates this, to see how this approach compares to running with ctest -j<N> (the current bad implementation) and to running the tests one at a time?
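A rough, untested sketch of the kind of experiment being requested, using a hypothetical pair of test executables and the --slot-list idea from @etphipp (the exact slot-list syntax should be checked against the OpenMPI mpirun man page):
export OMP_NUM_THREADS=2
# Run two "tests" concurrently, each pinned to a disjoint set of processor IDs,
# and compare the wall times to running the same tests back to back.
( time mpirun -np 2 --bind-to none --slot-list 0,1,2,3 <test-exec-1> ) &
( time mpirun -np 2 --bind-to none --slot-list 4,5,6,7 <test-exec-2> ) &
wait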
One issue that @nmhamster noted was that we may also need to be careful about not overloading the machine's RAM. For this, we need to know the max memory usage for each machine and then let CTest manage that when it starts tests. For that, we need to measure this for each test and store it away to be used later.
This could be done with cat /proc/<PID>/status and then grepping out VmHWM or VmRSS. But does this trace children?
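A minimal sketch of the /proc-based measurement (pgrep on the test executable name is an assumption; as noted, this only sees the PIDs it is pointed at, so children would need to be walked explicitly):
# Sample the high-water-mark and current resident memory of a running test's processes
for pid in $(pgrep -f <test-exec>); do
  echo "PID ${pid}:"
  grep -E 'VmHWM|VmRSS' /proc/${pid}/status
done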
The other option is getrusage, which must get called inside each MPI process (but we could call that in the GlobalMPISession destructor).
The SNL ATDM Tools Performance subproject has a tool to call this in MPI_Finalize() using the PMPI_Finalize() injection point!
But the question is if this will work on Cray with 'trinity' ('mutrino', etc.).
Then we also need to do this for the GPU as well to really do this correctly.
Just something to consider for a system that really allows us to use all of the hardware without crashing the tests.
FYI: I think @nmhamster and @crtrott have come up with a possible strategy for addressing the problem of running multiple MPI tests on the same nodes and GPUs at the same time while spreading out and managing resources (see TRIL-231). The idea is for ctest to be given an abstract model of the machine, which would include a set of homogeneous nodes of the form:
and rules about how ctest is allowed to assign this set of computing resources (i.e. nodes, cores, and GPUs) to MPI processes in tests, such as:
CTest already has a concept of multiple processes per test through the PROCESSORS test property. What we would need to do is extend ctest to define a new test property called something like THREADS_PER_PROCESS (default 1) to allow us to tell it (on a test-by-test basis) how many threads each process uses. (And we would need something similar for accelerators like GPUs.)
With this information, ctest would be extended to keep track of the computing resources (nodes, cores, and GPUs) and give them out to run tests as per the rules above (trying not to overload cores or GPUs unless explicitly requested to do so). When running a given test, ctest would set env vars to tell the test where it should run, and then we would extend TriBITS to construct mpirun commands on the fly that read these env vars and map them to the right things.
For example, to run a 2-process (PROCESSORS=2) MPI test that uses 4 cores per process (THREADS_PER_PROCESS=4), with everything running on the same socket and sharing the same GPU, ctest would first set env vars like:
export CTEST_TEST_NODE_0=3 # Node for 0th MPI process
export CTEST_TEST_NODE_1=3 # Node for 1st MPI process
export CTEST_TEST_CORES_0=0,1,2,3 # Cores for 0th MPI process
export CTEST_TEST_CORES_1=4,5,6,7 # Cores for 1st MPI process
export CTEST_TEST_ACCEL_0_0=1 # The 0th accelerator type (i.e. the GPU) for 0th MPI process
export CTEST_TEST_ACCEL_0_1=1 # The 0th accelerator type (i.e. the GPU) for 1st MPI process
and then we would have our test script wrapper read in these env vars and run (for OpenMPI for example):
export OMP_NUM_THREADS=4
mpirun --bind-to none \
-H ${CTEST_TEST_NODE_0} -n 1 \
-x CUDA_VISIBLE_DEVICES=${CTEST_TEST_ACCEL_0_0} \
taskset -c ${CTEST_TEST_CORES_0} <exec> : \
-H ${CTEST_TEST_NODE_1} -n 1 \
-x CUDA_VISIBLE_DEVICES=${CTEST_TEST_ACCEL_0_1} \
taskset -c ${CTEST_TEST_CORES_1} <exec>
(see CUDA_VISIBLE_DEVICES and the OpenMPI mpirun command-line options)
That way, the test would only run on that set of cores and GPUs on the 3rd node. We could do that mapping automatically using TriBITS through the existing functions tribits_add_test() and tribits_add_advanced_test(). (We would likely write a ctest -P script to read the env vars and create and run the mpirun command; tribits_add_advanced_test() already writes a ctest -P script for each test, so we could just extend that.)
We need to experiment with this some on various machines to see if this will work, but if it does, that is what will be implemented by Kitware in:
and in TriBITS.
Does anyone have sufficient interest and time to help experiment with this approach in the short term? Otherwise, it will be a while before I have time to experiment with this.
Otherwise, we will document these experiments in:
and
A pretty good representation of how well we are using our test machines when we have to use ctest -j1:
Forwarded by @fryeguy52 (original source unknown).
@bartlettroscoe / @fryeguy52 - that picture could be used all over DOE :-).
CC: @KyleFromKitware
@jjellio, continuing from the discussion started in https://github.com/kokkos/kokkos/issues/3040, I did timing of the Trilinos test suite with a CUDA build on 'vortex' for the 'ats2' env and I found that raw 'jsrun' does not spread out over the 4 GPUs on a node on that system automatically. However, when I switched over to the new CTest GPU allocation approach in commit https://github.com/trilinos/Trilinos/pull/7427/commits/692e990cc74d37045c7ddcc3561920d05c48c9f0 as part of PR #7427, I got perfect scalability of the TpetraCore_gemm tests up to ctest -j4. See the details in PR #7427. I also repeat the timing experiments done for that PR branch below.
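For context, here is a sketch of how a per-test wrapper can consume the CTest resource-allocation information and hand a GPU to the test (based on my reading of the CTest resource env vars, so the exact variable names and format should be double-checked against the CMake documentation; the jsrun line is a placeholder):
if [ -n "${CTEST_RESOURCE_GROUP_COUNT}" ]; then
  # e.g. CTEST_RESOURCE_GROUP_0_GPUS="id:2,slots:1"
  gpu_id=$(echo "${CTEST_RESOURCE_GROUP_0_GPUS}" | sed -e 's/.*id:\([^,;]*\).*/\1/')
  export CUDA_VISIBLE_DEVICES="${gpu_id}"
fi
jsrun -p 4 <test-exec>   # placeholder launch line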
@bartlettroscoe It depends entirely on the flags you've given to JSRUN. The issue I've linked to shows it working. It hinges on resource sets. What jsrun lines are you using?
What jsrun lines are you using?
@jjellio, I believe the same ones being used by SPARC that these were copied from. See lines starting at:
Since the CTest GPU allocation method is working so well, I would be hesitant to change what is currently in PR #7204.
Yep, and those options do not specify GPU or binding options. The lines currently used on most platforms for Trilinos testing are chosen to oversubscribe a system to get throughput.
The flags I used back then were:
jsrun -r1 -a1 -c4 -g1 -brs
-r1 = 1 resource set
-a1 = 1 task per resource set. (so 1 resource set ... I get 1 tasks total)
-c4 = 4 cores per task
-g1 = 1 GPU per task
-brs = bind to resource set (so you get a process mask that isolates resource sets)
The problem is that those flags use -a, which forbids -p. It could be that -a is what made the difference, but I tend to think it was -g1: Spectrum needs to know you want a GPU.
The flags I'd use are:
export ATDM_CONFIG_MPI_POST_FLAGS="-r;4;-c;4;-g;1;-brs"
export ATDM_CONFIG_MPI_EXEC_NUMPROCS_FLAG="-p"
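For a 4-rank test, that would expand to a launch line along the lines of the following (the executable name is a placeholder):
jsrun -p 4 -r 4 -c 4 -g 1 -brs <test-exec>   # 4 tasks, 4 resource sets/host, 4 cores + 1 GPU each, bound to resource sets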
@jjellio,
When you say:
The lines currently used on most platforms for Trilinos testing are chosen to oversubscribe a system to get throughput.
who is doing this testing?
Otherwise, we have had problems with robustness when trying to oversubscribe on some systems (I would have to dig up some references).
So, I just ran on the ATS2 testbed (rzansel). Using -r4 -c4 -g1 -brs -p1, the jobs are serialized (look for 'PID XYZ started'; if they were parallel, you'd expect 4 PIDs starting at the start):
[jjellio@rzansel46:async]$ for i in $(seq 1 20); do jsrun -r4 -c4 -g1 -brs -p1 ./runner.sh & done
[1] 92455
[2] 92456
[3] 92457
[4] 92458
[5] 92459
[jjellio@rzansel46:async]$ PID 92758 has started!
Rank: 00 Local Rank: 0 rzansel46
Cuda Devices: 0
CPU list: 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Sockets: 0 NUMA list: 0
Job will sleep 26 seconds to waste time
Job jsrun started at Tue May 26 11:23:36 PDT 2020
ended at Tue May 26 11:24:02 PDT 2020
PID 92853 has started!
Rank: 00 Local Rank: 0 rzansel46
Cuda Devices: 0
CPU list: 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Sockets: 0 NUMA list: 0
Job will sleep 20 seconds to waste time
Job jsrun started at Tue May 26 11:24:02 PDT 2020
ended at Tue May 26 11:24:22 PDT 2020
PID 92925 has started!
Rank: 00 Local Rank: 0 rzansel46
Cuda Devices: 0
CPU list: 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Sockets: 0 NUMA list: 0
Job will sleep 17 seconds to waste time
Job jsrun started at Tue May 26 11:24:22 PDT 2020
ended at Tue May 26 11:24:39 PDT 2020
PID 92963 has started!
Rank: 00 Local Rank: 0 rzansel46
Cuda Devices: 0
CPU list: 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Sockets: 0 NUMA list: 0
Job will sleep 30 seconds to waste time
Job jsrun started at Tue May 26 11:24:39 PDT 2020
ended at Tue May 26 11:25:09 PDT 2020
PID 93045 has started!
They become unserialized if you use -r1 -a1 -g1 -brs; that's pretty obnoxious. That line works in place of -p1.
So it would seem if you use:
export ATDM_CONFIG_MPI_POST_FLAGS="-r;4;-c;4;-g;1;-brs"
export ATDM_CONFIG_MPI_EXEC_NUMPROCS_FLAG="-p"
plus the Kitware/CTest stuff, it should work fine. My only skin in this game is more headaches on ATS2... I don't need any more headaches on ATS2.
So it would seem if you use:
export ATDM_CONFIG_MPI_POST_FLAGS="-r;4;-c;4;-g;1;-brs"
export ATDM_CONFIG_MPI_EXEC_NUMPROCS_FLAG="-p"
Plus the Kitware/CTest stuff, it should work fine.
What is the advantage of selecting those options over what is listed in atdm/ats2/environment.sh currently? What is broken that this is trying to fix?
My only skin in this game is more headaches on ATS2... I don't need anymore headaches on ATS2.
It is not just you and I, it is everyone running Trilinos tests on ATS-2. The ATDM Trilinos configuration should be the recommended way for people to run the Trilinos test suite on that system.
If you don't have a -g1 flag, then jsrun is not going to set CUDA_VISIBLE_DEVICES.
If Kitware is going to manage the CUDA device assignment, then it shouldn't matter.
-c4 -brs is process binding. It keeps your job on XYZ cores and prevents other jobs from being there. I recommend anything in the range 2 to 10; I've just found 4 to be a good number. If you don't have -cN, you get exactly 1 hardware thread running your job, and that isn't enough to keep the GPU happy and the host code happy. You need -c2 or more. Specifying -brs makes -c operate on logical cores (not hardware threads), so -c2 -brs gives you substantially more compute resources. You need a few hardware threads to keep the Nvidia threads happy (they spawn 4).
TLDR: just use -r4 -c4 -g1 -brs.
As for the oversubscription stuff:
How does the ctest work interact with KOKKOS_NUM_DEVICES or --kokkos-ndevices? With jsrun, it sets CUDA_VISIBLE_DEVICES, which makes kokkos-ndevices always see a zero device.
FYI, I had the same comment here: https://github.com/trilinos/Trilinos/pull/6724#issuecomment-582080188
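To illustrate the interaction being described (the device number is just an example): once CUDA_VISIBLE_DEVICES restricts a process to a single GPU, that GPU always shows up as device 0 inside the process, so a rank-based --kokkos-ndevices mapping has nothing left to choose between:
export CUDA_VISIBLE_DEVICES=2        # e.g. set by jsrun -g1 or by a CTest resource wrapper
<test-exec> --kokkos-ndevices=4      # the process only sees one GPU, and it is device 0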
This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE
label.
If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE
.
If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.
This is actually really close to getting done. We just need a tribits_add[_advanced]_test( ... NUM_THREADS_PER_PROC <numThreadsPerProc> ... ) argument for OpenMP builds and I think we have it. We have CUDA builds well covered now (at least for single-node testing).
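For an OpenMP build, what that proposed argument would boil down to at test-run time is roughly the following (a sketch assuming a test declared with NUM_MPI_PROCS 2 and NUM_THREADS_PER_PROC 4; everything except OMP_NUM_THREADS and mpiexec is illustrative):
export OMP_NUM_THREADS=4     # <numThreadsPerProc>
mpiexec -np 2 <test-exec>    # NUM_TOTAL_CORES_USED = 2 * 4 = 8, which ctest would reserve via PROCESSORS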
This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE
label.
If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE
.
If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.
While this has been partially addressed with the CMake Resource management and GPU limiting, the full scope of this Story has not been addressed yet (see above).
CC: @trilinos/framework, @trilinos/kokkos, @trilinos/tpetra, @nmhamster, @rppawlo
Next Action Status
Set up a meeting to discuss current status of threaded testing in Trilinos and some steps to try to address the issues ...
Description
It seems the testing strategy in Trilinos for threading is to build threaded code, run all of the threaded tests with the same number of threads (such as by setting export OMP_NUM_THREADS=2 when using OpenMP), and then run the test suite with ctest -j<N> with that fixed number of threads. But this approach, and testing with threads enabled in general, has some issues.

First, with some configurations and systems, running with any <N> with ctest -j<N> will result in all of the test executables binding their threads to the same cores, making things run very slowly, as described in https://github.com/trilinos/Trilinos/issues/2398#issuecomment-374379614. A similar problem occurs with CUDA builds, where the various test processes running concurrently do not spread the load across the available GPUs (see https://github.com/trilinos/Trilinos/issues/2446#issuecomment-375804502).

Second, even when one does not experience the above problem of binding to the same cores (which is not always a problem), this approach does not make very good usage of the test machine because it assumes that every MPI process is multi-threaded with Kokkos, which is not true. Even when OMP_NUM_THREADS > 1, there are a lot of Trilinos tests that don't have any threaded code, so even if ctest allocates room for 2 threads per MPI process, only one thread will be used. This results in not keeping many of the cores busy running code and therefore makes tests take longer to complete. The impact of the two problems above is that many developers and many automated builds have to run with a small ctest -j<N> (e.g. ctest -j8 is used on many of the ATDM Trilinos builds) and therefore do not utilize many of the cores that are available. This makes the time to run the full test suite go up significantly, which negatively impacts developer productivity (because developers have to wait longer to get feedback from running tests locally), wastes existing testing hardware, and/or limits the number of builds and the number of tests that can be run in a given testing day (which reduces the number of defects that we can catch and therefore costs Trilinos developers and users time and $$).

Third, having to run the entire Trilinos test suite with a fixed number of threads like export OMP_NUM_THREADS=2 or export OMP_NUM_THREADS=4 either does not result in very good testing or results in very expensive testing by having to run the entire Trilinos test suite multiple times. It has been observed that defects occur only at some thread counts, like export OMP_NUM_THREADS=5, for example. This would be like having every MPI test in Trilinos run with exactly the same number of MPI processes, which would not result in very good testing (and is not the case in Trilinos, as several tests are run with different numbers of MPI processes).

Ideas for Possible Solutions
First, ctest needs to be extended in order to inform it of the architecture of the system where it will be running tests. CTest needs to know the number of sockets per node, the number of cores per socket, the number of threads per node, and the number of nodes. We will also need to inform CTest about the number of MPI ranks vs. threads per MPI rank for each test (i.e. add a THREADS_PER_PROCESS property in addition to the PROCESSORS property). With that type of information, ctest should be able to determine the binding of the different ranks in an MPI job that runs a test to specific cores on sockets on nodes. And we will need to find a way to communicate this information to the MPI jobs when they are run by ctest. I think this means adding the types of process affinity and process placement that you see in modern MPI implementations (see https://github.com/open-mpi/ompi/wiki/ProcessAffinity and https://github.com/open-mpi/ompi/wiki/ProcessPlacement). See this Kitware backlog item.

Second, we should investigate how to add a NUM_THREADS_PER_PROC <numThreadsPerProc> argument to the TRIBITS_ADD_TEST() and TRIBITS_ADD_ADVANCED_TEST() commands. It would be good if this could be added directly to these TriBITS functions and provide some type of "plugin" system to allow us to define how the number of threads gets set when running the individual test. But the default TriBITS implementation could just compute NUM_TOTAL_CORES_USED <numTotalCoresUsed> from <numThreadsPerProc> * <numMpiProcs>.

The specialization of this new TriBITS functionality for Kokkos/Trilinos would set the number of requested threads based on the enabled threading model known at configure time. For OpenMP, it would set the env var OMP_NUM_THREADS=<numThreads>, and for other threading models it would pass in --kokkos-threads=<numThreads>. If the NUM_THREADS_PER_PROC <numThreadsPerProc> argument was missing, this could use a default number of threads (e.g. a global configure argument Trilinos_DEFAULT_NUM_THREADS with default 1). If the computed <numTotalCoresUsed> was larger than ${MPI_EXEC_MAX_NUMPROCS} (which should be set to the maximum number of threads that can be run on that machine when using threading), then the test would get excluded and a message would be printed to STDOUT. Such a CMake function should be pretty easy to write, if you know the threading model used by Kokkos at configure time.

Definition of Done:
Tasks: