bartlettroscoe opened this issue 6 years ago (status: Open)
The creation of this issue came out of a conversation I had with @nmhamster last week about the issues with threaded testing of Trilinos and its impact on ATDM. Testing the new ctest property PROCESSOR_AFFINITY is fairly urgent because it is already in the current CMake git repo 'master' branch, which means it will go out in CMake 3.12 as-is, and if we don't find and fix any problems with it now, it may be hard to change after that. Also, if we are going to enable OpenMP in the CI or the auto PR build, we need to make sure we can run tests in parallel so that we are not stuck running with ctest -j1
due to the thread-binding issue mentioned above and described by @crtrott in https://github.com/trilinos/Trilinos/issues/2398#issuecomment-374379614. So we need to get on this testing ASAP.
Below is some detailed info from @rppawlo about how to reproduce the binding of threads in multiple MPI processes to the same core.
@rppawlo,
Can you attach your complete do-configure script for this build? Otherwise, hopefully this is as simple as using the standard SEMS CI build with:
$ cmake \
[standard CI options] \
-D Trilinos_ENABLE_OpenMP=ON \
-D MPI_EXEC_POST_NUMPROCS_FLAGS="-bind-to;core;-map-by;core" \
-D Trilinos_ENABLE_Panzer=ON -DTrilinos_ENABLE_TESTS=ON \
<trilinosDir>
$ make -j16
$ export OMP_NUM_THREADS=2
$ ctest -E ConvTest [-j16]
Below are the Panzer test results for the corresponding configure file that I sent earlier. I ran with -E ConvTest to turn off the costly convergence tests. HWLOC was enabled for an OpenMP Kokkos build and MPI was configured with:
-D MPI_EXEC_POST_NUMPROCS_FLAGS="-bind-to;core;-map-by;core" \
I exported OMP_NUM_THREADS=2 for all tests. This is a 36-core Xeon node (72 with hyperthreads).
Without specifying the -j flag, the tests finished in 131 seconds, running one at a time. Running the same tests with -j16 took 1119 seconds. The output test timings for each test are below so you can compare.
[rppawlo@gge BUILD]$ hwloc-info
depth 0: 1 Machine (type #1)
depth 1: 2 NUMANode (type #2)
depth 2: 2 Package (type #3)
depth 3: 2 L3Cache (type #4)
depth 4: 36 L2Cache (type #4)
depth 5: 36 L1dCache (type #4)
depth 6: 36 L1iCache (type #4)
depth 7: 36 Core (type #5)
depth 8: 72 PU (type #6)
Special depth -3: 5 Bridge (type #9)
Special depth -4: 7 PCI Device (type #10)
Special depth -5: 4 OS Device (type #11)
Running with ctest -j1:
Running with ctest -j16:
It would be valuable to have @dsunder in this conversation.
Here's the configure:
Shoot, it looks like the problem of multiple MPI jobs running at the same time and slowing each other down may also be an issue with CUDA on GPUs, as described in https://github.com/trilinos/Trilinos/issues/2446#issuecomment-375804502.
@nmhamster, we really need to figure out how to manage running multiple MPI jobs on the same nodes at the same time without having them step on each other.
CC: @rppawlo, @ambrad, @nmhamster
As described in https://github.com/trilinos/Trilinos/issues/2446#issuecomment-375840693, it seems that indeed the Trilinos test suite using Kokkos on the GPU does not allow the tests to be run in parallel either. I think this increases the importance of this story to get this fixed once and for all.
@bartlettroscoe Do we not run the MPS server on the test machines, to let multiple MPI processes share the GPU?
Do we not run the MPS server on the test machines, to let multiple MPI processes share the GPU?
@mhoemmen, not that I know of. There is no mention of an MPS server in any of the documentation that I can find in the files:
hansen:/opt/HANSEN_INTRO
white:/opt/WHITE_INTRO
I think this is really a question for the test beds team.
@nmhamster, do you know if the Test Bed team has any plans to set up an MPS server to manage this issue on any of the Test Bed machines?
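For reference, here is a minimal sketch of what starting the CUDA MPS control daemon on a node typically looks like (the directory locations are placeholders, and whether/where this is appropriate on the Test Bed machines is exactly the question above):
# Run once per node before launching the MPI jobs
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps-pipe
export CUDA_MPS_LOG_DIRECTORY=/tmp/mps-log
nvidia-cuda-mps-control -d              # start the MPS daemon in the background
# ... run the tests; MPI ranks now share the GPU through the MPS server ...
echo quit | nvidia-cuda-mps-control     # shut the daemon down afterwards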
It looks like even non-threaded tests can't run in parallel with each other without slowing each other down, as was demonstrated for the ATDM gnu-opt-serial build in https://github.com/trilinos/Trilinos/issues/2455#issuecomment-376237596. In that experiment, the test Anasazi_Epetra_BlockDavidson_auxtest_MPI_4 completed in 119 seconds when run alone but took 760 seconds to complete when run with ctest -j8 on 'hansen'.
We really need to start experimenting with the updated ctest program in 'master' that has the process affinity property.
@bartlettroscoe is it possible to get a detailed description of what this new process affinity feature in CMake does?
is it possible to get a detailed description of what this new process affinity feature in CMake does?
@ibaned,
We will need to talk with Brad King at Kitware. Otherwise, you can get more info by looking at:
(if you don't have access yet let me know and I can get you access).
FYI: As pointed out by @etphipp in https://github.com/trilinos/Trilinos/issues/2628#issuecomment-384347891, setting:
-D MPI_EXEC_PRE_NUMPROCS_FLAGS="--bind-to;none"
seems to fix the problem of OpenMP threads all binding to the same core on a RHEL6 machine.
Could this be a short-term solution to the problem of setting up automated builds of Trilinos with OpenMP enabled?
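If so, a minimal sketch of what such an automated OpenMP build might look like, reusing the standard CI recipe from the top of this issue but with the --bind-to none pre-flag (the -j values here are only illustrative):
cmake \
  [standard CI options] \
  -D Trilinos_ENABLE_OpenMP=ON \
  -D MPI_EXEC_PRE_NUMPROCS_FLAGS="--bind-to;none" \
  <trilinosDir>
make -j16
export OMP_NUM_THREADS=2
ctest -j8   # roughly (num physical cores) / OMP_NUM_THREADS; see the next comment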
@bartlettroscoe yes, that's a step in the right direction. Threads will at least be able to use all the cores, although they will move around, and threads from different jobs will compete if using ctest -j. Still, you should get semi-decent results from this. I recommend dividing the argument to ctest -j by the number of threads per process. In fact I think --bind-to none is the best way to go until we have direct support in CMake for binding with ctest -j.
I recommend dividing the argument to ctest -j by the number of threads per process. In fact I think --bind-to none is the best way to go until we have direct support in CMake for binding with ctest -j.
@ibaned, that is basically what we have been doing up till now in the ATDM Trilinos builds, and that is consistent with how we have set the CTest PROCESSORS property. The full scope of this current Issue is to tell ctest about the total number of threads to be used and to use the updated version of CMake/CTest that can set process affinity correctly.
When I get some free time on my local RHEL6 machine, I will try enabling OpenMP and setting -D MPI_EXEC_PRE_NUMPROCS_FLAGS="--bind-to;none" and then running the entire test suite for PT packages in Trilinos for the GCC 4.8.4 and Intel 17.0.1 builds and see what that looks like.
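For concreteness, here is a small sketch of picking the ctest -j value by dividing the core count by OMP_NUM_THREADS (nproc typically reports logical CPUs, so halving it to get physical cores is an assumption about the machine):
export OMP_NUM_THREADS=2
NUM_PHYS_CORES=$(( $(nproc) / 2 ))               # e.g. 72 logical CPUs / 2 = 36 physical cores
CTEST_J=$(( NUM_PHYS_CORES / OMP_NUM_THREADS ))  # 36 / 2 = 18 concurrent tests
ctest -E ConvTest -j ${CTEST_J}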
@prwolfe We spoke today about updating the arguments for the GCC PR testing builds. When we do, and add OpenMP to one of them, we should use the argument described above.
Hmm, had not started OpenMP yet, but that would be good.
FYI: I updated the "Possible Solutions" section above for the realization that we will need to extend CTest to be more architecture aware and to be able to pass information to the MPI jobs when they start up as described in the new Kitware backlog item:
Without that, I don't think we have a shot at having multiple tests run with ctest spread out over the hardware effectively and robustly.
FYI: As @etphipp postulates in https://github.com/trilinos/Trilinos/issues/3256#issuecomment-411542910, having the OS migrate threads and processes can be as much as a 100x hit in runtime. That is worse than if the tests were just pinned to one set of cores. Therefore, getting ctest and MPI to work together to allocate jobs to nodes/sockets/cores carefully, and making sure they don't move, could have a very large impact on reducing test runtimes.
@ibaned knows things too: binding threads in an MPI process, selecting GPUs. Wondering if we're basically building a scheduler here.
We discussed this at a meeting at the TUG today.
@etphipp knows some things about getting the MPI jobs and processes to bind to the cores that are desired.
@prwolfe says that the capviz staff know some things about this area. We can ask them.
@npe9 also knows a good bit about this. Setting the env var OMPI_MCA_mpi_paffinity_alone=0 may help fix the problems.
@ibaned says that @dsunder has been working on getting MPI jobs and processes with Kokkos to bind to the correct set of cores and devices.
FYI, the OpenMPI command-line option to bind MPI ranks to a specific set of processor IDs is --slot-list <slot list>. If CTest could keep track of which processors each test is using, this flag, specifying a disjoint set of processors for each running test in conjunction with --bind-to none, would probably work pretty well to keep tests from oversubscribing cores. If we need to control which processor ID each rank binds to, it appears the only way to do that is to use a host/rank file.
FYI, the OpenMPI command-line option to bind MPI ranks to a specific set of processor IDs is --slot-list <slot list>. If CTest could keep track of which processors each test is using, this flag, specifying a disjoint set of processors for each running test in conjunction with --bind-to none, would probably work pretty well to keep tests from oversubscribing cores. If we need to control which processor ID each rank binds to, it appears the only way to do that is to use a host/rank file.
@etphipp, can we select a subset of the more expensive/problematic Trilinos tests (e.g. Tempus, Panzer) and write up a quick bash shell script that simulates this, to see how this approach compares to running with ctest -j<N> (the current bad implementation) and to running the tests one at a time?
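A rough, untested sketch of the kind of experiment being requested, using a hypothetical pair of test executables and the --slot-list idea from @etphipp (the exact slot-list syntax should be checked against the OpenMPI mpirun man page):
export OMP_NUM_THREADS=2
# Run two "tests" concurrently, each pinned to a disjoint set of processor IDs,
# and compare the wall times to running the same tests back to back.
( time mpirun -np 2 --bind-to none --slot-list 0,1,2,3 <test-exec-1> ) &
( time mpirun -np 2 --bind-to none --slot-list 4,5,6,7 <test-exec-2> ) &
wait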
One issue that @nmhamster noted was that we may also need to be careful about not overloading the machine's RAM. For this, we need to know the max memory usage for each machine and then let CTest manage that when it starts tests. For that, we need to measure this for each test and store it away to be used later.
This could be done with cat /proc/<PID>/status and then grepping out VmHWM or VmRSS. But does this trace children?
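A minimal sketch of the /proc-based measurement (pgrep on the test executable name is an assumption; as noted, this only sees the PIDs it is pointed at, so children would need to be walked explicitly):
# Sample the high-water-mark and current resident memory of a running test's processes
for pid in $(pgrep -f <test-exec>); do
  echo "PID ${pid}:"
  grep -E 'VmHWM|VmRSS' /proc/${pid}/status
done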
The other option is getrusage, which must get called inside each MPI process (but we could call that in the GlobalMPISession destructor).
The SNL ATDM Tools Performance subproject has a tool to call this in MPI_Finalize() using the PMPI_Finalize() injection point!
But the question is if this will work on Cray with 'trinity' ('mutrino', etc.).
Then we also need to do this for the GPU as well to really do this correctly.
Just something to consider for a system that really allows us to use all of the hardware without crashing the tests.
FYI: I think @nmhamster and @crtrott have come up with a possible strategy for addressing the problem of running multiple MPI tests on the same nodes and GPUs at the same time while spreading out and managing resources (see TRIL-231). The idea is for ctest to be given an abstract model of the machine, which would include a set of homogeneous nodes of the form:
and rules about how ctest is allowed to assign this set of computing resources (i.e. nodes, cores, and GPUs) to MPI processes in tests, such as:
CTest already has a concept of multiple processes per test through the PROCESSORS test property. What we would need to do is extend ctest to define a new test property called something like THREADS_PER_PROCESS (default 1) to allow us to tell it (on a test-by-test basis) how many threads each process uses. (And we would need something similar for accelerators like GPUs.)
With this information, ctest would be extended to keep track of the computing resources (nodes, cores, and GPUs) and give them out to run tests as per the rules above (trying not to overload cores or GPUs unless explicitly requested to do so). When running a given test, ctest would set env vars to tell the test where it should run, and then we would extend TriBITS to construct mpirun commands on the fly that read these env vars and map them to the right things.
For example, to run a 2-process (PROCESSORS=2) MPI test that uses 4 cores per process (THREADS_PER_PROCESS=4), with everything running on the same socket and sharing the same GPU, ctest would first set env vars like:
export CTEST_TEST_NODE_0=3 # Node for 0th MPI process
export CTEST_TEST_NODE_1=3 # Node for 1st MPI process
export CTEST_TEST_CORES_0=0,1,2,3 # Cores for 0th MPI process
export CTEST_TEST_CORES_1=4,5,6,7 # Cores for 1st MPI process
export CTEST_TEST_ACCEL_0_0=1 # The 0th accelerator type (i.e. the GPU) for 0th MPI process
export CTEST_TEST_ACCEL_0_1=1 # The 0th accelerator type (i.e. the GPU) for 1st MPI process
and then we would have our test script wrapper read in these env vars and run (for OpenMPI for example):
export OMP_NUM_THREADS=4
mpirun --bind-to none \
-H ${CTEST_TEST_NODE_0} -n 1 \
-x CUDA_VISIBLE_DEVICES=${CTEST_TEST_ACCEL_0_0} \
taskset -c ${CTEST_TEST_CORES_0} <exec> : \
-H ${CTEST_TEST_NODE_1} -n 1 \
-x CUDA_VISIBLE_DEVICES=${CTEST_TEST_ACCEL_0_1} \
taskset -c ${CTEST_TEST_CORES_1} <exec>
(see CUDA_VISIBLE_DEVICES and the OpenMPI mpirun command-line options)
That way, the test would only run on that set of cores and GPUs on the 3rd node. We could do that mapping automatically using TriBITS through the existing functions tribits_add_test() and tribits_add_advanced_test(). (We would likely write a ctest -P script to read the env vars and create and run the mpirun command; tribits_add_advanced_test() already writes a ctest -P script for each test, so we could just extend that.)
We need to experiment with this some on various machines to see if this will work, but if it does, that is what will be implemented by Kitware in:
and in TriBITS.
Does anyone have sufficient interest and time to help experiment with this approach in the short term? Otherwise, it will be a while before I have time to experiment with this.
Otherwise, we will document these experiments in:
and
A pretty good representation of how well we are using our test machines when we have to use ctest -j1:
Forwarded by @fryeguy52 (original source unknown).
@bartlettroscoe / @fryeguy52 - that picture could be used all over DOE :-).
CC: @KyleFromKitware
@jjellio, continuing from the discussion started in https://github.com/kokkos/kokkos/issues/3040, I did timing of the Trilinos test suite with a CUDA build on 'vortex' for the 'ats2' env and I found that raw 'jsrun' does not spread out over the 4 GPUs on a node on that system automatically. However, when I switched over to the new CTest GPU allocation approach in commit https://github.com/trilinos/Trilinos/pull/7427/commits/692e990cc74d37045c7ddcc3561920d05c48c9f0 as part of PR #7427, I got perfect scalability of the TpetraCore_gemm tests up to ctest -j4. See the details in PR #7427. I also repeat the timing experiments done for that PR branch below.
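For context, here is a sketch of how a per-test wrapper can consume the CTest resource-allocation information and hand a GPU to the test (based on my reading of the CTest resource env vars, so the exact variable names and format should be double-checked against the CMake documentation; the jsrun line is a placeholder):
if [ -n "${CTEST_RESOURCE_GROUP_COUNT}" ]; then
  # e.g. CTEST_RESOURCE_GROUP_0_GPUS="id:2,slots:1"
  gpu_id=$(echo "${CTEST_RESOURCE_GROUP_0_GPUS}" | sed -e 's/.*id:\([^,;]*\).*/\1/')
  export CUDA_VISIBLE_DEVICES="${gpu_id}"
fi
jsrun -p 4 <test-exec>   # placeholder launch line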
@bartlettroscoe It depends entirely on the flags you've given to JSRUN. The issue I've linked to shows it working. It hinges on resource sets. What jsrun lines are you using?
What jsrun lines are you using?
@jjellio, I believe the same ones being used by SPARC that these were copied from. See lines starting at:
Since the CTest GPU allocation method is working so well, I would be hesitant to change what is currently in PR #7204.
Yep, and those options do not specify GPU or binding options. The lines currently used on most platforms for Trilinos testing are chosen to oversubscribe a system to get throughput.
The flags I used back then were:
jsrun -r1 -a1 -c4 -g1 -brs
-r1 = 1 resource set
-a1 = 1 task per resource set. (so 1 resource set ... I get 1 tasks total)
-c4 = 4 cores per task
-g1 = 1 GPU per task
-brs = bind to resource set (so you get a process mask that isolates resource sets)
The problem is that those flags use -a, which forbids -p. It could be that -a is what made the difference, but I tend to think it was -g1: Spectrum needs to know you want a GPU.
The flags I'd use are:
export ATDM_CONFIG_MPI_POST_FLAGS="-r;4;-c;4;-g;1;-brs"
export ATDM_CONFIG_MPI_EXEC_NUMPROCS_FLAG="-p"
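For a 4-rank test, that would expand to a launch line along the lines of the following (the executable name is a placeholder):
jsrun -p 4 -r 4 -c 4 -g 1 -brs <test-exec>   # 4 tasks, 4 resource sets/host, 4 cores + 1 GPU each, bound to resource sets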
@jjellio,
When you say:
The lines currently used on most platforms for Trilinos testing are chosen to oversubscribe a system to get throughput.
who is doing this testing?
Otherwise, we have had problems with robustness when trying to oversubscribe on some systems (I would have to dig up some references).
So, I just ran on the ATS2 testbed (rzansel). Using -r4 -c4 -g1 -brs -p1, the jobs are serialized (look for 'PID XYZ started'; if they were parallel, you'd expect 4 PIDs starting at the start):
[jjellio@rzansel46:async]$ for i in $(seq 1 20); do jsrun -r4 -c4 -g1 -brs -p1 ./runner.sh & done
[1] 92455
[2] 92456
[3] 92457
[4] 92458
[5] 92459
[jjellio@rzansel46:async]$ PID 92758 has started!
Rank: 00 Local Rank: 0 rzansel46
Cuda Devices: 0
CPU list: 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Sockets: 0 NUMA list: 0
Job will sleep 26 seconds to waste time
Job jsrun started at Tue May 26 11:23:36 PDT 2020
ended at Tue May 26 11:24:02 PDT 2020
PID 92853 has started!
Rank: 00 Local Rank: 0 rzansel46
Cuda Devices: 0
CPU list: 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Sockets: 0 NUMA list: 0
Job will sleep 20 seconds to waste time
Job jsrun started at Tue May 26 11:24:02 PDT 2020
ended at Tue May 26 11:24:22 PDT 2020
PID 92925 has started!
Rank: 00 Local Rank: 0 rzansel46
Cuda Devices: 0
CPU list: 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Sockets: 0 NUMA list: 0
Job will sleep 17 seconds to waste time
Job jsrun started at Tue May 26 11:24:22 PDT 2020
ended at Tue May 26 11:24:39 PDT 2020
PID 92963 has started!
Rank: 00 Local Rank: 0 rzansel46
Cuda Devices: 0
CPU list: 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Sockets: 0 NUMA list: 0
Job will sleep 30 seconds to waste time
Job jsrun started at Tue May 26 11:24:39 PDT 2020
ended at Tue May 26 11:25:09 PDT 2020
PID 93045 has started!
They become unserialized if you use -r1 -a1 -g1 -brs; that's pretty obnoxious. That line works in place of -p1.
So it would seem if you use:
export ATDM_CONFIG_MPI_POST_FLAGS="-r;4;-c;4;-g;1;-brs"
export ATDM_CONFIG_MPI_EXEC_NUMPROCS_FLAG="-p"
plus the Kitware/CTest stuff, it should work fine. My only skin in this game is more headaches on ATS2... I don't need any more headaches on ATS2.
So it would seem if you use:
export ATDM_CONFIG_MPI_POST_FLAGS="-r;4;-c;4;-g;1;-brs"
export ATDM_CONFIG_MPI_EXEC_NUMPROCS_FLAG="-p"
Plus the Kitware/CTest stuff, it should work fine.
What is the advantage of selecting those options over what is listed in atdm/ats2/environment.sh currently? What is broken that this is trying to fix?
My only skin in this game is more headaches on ATS2... I don't need anymore headaches on ATS2.
It is not just you and I, it is everyone running Trilinos tests on ATS-2. The ATDM Trilinos configuration should be the recommended way for people to run the Trilinos test suite on that system.
If you don't have a -g1 flag, then jsrun is not going to set CUDA_VISIBLE_DEVICES.
If Kitware is going to manage the CUDA device assignment, then it shouldn't matter.
-c4 -brs is process binding. It keeps your job on XYZ cores and prevents other jobs from being there. I recommend anything in the range 2 to 10; I've just found 4 to be a good number. If you don't have -cN, you get exactly 1 hardware thread running your job, and that isn't enough to keep the GPU happy and the host code happy. You need -c2 or more. Specifying -brs makes -c operate on logical cores (not hardware threads), so -c2 -brs gives you substantially more compute resources. You need a few hardware threads to keep the Nvidia threads happy (they spawn 4).
TLDR: just use -r4 -c4 -g1 -brs.
As for the oversubscription stuff:
How does the ctest work interact with KOKKOS_NUM_DEVICES or --kokkos-ndevices? With jsrun, it sets CUDA_VISIBLE_DEVICES, which makes kokkos-ndevices always see a zero device.
FYI, I had the same comment here: https://github.com/trilinos/Trilinos/pull/6724#issuecomment-582080188
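To illustrate the interaction being described (the device number is just an example): once CUDA_VISIBLE_DEVICES restricts a process to a single GPU, that GPU always shows up as device 0 inside the process, so a rank-based --kokkos-ndevices mapping has nothing left to choose between:
export CUDA_VISIBLE_DEVICES=2        # e.g. set by jsrun -g1 or by a CTest resource wrapper
<test-exec> --kokkos-ndevices=4      # the process only sees one GPU, and it is device 0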
This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE
label.
If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE
.
If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.
This is actually really close to getting done. We just need a tribits_add[_advanced]_test( ... NUM_THREADS_PER_PROC <numThreadsPerProc> ... ) argument for OpenMP builds and I think we have it. We have CUDA builds well covered now (at least for single-node testing).
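For an OpenMP build, what that proposed argument would boil down to at test-run time is roughly the following (a sketch assuming a test declared with NUM_MPI_PROCS 2 and NUM_THREADS_PER_PROC 4; everything except OMP_NUM_THREADS and mpiexec is illustrative):
export OMP_NUM_THREADS=4     # <numThreadsPerProc>
mpiexec -np 2 <test-exec>    # NUM_TOTAL_CORES_USED = 2 * 4 = 8, which ctest would reserve via PROCESSORS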
This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE
label.
If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE
.
If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.
While this has been partially addressed with the CMake Resource management and GPU limiting, the full scope of this Story has not been addressed yet (see above).
CC: @trilinos/framework, @trilinos/kokkos, @trilinos/tpetra, @nmhamster, @rppawlo
Next Action Status
Set up a meeting to discuss current status of threaded testing in Trilinos and some steps to try to address the issues ...
Description
It seems the testing strategy in Trilinos for threading is to build threaded code, run all of the threaded tests with the same number of threads (such as by setting export OMP_NUM_THREADS=2 when using OpenMP), and then run the test suite with ctest -j<N> with that fixed number of threads. But this approach, and testing with threads enabled in general, has some issues.

First, with some configurations and systems, running with any <N> with ctest -j<N> will result in all of the test executables binding their threads to the same cores, making things run very slowly, as described in https://github.com/trilinos/Trilinos/issues/2398#issuecomment-374379614. A similar problem occurs with CUDA builds, where the various test processes running concurrently do not spread the load across the available GPUs (see https://github.com/trilinos/Trilinos/issues/2446#issuecomment-375804502).

Second, even when one does not experience the above problem of binding to the same cores (which is not always a problem), this approach does not make very good usage of the test machine because it assumes that every MPI process is multi-threaded with Kokkos, which is not true. Even when OMP_NUM_THREADS > 1, there are a lot of Trilinos tests that don't have any threaded code, so even if ctest allocates room for 2 threads per MPI process, only one thread will be used. This results in not keeping many of the cores busy running code and therefore makes tests take longer to complete. The impact of the two problems above is that many developers and many automated builds have to run with a small ctest -j<N> (e.g. ctest -j8 is used on many of the ATDM Trilinos builds) and therefore do not utilize many of the cores that are available. This makes the time to run the full test suite go up significantly, which negatively impacts developer productivity (because developers have to wait longer to get feedback from running tests locally), wastes existing testing hardware, and/or limits the number of builds and the number of tests that can be run in a given testing day (which reduces the number of defects that we can catch and therefore costs Trilinos developers and users time and $$).

Third, having to run the entire Trilinos test suite with a fixed number of threads like export OMP_NUM_THREADS=2 or export OMP_NUM_THREADS=4 either does not result in very good testing or results in very expensive testing by having to run the entire Trilinos test suite multiple times. It has been observed that defects occur only at some thread counts, like export OMP_NUM_THREADS=5, for example. This would be like having every MPI test in Trilinos run with exactly the same number of MPI processes, which would not result in very good testing (and is not the case in Trilinos, as several tests are run with different numbers of MPI processes).

Ideas for Possible Solutions
First, ctest needs to be extended in order to inform it of the architecture of the system where it will be running tests. CTest needs to know the number of sockets per node, the number of cores per socket, the number of threads per node, and the number of nodes. We will also need to inform CTest about the number of MPI ranks vs. threads per MPI rank for each test (i.e. add a THREADS_PER_PROCESS property in addition to the PROCESSORS property). With that type of information, ctest should be able to determine the binding of the different ranks in an MPI job that runs a test to specific cores on sockets on nodes. And we will need to find a way to communicate this information to the MPI jobs when they are run by ctest. I think this means adding the types of process affinity and process placement that you see in modern MPI implementations (see https://github.com/open-mpi/ompi/wiki/ProcessAffinity and https://github.com/open-mpi/ompi/wiki/ProcessPlacement). See this Kitware backlog item.

Second, we should investigate how to add a NUM_THREADS_PER_PROC <numThreadsPerProc> argument to the TRIBITS_ADD_TEST() and TRIBITS_ADD_ADVANCED_TEST() commands. It would be good if this could be added directly to these TriBITS functions and provide some type of "plugin" system to allow us to define how the number of threads gets set when running the individual test. But the default TriBITS implementation could just compute NUM_TOTAL_CORES_USED <numTotalCoresUsed> from <numThreadsPerProc> * <numMpiProcs>.

The specialization of this new TriBITS functionality for Kokkos/Trilinos would set the number of requested threads based on the enabled threading model known at configure time. For OpenMP, it would set the env var OMP_NUM_THREADS=<numThreads>, and for other threading models it would pass in --kokkos-threads=<numThreads>. If the NUM_THREADS_PER_PROC <numThreadsPerProc> argument was missing, this could use a default number of threads (e.g. a global configure argument Trilinos_DEFAULT_NUM_THREADS with default 1). If the computed <numTotalCoresUsed> was larger than ${MPI_EXEC_MAX_NUMPROCS} (which should be set to the maximum number of threads that can be run on that machine when using threading), then the test would get excluded and a message would be printed to STDOUT. Such a CMake function should be pretty easy to write, if you know the threading model used by Kokkos at configure time.

Definition of Done:
Tasks: