Closed rppawlo closed 6 years ago
Just FYI, this stacked printout is exactly what the space-time-stack
Kokkos profiling tool does, so another way to do this is to use that tool and have Teuchos timers emit Kokkos profiling info.
@rppawlo @bathmatt Thanks! :-D Should TimeMonitor just do this by default? I think sorting is a byproduct of set union / set intersection (not all timers may live on all processes in the communicator, so TimeMonitor does a fancy thing that I implemented long ago).
@trilinos/muelu has a different set of modifications to TimeMonitor; do y'all talk to each other?
@ibaned I agree but Trilinos developers have voiced concerns about the level of support for Kokkos profiling tools, compared with the simple-but-effective TimeMonitor approach.
@ibaned - did not know about that. Thanks for pointing out. Will look into it.
@mhoemmen - If time monitor would do the nested sorting, it would be a big help. I'm not sure how much work that would be though as opposed to just using the new StackedTimer object. One difference between StackedTimer and TimeMonitor is that app developers didn't like was the destructor stopping the timer. They felt more comfortable with an explicit stop command. Also - having the ability to aggregate over a subset of mpi processes is critical. We use that in the multiphysics coupling in CASL since the apps live on different sets of processes. wrt your question about muelu - ironically, we are using the StackedTimer to look at muelu performance and are working with Jonathan right now but didn't discuss particulars of how we were collecting the timings. I will follow up with him. Good to know! Thanks!
I just updated the output above to enable all trilinos timers (I only had specific packages enabled before). This is for a Drekar IMEX run. @mhoemmen - as you mentioned, the muelu timers bypass the TimeMonitor object so muelu times are not being reported in the stacked timer output.
@mhoemmen wrote:
I agree but Trilinos developers have voiced concerns about the level of support for Kokkos profiling tools, compared with the simple-but-effective TimeMonitor approach.
Would it help to have a meeting between those expressing concerns and the Kokkos developers? It just seems like there are duplicate efforts here. This StackedTimer is great, because it works exactly the way I think timers ought to, and hence the way I designed the Kokkos profiling output. I'm just worried about us ending up with three different profiling mechanisms in Kokkos+Teuchos, especially if users can't consolidate information from all three into one whole picture. @nmhamster may be interested in this thread as well. At a minimum I suggest that StackedTimer and regular Teuchos timers make Kokkos profiling calls, so that there is one place where all the information collects.
@mhoemmen - can you describe the concerns about the profiling tools support that you are hearing? It would be good to understand what isn't available or working that needs to be.
A meeting would be good to talk about this. The StackedTimer is the object that EMPIRE developed and is using for all application level timings, so regardless of whether it goes into Trilinos or we find a unified solution, we should should integrate the kokkos profiling into it.
@rppawlo / @ibaned / @mhoemmen - can we set something up? Kokkos has profiling API calls available for this you can use.
Can we meet today by any chance?
I sent out an invite for next Tuesday, but I will be here today till noon
@ibaned @nmhamster I don't have concerns personally with the profiling tools, but I've heard MueLu developers express dissatisfaction. I think it would help to invite them so they can air that; right now it's more something that comes out in occasional remarks.
@nmhamster wrote:
Kokkos has profiling API calls available for this you can use.
I know; thanks for working on them btw :-D I try to advocate them but I get pushback for various reasons. I'll write more in an e-mail to you and @ibaned .
@rppawlo - I am available after 2 Mountain. @mhoemmen - thanks for the heads up. We should try to work this out.
@rppawlo wrote:
we should should integrate the kokkos profiling into it.
FYI, Teuchos::TimeMonitor is hooked into Kokkos profiling regions by default. I forget who plugged it in -- maybe @aprokop ? -- and I enabled the feature by default.
FYI, Teuchos::TimeMonitor is hooked into Kokkos profiling regions by default. I forget who plugged it in -- maybe @aprokop ? -- and I enabled the feature by default.
I think @cgcgcg did this.
@ibaned - I was able to run the kokkos space-time-stack on a simple panzer tests. There it worked fine other than having mangled names for certain objects (see below). However when I tried to run this on the same drekar test problem as above it just seg faulted at the very beginning of the run. This is on openmp build with gcc 7.2 on my workstation. One mpi process, and OMP_NUM_THREADS=1. I put the gdb stack trace below. valgrind didn't turn up anything.
Mangled output:
8: <average time> <percent of total time> <percent time in Kokkos> <percent MPI imbalance> <number of calls> <name> [type]
8: ===================
8: |-> 5.15e+00 sec 97.8% 13.9% 0.0% 1 User App: Total Time [region]
8: |-> 2.78e+00 sec 52.8% 21.8% 0.0% 5 Thyra::NOXNonlinearSolver::solve [region]
8: | | |-> 9.65e-01 sec 18.3% 37.0% 0.1% 12 panzer::ModelEvaluator::evalModel(J) [region]
8: | | | |-> 8.10e-01 sec 15.4% 44.1% 3.9% 12 panzer::AssemblyEngine::evaluate_volume(panzer::Traits::Jacobian) [region]
8: | | | | |-> 9.78e-02 sec 1.9% 100.0% 2.4% 2400 N6panzer29Integrator_GradBasisDotVectorINS_6Traits8JacobianES1_EE/N6panzer29Integrator_GradBasisDotVectorINS_6Traits8JacobianES1_E12FieldMultTagILi0EEE [for]
8: | | | | |-> 9.34e-02 sec 1.8% 100.0% 4.2% 2400 N6panzer27Integrator_BasisTimesScalarINS_6Traits8JacobianES1_EE/N6panzer27Integrator_BasisTimesScalarINS_6Traits8JacobianES1_E12FieldMultTagILin1EEE [for]
8: | | | | |-> 7.22e-02 sec 1.4% 100.0% 1.9% 2400 N6panzer12_GLOBAL__N_121evaluateGrad_withSensIN6Sacado3Fad4DFadIdEEN3PHX7MDFieldIdNS_4CellENS_5BASISENS_2IPENS_3DimEvvvvEEEE [for]
8: | | | | |-> 5.36e-02 sec 1.0% 100.0% 5.3% 2400 N6panzer27Integrator_BasisTimesScalarINS_6Traits8JacobianES1_EE/N6panzer27Integrator_BasisTimesScalarINS_6Traits8JacobianES1_E12FieldMultTagILi0EEE [for]
8: | | | | |-> 4.00e-02 sec 0.8% 100.0% 4.6% 2400 N6panzer12dof_functors26EvaluateDOFWithSens_ScalarIN6Sacado3Fad4DFadIdEEN3PHX7MDFieldIdNS_4CellENS_5BASISENS_2IPEvvvvvEEEE [for]
gdb backtrace:
STK Mesh data:
Spatial dim = 2
Entity counts (Nodes, Edges, Cells) = ( 3075, 5122, 2048 )
Element blocks = "eblock-0_0"
Sidesets = "bottom" "left" "right" "top"
Nodesets = "lower_left" "origin"
Program ended before "panzer::DOFManager::buildGlobalUnknowns/panzer::DOFManager::buildTaggedMultiVector/panzer::DOFManager::builderOverlapMapFromElements/" ended
Thread 1 "drekar.exe" received signal SIGABRT, Aborted.
0x0000003e4f6325e5 in raise () from /lib64/libc.so.6
(gdb) bt
#0 0x0000003e4f6325e5 in raise () from /lib64/libc.so.6
#1 0x0000003e4f633dc5 in abort () from /lib64/libc.so.6
#2 0x00007fffba9677f2 in (anonymous namespace)::State::~State (this=0xefdd10, __in_chrg=<optimized out>) at kp_space_time_stack.cpp:456
#3 0x00007fffba967ad5 in kokkosp_finalize_library () at kp_space_time_stack.cpp:570
#4 0x00007fffca26b63e in Kokkos::Profiling::finalize () at /ascldap/users/rppawlo/Trilinos/packages/kokkos/core/src/impl/Kokkos_Profiling_Interface.cpp:306
#5 0x00007fffca26efb6 in Kokkos::OpenMP::finalize () at /ascldap/users/rppawlo/Trilinos/packages/kokkos/core/src/OpenMP/Kokkos_OpenMP_Exec.cpp:421
#6 0x00007fffca26ea7e in Kokkos::OpenMP::initialize (thread_count=-1) at /ascldap/users/rppawlo/Trilinos/packages/kokkos/core/src/OpenMP/Kokkos_OpenMP_Exec.cpp:293
#7 0x00007fffca26362e in Kokkos::Impl::(anonymous namespace)::initialize_internal (args=...) at /ascldap/users/rppawlo/Trilinos/packages/kokkos/core/src/impl/Kokkos_Core.cpp:103
#8 0x00007fffca264a47 in Kokkos::initialize (arguments=...) at /ascldap/users/rppawlo/Trilinos/packages/kokkos/core/src/impl/Kokkos_Core.cpp:499
#9 0x00007fffccf0a343 in KokkosCompat::Details::(anonymous namespace)::initializeKokkosWithStruct (args=...)
at /ascldap/users/rppawlo/Trilinos/packages/teuchos/kokkoscompat/src/KokkosCompat_Details_KokkosInit.cpp:38
#10 0x00007fffccf11fcd in std::__invoke_impl<void, void (&)(Kokkos::InitArguments const&), Kokkos::InitArguments const&> (
__f=@0x7fffccf0a31f: {void (const Kokkos::InitArguments &)} 0x7fffccf0a31f <KokkosCompat::Details::(anonymous namespace)::initializeKokkosWithStruct(Kokkos::InitArguments const&)>, __args#0=...)
at /home/rppawlo/tpls/gcc/7.2.0/include/c++/7.2.0/bits/invoke.h:60
#11 0x00007fffccf10ae5 in std::__invoke<void (&)(Kokkos::InitArguments const&), Kokkos::InitArguments const&> (
__fn=@0x7fffccf0a31f: {void (const Kokkos::InitArguments &)} 0x7fffccf0a31f <KokkosCompat::Details::(anonymous namespace)::initializeKokkosWithStruct(Kokkos::InitArguments const&)>, __args#0=...)
at /home/rppawlo/tpls/gcc/7.2.0/include/c++/7.2.0/bits/invoke.h:95
#12 0x00007fffccf0f576 in void std::call_once<void (&)(Kokkos::InitArguments const&), Kokkos::InitArguments const&>(std::once_flag&, void (&)(Kokkos::InitArguments const&), Kokkos::InitArguments const&)::{lambda()#1}::operator()() const (__closure=0x7fffffff2140) at /home/rppawlo/tpls/gcc/7.2.0/include/c++/7.2.0/mutex:672
#13 0x00007fffccf0f5c1 in void std::call_once<void (&)(Kokkos::InitArguments const&), Kokkos::InitArguments const&>(std::once_flag&, void (&)(Kokkos::InitArguments const&), Kokkos::InitArguments const&)::{lambda()#2}::operator()() const (__closure=0x0) at /home/rppawlo/tpls/gcc/7.2.0/include/c++/7.2.0/mutex:677
#14 0x00007fffccf0f5d2 in void std::call_once<void (&)(Kokkos::InitArguments const&), Kokkos::InitArguments const&>(std::once_flag&, void (&)(Kokkos::InitArguments const&), Kokkos::InitArguments const&)::{lambda()#2}::_FUN() () at /home/rppawlo/tpls/gcc/7.2.0/include/c++/7.2.0/mutex:677
#15 0x0000003e4fe0ce03 in pthread_once () from /lib64/libpthread.so.0
#16 0x00007fffccf0a2ca in __gthread_once (__once=0x7fffcd11ddec <KokkosCompat::Details::initKokkosWithCmdLineArgs_flag>, __func=0x7fffc5aebc70 <std::__once_proxy()>)
at /home/rppawlo/tpls/gcc/7.2.0/include/c++/7.2.0/x86_64-pc-linux-gnu/bits/gthr-default.h:699
#17 0x00007fffccf0f664 in std::call_once<void (&)(Kokkos::InitArguments const&), Kokkos::InitArguments const&> (__once=...,
__f=@0x7fffccf0a31f: {void (const Kokkos::InitArguments &)} 0x7fffccf0a31f <KokkosCompat::Details::(anonymous namespace)::initializeKokkosWithStruct(Kokkos::InitArguments const&)>, __args#0=...)
at /home/rppawlo/tpls/gcc/7.2.0/include/c++/7.2.0/mutex:684
#18 0x00007fffccf0a517 in KokkosCompat::Details::initializeKokkos (theArgs=...) at /ascldap/users/rppawlo/Trilinos/packages/teuchos/kokkoscompat/src/KokkosCompat_Details_KokkosInit.cpp:85
#19 0x00007fffccf070bd in Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace>::KokkosDeviceWrapperNode (this=0x11380b0, params=...)
at /ascldap/users/rppawlo/Trilinos/packages/teuchos/kokkoscompat/src/KokkosCompat_ClassicNodeAPI_Wrapper.cpp:131
#20 0x00007fffcd963366 in KokkosClassic::Details::getNode<Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> > (params=...)
at /ascldap/users/rppawlo/Trilinos/packages/tpetra/classic/NodeAPI/Kokkos_DefaultNode.cpp:56
#21 0x00007ffff3222812 in Tpetra::defaultArgNode<Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> > () at /ascldap/users/rppawlo/Trilinos/packages/tpetra/core/src/Tpetra_Map_decl.hpp:90
#22 0x00007fffd0786ecc in Tpetra::createNonContigMapWithNode<int, int, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> > (elementList=..., comm=..., node=...)
at /ascldap/users/rppawlo/Trilinos/packages/tpetra/core/src/Tpetra_Map_def.hpp:2165
#23 0x00007fffd0796dea in Tpetra::createNonContigMap<int, int> (elementList=..., comm=...) at /ascldap/users/rppawlo/Trilinos/packages/tpetra/core/src/Tpetra_Map_def.hpp:2145
#24 0x00007fffeb7fa6fd in panzer::DOFManager<int, int>::buildOverlapMapFromElements (this=0x11f31f0, access=...) at /ascldap/users/rppawlo/Trilinos/packages/panzer/dof-mgr/src/Panzer_DOFManager_impl.hpp:1278
#25 0x00007fffeb7fadb3 in panzer::DOFManager<int, int>::buildTaggedMultiVector (this=0x11f31f0, ownedAccess=...) at /ascldap/users/rppawlo/Trilinos/packages/panzer/dof-mgr/src/Panzer_DOFManager_impl.hpp:895
#26 0x00007fffeb7f5ca3 in panzer::DOFManager<int, int>::buildGlobalUnknowns (this=0x11f31f0, geomPattern=...) at /ascldap/users/rppawlo/Trilinos/packages/panzer/dof-mgr/src/Panzer_DOFManager_impl.hpp:484
#27 0x00007fffeb795746 in panzer::BlockedDOFManager<int, int>::buildGlobalUnknowns (this=0x11e4ee0, indexer=..., geomPattern=...)
at /ascldap/users/rppawlo/Trilinos/packages/panzer/dof-mgr/src/Panzer_BlockedDOFManager_impl.hpp:722
#28 0x00007fffeb792bd9 in panzer::BlockedDOFManager<int, int>::buildGlobalUnknowns (this=0x11e4ee0, geomPattern=...)
at /ascldap/users/rppawlo/Trilinos/packages/panzer/dof-mgr/src/Panzer_BlockedDOFManager_impl.hpp:1012
#29 0x00007fffeb7929c5 in panzer::BlockedDOFManager<int, int>::buildGlobalUnknowns (this=0x11e4ee0) at /ascldap/users/rppawlo/Trilinos/packages/panzer/dof-mgr/src/Panzer_BlockedDOFManager_impl.hpp:1058
#30 0x00007fffed37d504 in panzer::BlockedDOFManagerFactory<int, int>::buildUniqueGlobalIndexer (this=0x7fffffff41d0, mpiComm=..., physicsBlocks=..., connMngr=..., fieldOrder=...)
at /ascldap/users/rppawlo/Trilinos/packages/panzer/disc-fe/src/Panzer_BlockedDOFManagerFactory_impl.hpp:172
#31 0x00007ffff4217897 in drekar::UniqueGlobalIndexerFactory<int>::buildUniqueGlobalIndexer (this=0x7fffffff4590, mpiComm=..., physicsBlocks_exp=..., physicsBlocks_imp=..., connManager=..., fieldOrder=...,
numExpBlocks=@0xebd510: 1) at /ascldap/users/rppawlo/Trilinos/DrekarBase/drekar/src/core/utilities/Drekar_UniqueGlobalIndexerFactory_impl.hpp:156
#32 0x00007ffff41e74b2 in drekar::PhysicsModelEvaluatorContainer::buildGlobalIndexers (this=0xebd480, fieldOrder=..., auxFieldOrder=...)
at /ascldap/users/rppawlo/Trilinos/DrekarBase/drekar/src/core/utilities/Drekar_PhysicsModelEvaluatorContainer.cpp:166
#33 0x00007ffff3f6d69f in drekar::Driver::Driver (this=0x7fffffffad30, in_comm=..., in_eqset_factory=..., in_cm_factory=..., in_bc_factory=..., in_adapterParams=..., input_params=..., distrParamLib=...,
auxGlobEvalDataContainer=..., in_adjoint_eqset_factory=..., in_adjoint_upscaling=false, instance_name=..., exodus_io_num_procs=0, inline_x_elements=-1, inline_y_elements=-1, inline_z_elements=-1,
in_output_responses=false, in_input_file_name=..., in_mat_data_path=..., in_adjoint_input_file_name=..., in_ho_input_file_name=...)
at /ascldap/users/rppawlo/Trilinos/DrekarBase/drekar/src/core/utilities/Drekar_Driver.cpp:518
#34 0x00000000007a838b in main (argc=2, argv=0x7fffffffc2e8) at /ascldap/users/rppawlo/Trilinos/DrekarBase/drekar/bin/Drekar_Main.cpp:413
@rppawlo these should be opened as issues on the kokkos/kokkos-tools
repository. Mangled names are expected. Although I can try to add demangling support, the best solution is to pass a name string as the first argument to parallel_for
, then it will show up with that name. As for the segfault, I'd have to replicate the build to see what happened.
LABEL YER KERNELS PLZ :D
All done!
There is a beautiful StackedTimer object that @bathmatt developed that tracks the timer levels to show the ordering that events are called in addition to reporting timing information. We have just added it to the teuchos package in a branch. We have also added the ability to optionally inject the TimeMonitor results into the StackedTimer. This allows codes instrumented with TimeMonitor to automatically inject timings into the StackedTimer.
An example of output from TimeMonitor compared to StackedTimer is below for a Drekar unit test (we will clean up the spacing of output times before moving to PR). It literally takes 3 lines of code to get the StackedTimer report from main if you are already using the TimeMonitor.
@trilinos/teuchos @trilinos/panzer @bathmatt
Expectations
Get more information from timings
Current Behavior
TimeMonitor prints the timers in order of sotrage in the map. We lose information about the order these are called and nesting. This can be critical in diagnosing issues.
Motivation and Context
The new StackedTimer retains information about ordering and can print all information previously generated with the TimeMonitor. In addition, it is interoperable with TimeMonitor in that any code instrumented with the TimeMonitor will inject timers into StackedTimer.
TimeMonitor report:
StackedTimer report: