Teuchos::TimeMonitor: Use Kokkos profiling region hooks if Kokkos is available

mhoemmen commented 7 years ago

@trilinos/teuchos @nmhamster @jjellio

See discussion in #874.

ibaned commented 7 years ago

To be clear: what this means is that when a Teuchos timer starts, if Kokkos is available, it would call Kokkos::Profiling::pushRegion, and similarly at stop time it would call Kokkos::Profiling::popRegion ?

mhoemmen commented 7 years ago

@ibaned Teuchos::TimeMonitor is a scope guard for Teuchos::Time. TimeMonitor's constructor would call Kokkos::Profiling::pushRegion, and TimeMonitor's destructor would call Kokkos::Profiling::popRegion.

mhoemmen commented 7 years ago

TimeMonitor would continue to start and stop its Time object, for compatibility with existing code that uses TimeMonitor::{report, summarize} for global statistics.

nmhamster commented 7 years ago

@mhoemmen @ibaned - we should talk about the creation/destruction order of Teuchos objects because downstream tools expect these to be ordered (a true stack). I /do/ this exposing this via Teuchos is the right thing to do though. Can we talk over email?

jjellio commented 7 years ago

I don't see any issue with exposing the Kokkos timers via Tuechos.

I have a forked Teuchos, that I am not sure what to do with. The forked TimeMonitor computes descriptive stats on each processor, rather than simple aggregates. This enables the report to do things like look at the min/max/mean median values, which I feel is the direction performance monitoring tools should head. On node variability seems to be worse, not better as we introduce hybrid tasks.

All of that said, my flaky hack is not public and not well tested (hence I have not used it). If we are going to take time to think about this stuff, I think we should think carefully about how to address performance variations. We still want to see end-to-end timings, but for app developers distilling the machine noise from algorithmic efforts is my top concern.

ibaned commented 7 years ago

@jjellio The Kokkos profiling tool I wrote includes a measure of imbalance across MPI ranks (max / mean). If the Teuchos timers call the Kokkos profiling functions, we could extend that tool to output the desired YAML format.

mhoemmen commented 7 years ago

I've been talking with Si about this a bit over e-mail. I think we all agree that the Kokkos profiling tools are good. We just want to make sure that we don't excessively disrupt existing workflows by James and the MueLu developers.

nmhamster commented 7 years ago

So I agree we do not want to disrupt the profiling work to date, that would be bad. One of the advantages of the Kokkos interface is that it can exist at the higher levels of the application. We have some significant but in from across the space in this respect. We may find this useful for context. Further, the profiling hooks are build into applications in the field. It would be good to understand the issues you guys see as a negative of this approach for longer term support (beyond the current Muelu study).

jjellio commented 7 years ago

The choice to use Teuchos timers is precisely because they are baked in to apps. The workflow is Tuechos timer based, so this wouldn't change anything beside adding more timer labels and potentially more measurements (minOverProcs, maxOverProcs, meanOverProcs, +___)

@ibaned How do you determine variations? An annoyance with aggregate timing is that you lose call count specific information. What I did, was compute the quartiles on each process, then report the median and 1st and 3rd quartiles. With this info, the Mean of the (per process) medians is meaningful. That bakes in the assumption that independent tasks should be independent, hence, the per process median times should be similar. You can use a global mean over the medians, or better yet, compute the median of the medians. (Or even more involved actually compute the true global median - this is annoying because you need to perform a parallel sort over the timings, or assume you can aggregate all timing data to a single process, which may be quite large...)

The real problem with descriptive stats is that you need to retain the specific timings. A timing region goes from being a += event, to being an assignment into a vector. From my experience, you can have around 1k timers, which means 1k arrays, and call counts can usually be over one thousands. That requires a non-trivial amount of storage.

I did some preliminary tests with preallocating storage version std::vector::push_back(), and noticed no real benefit.

Given the overhead of computing descriptive stats, we could add a flag to the timer's constructor, or perhaps compile time guard it with a configure variable.

You can also approximate the median by using binning .That entails either apriori knowledge or suitable bins, or the ability to guess suitable bins (I searched around for info on parallel median calculations and found very little.)

jjellio commented 7 years ago

@nmhamster My data is actually more detailed than MueLu. I have fairly good coverage of Tpetra's point CSR and MV operations. The challenge is really in getting this automated, reduced, and presented in a meaningful way. Given Trilinos based apps embraced Teuchos timers, there is no reason to change that.

A feature we could consider, is that report takes a list of columns to report. The default being aggregate timings. This would allow the regions to track other stuff, while keeping the timer report identical to what it currently looks like.

nmhamster commented 7 years ago

@jjellio - except not all applications we are trying to work with do use Teuchos timers (although I understand the Muelu cases you're looking at do). Somehow we need to work out how to merge the data we are all collecting, YAML is possibly a way to go. We should talk when I'm back at the lab so I understand what you're getting it because Github messages are not a good format :-).

nmhamster commented 7 years ago

@ibaned - see Kokkos (https://github.com/kokkos/kokkos/pull/1198) for unstructured "sections". If you plan to map these into Teuchos for tools like VTune etc, you need to use sections, not regions because the regions must be ordered correctly.

mhoemmen commented 7 years ago

@nmhamster wrote:

... [W]e should talk about the creation/destruction order of Teuchos objects because downstream tools expect these to be ordered (a true stack)....

Teuchos::TimeMonitor is a scope guard for Teuchos::Time. This means that starts and stops should normally be ordered, if I understand what you mean correctly.

The main difference is that TimeMonitor's constructor does not reset the Time object by default. This means that the timers (Time objects) tracked by TimeMonitor measure total time. They also track start counts, so TimeMonitor::report and TimeMonitor::summarize can report average times.

TimeMonitor actually has optional Valgrind hooks: https://trilinos.org/docs/dev/packages/teuchos/doc/html/classTeuchos_1_1TimeMonitor.html

It also predates Kokkos (at least Kokkos 2.0) by a few years ;-) . MueLu uses it extensively and built an infrastructure on top of it, in order to track timings for different multigrid levels.

nmhamster commented 7 years ago

@mhoemmen - if you can guarantee all calls to the timers are perfectly nested (i.e. there is no overlap) then you are good to go with Kokkos regions. This is somewhat an issue presented to us by various vendor tools, not a Kokkos implantation choice.

I'm not saying we shouldn't use Teuchos, I'm trying to point out the advantages of having genuine hooks that allow third party connectivity rather than having to extend Teuchos (and add dependencies all the time). The hooks give us a lot of freedom. It's somewhat orthogonal to @jjellio studies in some ways. I think these can coexist without any problem and this would give the application teams and tool writers some significant improvement in their ability to get deeper into the code. Particularly within ECP ST there are many tools who would like this kind of connectivity. Teuchos timers as-is can stay and continue to be useful.

mhoemmen commented 7 years ago

@nmhamster wrote:

...[I]f you can guarantee all calls to the timers are perfectly nested (i.e. there is no overlap) then you are good to go with Kokkos regions.

Hmmm, I think I get what you mean now. Correct use of Teuchos::TimeMonitor implies that each timer cannot overlap with itself. It's possible to get timers to overlap with each other, but it requires breaking the scope guard idiom and handling TimeMonitor objects by pointer. For example:

auto t1 = TimeMonitor::getNewCounter ("t1");
auto t2 = TimeMonitor::getNewCounter ("t2");

// This line starts t1.  Please don't ever do this; it's annoying.
std::unique_ptr<TimeMonitor> tm1 (new std::unique_ptr<TimeMonitor> (*t1));
// ... do some stuff ...
{
  TimeMonitor tm2 (*t1); // this line starts t2
  // ... do some other stuff ...
  tm1.reset (); // t1 stops here.  Please don't ever do this.
  // ... do yet more stuff ...
} // this line stops t2

It would be weird to do this. MueLu developers will sometimes handle TimeMonitor objects by pointer, because they haven't developed an interface for an "optional scope guard." However, they shouldn't overlap timers like this, as far as I know.

nmhamster commented 7 years ago

@mhoemmen - I'm a little nervous that it also means all downstream calls must also nest correctly. While I think it's possible to do that with care I don't know if we will always get it right. This would seriously upset some of the vendor tools we connect into. Let me know what you think is best.

jjellio commented 7 years ago

@nmhamster, I don't see how any of this would impact me. I would view it as an optional way to have Tuechos::TimeMonitor::report() have cool new columns, such as energy, memory usage, numa faults, or vectorization use. The latter would be an extremely helpful column to have.

nmhamster commented 7 years ago

@jjellio - that's kind of what I've been trying to tell you guys :-) ... but we want/need Teuchos to support it. The confusion comes that we could do timing this way too -- but I mean more through profiling etc, not through application reporting etc. Basically we want to build of the timer placement you're already doing so we don't need to put explicit Kokkos profiling code when Teuchos is already there (application developers get two for one). I think there's opportunities to add more hardware stuff for your studies if we can pull this off.

nmhamster commented 7 years ago

@jjellio - we also maybe able to give you more detail from Kokkos regions in higher level application calls. Some of the applications seem to use these now, admittedly it's often blended with Teuchos. That's why I was saying we get application level info too.

mhoemmen commented 7 years ago

@nmhamster wrote:

I'm a little nervous that it also means all downstream calls must also nest correctly.

Does my example illustrate a case of "incorrect nesting of downstream calls"? If not, what would be an example of this?

I'm trying to pin this down because it may be that we are worrying about nothing, since correct use of TimeMonitor should imply no nesting.

jjellio commented 7 years ago

@nmhamster I am not really married to Teuchos, I have adopted the YAML data format though.

I am going to write data some details of the above. The data, metadata, and a consistent structure for them is what I view as the goal. The integration into Tuechos is nice, but it doesn't resolve the problems of how performance data is communicated.

I'll mock something up and send it out.

mhoemmen commented 4 years ago

This has been in Trilinos for a long time: https://github.com/trilinos/Trilinos/blob/4e2aa2f8354d3826292a7e68dda3875a0d1ca161/packages/teuchos/core/src/Teuchos_Time.cpp#L129

trilinos / Trilinos

Teuchos::TimeMonitor: Use Kokkos profiling region hooks if Kokkos is available #1914