CASITA is a tool for the automatic analysis of OTF2 trace files that have been generated with Score-P. It determines program activities with a high impact on the total program runtime and on the load balancing. CASITA generates an OTF2 trace with additional information such as the critical path, waiting times, and the causes of wait states. The same metrics are used to generate a summary profile which rates activities according to their potential to improve the program runtime and the load balancing. A summary of inefficiency patterns exposes waiting times in the individual programming models and APIs.
Internally, CASITA constructs a distributed DAG in which each node represents an event in time and edges represent dependencies between events on different locations (processes, threads, and CUDA streams). Events on the same location have an implicit dependency through the happens-before relation. The local DAGs, one per MPI process, are connected via remote edges. Only MPI, OpenMP, and CUDA nodes are represented in the graph; nevertheless, events from compiler instrumentation are accounted for.
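As an illustration only (these are not CASITA's actual data structures), the following C++ sketch shows one possible way such a distributed event graph could be represented; the names GraphNode, LocalGraph, and Paradigm are hypothetical:

    // Hypothetical, simplified representation of the distributed DAG described above.
    #include <cstdint>
    #include <utility>
    #include <vector>

    enum class Paradigm { MPI, OpenMP, CUDA };   // only these paradigms become graph nodes

    struct GraphNode {
        uint64_t time;                           // event timestamp
        uint64_t locationId;                     // process, thread, or CUDA stream
        Paradigm paradigm;                       // owning programming model
        std::vector<GraphNode*> successors;      // happens-before and cross-location edges
    };

    struct LocalGraph {
        std::vector<GraphNode*> nodes;                      // events of one MPI process
        std::vector<std::pair<int, uint64_t>> remoteEdges;  // (remote rank, remote node id)
    };

    int main() {
        GraphNode send{100, 0, Paradigm::MPI, {}};
        GraphNode recv{150, 1, Paradigm::MPI, {}};
        send.successors.push_back(&recv);        // edge: the send happens before the receive
        return 0;
    }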
Publications:
"CASITA: A Tool for Identifying Critical Optimization Targets in Distributed Heterogeneous Applications" http://dx.doi.org/10.1109/ICPPW.2014.35
"Scalable critical-path analysis and optimization guidance for hybrid MPI-CUDA applications" http://dx.doi.org/10.1177/1094342016661865
"Critical-blame analysis for OpenMP 4.0 offloading on Intel Xeon Phi" http://dx.doi.org/10.1016/j.jss.2015.12.050
"Integrating Critical-Blame Analysis for Heterogeneous Applications into the Score-P Workflow" http://dx.doi.org/10.1007/978-3-319-16012-2_8
"Analyzing Offloading Inefficiencies in Scalable Heterogeneous Applications" http://dx.doi.org/10.1007/978-3-319-67630-2_34
CASITA analysis requirements:
The MPI analysis is currently based on reenacting the MPI communication in forward and backward direction, which means that the respective communication records have to be available in the trace. CASITA also needs the region enter and leave events of the MPI communication functions. Currently, MPI support is limited to (two-sided) point-to-point communication and blocking collectives.
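As a hedged illustration of what the analysis can currently handle, the following minimal MPI program uses only blocking point-to-point communication and a blocking collective, so a Score-P trace of it contains the communication records and enter/leave events mentioned above. File names and the build/run commands in the comments are assumptions:

    // pingpong.cpp -- blocking point-to-point plus a blocking collective,
    // i.e. only MPI constructs that the current MPI analysis supports.
    // Possible (assumed) build and run commands:
    //   scorep mpicxx pingpong.cpp -o pingpong
    //   SCOREP_ENABLE_TRACING=true mpirun -n 2 ./pingpong
    #include <mpi.h>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int value = rank;
        if (rank == 0) {
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);           // blocking send
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                                  // blocking receive
        }

        int sum = 0;
        MPI_Allreduce(&value, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD); // blocking collective

        MPI_Finalize();
        return 0;
    }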
The OpenMP analysis is still based on the OPARI2 instrumentation. It requires the fork/join, parallel begin/end, and barrier begin/end records. Both the MPI and the OpenMP analysis work with the default Score-P trace output.
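For illustration, a minimal OpenMP program that produces the required record types when built with Score-P's OPARI2-based instrumentation (the build command in the comment is an assumption):

    // barrier.cpp -- one parallel region (fork/join, parallel begin/end records)
    // with an explicit barrier (barrier begin/end records).
    // Assumed build command: scorep g++ -fopenmp barrier.cpp -o barrier
    #include <cstdio>
    #include <omp.h>

    int main() {
        #pragma omp parallel
        {
            int tid = omp_get_thread_num();
            std::printf("thread %d before barrier\n", tid);

            #pragma omp barrier   // imbalance before this point shows up as waiting time

            std::printf("thread %d after barrier\n", tid);
        }
        return 0;
    }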
CUDA analysis has been supported since Score-P 1.3. The respective OTF2 trace file has to contain the following information:
The OpenCL analysis additionally requires kernel dependencies, e.g. to detect which OpenCL command queue a kernel is enqueued to or synchronized with via clFinish. This is currently implemented in a Score-P development branch. OpenACC analysis is indirectly supported via the low-level paradigms CUDA and OpenCL.
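To make these dependencies concrete, the following minimal OpenCL host sketch (a generic example with error handling omitted, not CASITA code) enqueues a kernel to a command queue and synchronizes it with clFinish; these are exactly the relations the OpenCL analysis needs to reconstruct:

    // scale.cpp -- enqueue a kernel to a command queue and synchronize with clFinish.
    // Error handling is omitted for brevity.
    #include <CL/cl.h>
    #include <cstdio>
    #include <vector>

    int main() {
        cl_platform_id platform;
        clGetPlatformIDs(1, &platform, nullptr);
        cl_device_id device;
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, nullptr);

        cl_int err;
        cl_context context = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
        cl_command_queue queue = clCreateCommandQueue(context, device, 0, &err);

        const char* src =
            "__kernel void scale(__global float* x) { x[get_global_id(0)] *= 2.0f; }";
        cl_program program = clCreateProgramWithSource(context, 1, &src, nullptr, &err);
        clBuildProgram(program, 1, &device, nullptr, nullptr, nullptr);
        cl_kernel kernel = clCreateKernel(program, "scale", &err);

        std::vector<float> host(1024, 1.0f);
        cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                    host.size() * sizeof(float), host.data(), &err);
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);

        size_t global = host.size();
        // The kernel is enqueued to a specific command queue ...
        clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global, nullptr,
                               0, nullptr, nullptr);
        // ... and the host blocks on that queue: the dependency the analysis has to detect.
        clFinish(queue);

        clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, host.size() * sizeof(float),
                            host.data(), 0, nullptr, nullptr);
        std::printf("host[0] = %f\n", host[0]);

        clReleaseMemObject(buf);
        clReleaseKernel(kernel);
        clReleaseProgram(program);
        clReleaseCommandQueue(queue);
        clReleaseContext(context);
        return 0;
    }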