tud-zih-energy / lo2s

Linux OTF2 Sampling - A Lightweight Node-Level Performance Monitoring Tool
https://tu-dresden.de/zih/forschung/projekte/lo2s?set_language=en
GNU General Public License v3.0

Tracing NVHPC compiled MPI applications results in: Aborting: No such process #320

Closed: cvonelm closed this issue 4 months ago

cvonelm commented 4 months ago

Compile this MPI Hello World:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    // Initialize the MPI environment
    MPI_Init(NULL, NULL);

    // Get the number of processes
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Get the rank of the process
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Get the name of the processor
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    // Print off a hello world message
    printf("Hello world from processor %s, rank %d out of %d processors\n",
           processor_name, world_rank, world_size);

    // Finalize the MPI environment.
    MPI_Finalize();
}

with NVHPC and run it as

# lo2s -- mpirun -np `nproc` ./mpi_hello_world

and it will result in a crash:

[403761387667840][pid: 18869][tid: 18869][ WARN]: Attempting to update name of unknown process 18958 (mpi_hello_world)
[403763207670400][pid: 18869][tid: 18869][ WARN]: Attempting to update name of unknown process 19709 (nvidia-modprobe)
[403763208176640][pid: 18869][tid: 18869][ WARN]: Attempting to update name of unknown process 19714 (nvidia-modprobe)
[403763208208224][pid: 18869][tid: 18869][ WARN]: Attempting to update name of unknown process 19715 (nvidia-modprobe)
[403763208223936][pid: 18869][tid: 18869][ WARN]: Attempting to update name of unknown process 19712 (nvidia-modprobe)
[403763208875264][pid: 18869][tid: 18869][ERROR]: perf_event_open for sampling failed
[403763208881856][pid: 18869][tid: 18869][ERROR]: maybe the specified clock is unavailable?
[403763208920448][pid: 18869][tid: 18869][ERROR]: Failure while adding new process forked from thread 19067: No such process
[403763357557952][pid: 18869][tid: 18869][FATAL]: Aborting: No such process
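
The log does not say why perf_event_open failed, and the thread does not describe what the fix changed. One plausible reading, stated here as an assumption, is a race with short-lived helper processes such as nvidia-modprobe: by the time a monitor tries to attach, the process has already exited, and perf_event_open(2) reports ESRCH ("No such process"). A minimal sketch of that behaviour, independent of lo2s (the helper name perf_open_errno_for_exited_child is hypothetical):

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of lo2s.
 * Forks a child that exits at once, reaps it, then tries to open a perf
 * event on the now-dead PID.
 * Returns 0 if the event was opened, -1 if fork/waitpid failed, otherwise
 * the errno reported by perf_event_open.
 */
int perf_open_errno_for_exited_child(void)
{
    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0)
        _exit(0); /* child: terminate immediately */

    /* Reap the child so that its PID no longer refers to any process. */
    if (waitpid(child, NULL, 0) < 0)
        return -1;

    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_SOFTWARE;        /* software event, no PMU needed */
    attr.config = PERF_COUNT_SW_CPU_CLOCK;
    attr.disabled = 1;
    attr.exclude_kernel = 1;               /* keep privilege requirements low */
    attr.exclude_hv = 1;

    /* glibc has no wrapper for perf_event_open, so use syscall(2). */
    long fd = syscall(SYS_perf_event_open, &attr, child, -1, -1, 0UL);
    if (fd >= 0)
    {
        close((int)fd);
        return 0;
    }
    return errno;
}
```

Called from a small main(), this is expected to return ESRCH on a system that permits unprivileged perf_event_open. On systems with a restrictive perf_event_paranoid setting or a seccomp filter, it may return EACCES or EPERM instead, before the PID is ever looked up.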

cvonelm commented 4 months ago

Fixed in 690ab12