Open flezaalv opened 5 months ago
Does the app pass without Unitrace? Yes, it does: without Unitrace the app finishes with a return status of 0.
Does it fail even with a smaller number of ranks?
I tested with mpiexec -n 2 -ppn 2 and got this error:
/run_mpi.sh: line 7: 169430 Segmentation fault (core dumped) python bin/sr.py
[INFO] Log is stored in /home/test10/results.169391.0.csv
[INFO] Timeline is stored in /home/test10/run_mpi.sh.169391.0.json
hostname: rank 0 exited with code 139
hostname: rank 1 died from signal 15
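For reference, the two statuses in the log above can be decoded with a few lines of Python. This is only a minimal sketch, assuming the usual POSIX shell convention that a status above 128 means "killed by signal (status - 128)"; `describe_exit` is a hypothetical helper name, not part of unitrace or mpiexec.

```python
import signal


def describe_exit(code: int) -> str:
    """Translate a shell-style exit status into a readable description.

    Assumes the POSIX shell convention: status = 128 + signal number
    when the process was terminated by a signal.
    """
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            return f"killed by unknown signal {signum}"
        return f"killed by {name} (signal {signum})"
    return f"exited normally with status {code}"


# Rank 0 in the log: "exited with code 139"
print(describe_exit(139))

# Rank 1 in the log: "died from signal 15"
print(signal.Signals(15).name)
```

On Linux this reports SIGSEGV for code 139 and SIGTERM for signal 15. That would be consistent with rank 0 crashing with the segmentation fault and rank 1 then being terminated, most likely by the launcher, although the log alone does not confirm who sent the SIGTERM.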
The run_mpi.sh script contains the entire app command. This is the mpiexec command with unitrace included:
mpiexec -n 2 -ppn 2 ~/pti-gpu/tools/unitrace/build/unitrace --separate-tiles --chrome-device-logging --ccl-summary-report --output-dir-path /home/test10/ --output /home/test10/results.csv ./run_mpi.sh
Sure, I will share more details with you.
Thanks!
I launched unitrace inside an mpiexec command:
mpiexec -n 12 -ppn 12 --pmi=pmix ~/pti-gpu/tools/unitrace/build/unitrace --separate-tiles --chrome-device-logging --ccl-summary-report --output-dir-path /home/test --output /home/test/test.csv python bin/sr.py
This runs on a single node and creates 12 processes, but when they finish I get this error from one process and the entire mpiexec run fails:
hostname: rank 0 died from signal 15
I also hit the unitrace error described in https://github.com/intel/pti-gpu/issues/25. Is that error the cause of the signal 15?