pmodels / pilgrim

Logger for MPI communication

Segfaults for `-np >= 2` when `PILGRIM_TIMING_MODE!=[AGGREGATED or ZSTD]` #36

Closed mhaseeb123 closed 1 year ago

mhaseeb123 commented 1 year ago

Hi,

I am encountering an abort at MPI process 0 when I set PILGRIM_TIMING_MODE to anything other than AGGREGATED or ZSTD and launch with two or more processes (mpirun/srun -np >= 2). Here is a log from running the sendall test.

# input 
export PILGRIM_DEBUG=0 # or 1; the result is the same
export LD_PRELOAD=/global/cfs/cdirs/m1759/mhaseeb/pilgrim/gnu/install/lib/libpilgrim.so
export PILGRIM_TIMING_MODE=LOSSLESS
export PILGRIM_TRACING=ON
srun -n 2 /global/cfs/cdirs/m1759/mhaseeb/pilgrim/pilgrim/test/mpi/pt2pt/sendall

# output 
 No Errors
MPI_Comm_size, 1, 1
MPI_Send, 15, 150
MPI_Comm_rank, 1, 2
MPI_Reduce, 1, 1
MPI_Wait, 1, 150
MPI_Barrier, 1, 165
MPI_Irecv, 16, 150
[pilgrim] Current memory usage: 19880, Peak memory usage: 48876
[pilgrim] Total mpi calls: 0.000000 *1e6
[pilgrim] CST inter-process compression time: 0.01
[pilgrim] CFG inter-process compression time: 0.00
[pilgrim] CST Size: 2.53KB, CFG Size: 0.00KB, Total: 2.53KB
srun: error: nid005880: task 0: Aborted
srun: Terminating StepId=11439790.0
slurmstepd: error: *** STEP 11439790.0 ON nid005880 CANCELLED AT 2023-07-10T22:20:47 ***
srun: error: nid005880: task 1: Terminated
srun: Force Terminated StepId=11439790.0

For HIST and CFG, "No Errors" is printed, but a segfault follows:

#input
export PILGRIM_DEBUG=1
export LD_PRELOAD=/global/cfs/cdirs/m1759/mhaseeb/pilgrim/gnu/install/lib/libpilgrim.so
export PILGRIM_TIMING_MODE=HIST # or CFG; same error for either of these
export PILGRIM_TRACING=ON
export DVS_MAXNODES=24
srun -n 2 /global/cfs/cdirs/m1759/mhaseeb/pilgrim/pilgrim/test/mpi/pt2pt/sendall

# output
 No Errors
srun: error: nid005100: task 1: Segmentation fault
srun: Terminating StepId=11441219.0
slurmstepd: error: *** STEP 11441219.0 ON nid005100 CANCELLED AT 2023-07-10T22:50:06 ***
srun: error: nid005100: task 0: Terminated
srun: Force Terminated StepId=11441219.0

Here is my configure command used when building pilgrim.

./configure --prefix=/global/cfs/cdirs/m1759/mhaseeb/pilgrim/gnu/install --enable-shared=yes \
    --enable-static=no --enable-debug=no --enable-tid=yes --enable-pointers=yes --enable-cuda=yes \
    CC=/path/to/mpicc CXX=/path/to/mpicxx

I would appreciate any help with this. Thank you!

wangvsa commented 1 year ago

Hi, I couldn't reproduce the issue on my side. All tests worked fine on my machines. Which machine and MPI implementation are you using?

One suggestion: instead of setting LD_PRELOAD globally, can you try setting it only for the srun command, i.e., `srun -n 2 --export=ALL,LD_PRELOAD=/global/cfs/cdirs/m1759/mhaseeb/pilgrim/gnu/install/lib/libpilgrim.so ...`?

Also, there's no need to set PILGRIM_TRACING if you are using the default tracing mode. This shouldn't cause the error though.
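
As a rough sketch of the suggestion above, LD_PRELOAD can be scoped to a single launch instead of being exported into the whole shell session. The helper name `with_preload` is hypothetical and is not part of Pilgrim or Slurm; it only wraps the standard `env` utility:

```shell
# Hypothetical helper (not part of Pilgrim or Slurm): run one command with
# LD_PRELOAD set only in that command's environment.
with_preload() {
    preload_lib="$1"   # path to the library to preload, e.g. libpilgrim.so
    shift              # the remaining arguments form the command to run
    env LD_PRELOAD="$preload_lib" "$@"
}
```

For example, `with_preload /path/to/libpilgrim.so srun -n 2 ./sendall` (placeholder paths) would leave the calling shell's own LD_PRELOAD unchanged, so commands started afterwards from the same shell are not affected by the preload.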

mhaseeb123 commented 1 year ago

Thank you for getting back.

I am running on the Perlmutter machine and am using cray-mpich/8.1.25 to compile. Let me try your suggestion of setting LD_PRELOAD within the srun command. We also have a couple of other MPI modules that I can try compiling the apps with.

wangvsa commented 1 year ago

mpich should be fine; we tested mostly with mpich. I do have access to Perlmutter, so I will test on it too.

mhaseeb123 commented 1 year ago

Thank you @wangvsa. Appreciate it!

wangvsa commented 1 year ago

I found the issue. It was caused by those global variables. I was wrong before: we cannot just add static to them. I have fixed the issue and also cleaned up the code to get rid of some compile warnings. Here's the PR: https://github.com/pmodels/pilgrim/pull/37

mhaseeb123 commented 1 year ago

Pilgrim does seem to be working on Perlmutter after locally merging PR #37. Thanks again for your prompt help.