mochi-hpc / mochi-ssg

Scalable Service Groups (SSG), a group membership service for Mochi

swim_apply_member_updates() segfault with 320 server group #62

carns opened this issue 2 years ago (status: Open)

carns commented 2 years ago

I'll have to try this again later with debugging symbols enabled, but I'm pretty consistently hitting this:

[h15n13:2901067] *** Process received signal ***
[h15n13:2901067] Signal: Segmentation fault (11)
[h15n13:2901067] Signal code: Address not mapped (1)
[h15n13:2901067] Failing at address: 0x38
[h15n13:2901067] [ 0] linux-vdso64.so.1(__kernel_sigtramp_rt64+0x0)[0x2000000504d8]
[h15n13:2901067] [ 1] /autofs/nccs-svm1_home1/carns/working/src/spack/var/spack/environments/mochi-quintain/.spack-env/view/lib/libssg.so.0(swim_apply_member_updates+0x1c)[0x200000117eac]
[h15n13:2901067] [ 2] /autofs/nccs-svm1_home1/carns/working/src/spack/var/spack/environments/mochi-quintain/.spack-env/view/lib/libssg.so.0(+0x18ac0)[0x200000118ac0]
[h15n13:2901067] [ 3] /autofs/nccs-svm1_home1/carns/working/src/spack/var/spack/environments/mochi-quintain/.spack-env/view/lib/libssg.so.0(+0x1a138)[0x20000011a138]
[h15n13:2901067] [ 4] /autofs/nccs-svm1_home1/carns/working/src/spack/var/spack/environments/mochi-quintain/.spack-env/view/lib/libssg.so.0(_wrapper_for_swim_dping_ack_recv_ult+0x5c)[0x20000011ad0c]
[h15n13:2901067] [ 5] /autofs/nccs-svm1_home1/carns/working/src/spack/var/spack/environments/mochi-quintain/.spack-env/view/lib/libabt.so.1(+0x1fb18)[0x2000002bfb18]
[h15n13:2901067] [ 6] /autofs/nccs-svm1_home1/carns/working/src/spack/var/spack/environments/mochi-quintain/.spack-env/view/lib/libabt.so.1(+0x200f4)[0x2000002c00f4]
[h15n13:2901067] *** End of error message ***

... when running the ssg-benchmarking example on Summit with this configuration (I'm not sure if the client portion of the test is a factor or if the server group encounters this on its own):

jsrun -l cpu-cpu -a 1 -b none -r 20 -n 320 -c 2 ${BIN_DIR}/ssg-launch-group -s $((600)) -f ssg-bench.out -n scale-test verbs://mlx5_0 mpi
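The failing address of 0x38 looks like a dereference through a pointer at or near NULL (a small struct-member offset), but that's just a guess until there are line numbers. As a note for the debug-symbol retry: frames [2] and [3] in the trace report offsets relative to libssg.so.0 itself, so with a -g build they should resolve with addr2line, and the symbol-relative frame [1] can be resolved in gdb. Something along these lines, with ${SSG_LIB} standing in for the spack view lib directory shown in the trace, and the offsets re-captured from a trace produced by the rebuilt library (they shift between builds):

addr2line -f -C -e ${SSG_LIB}/libssg.so.0 0x18ac0 0x1a138
gdb -batch -ex 'info line *(swim_apply_member_updates+0x1c)' ${SSG_LIB}/libssg.so.0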

carns commented 2 years ago

Actually, this may be related to FI_UNIVERSE_SIZE, which I did not set. I just realized that it defaults to 256, which is smaller than the 320-process server group in this test. I'll retry this scenario with a larger value.
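For anyone else trying this: FI_UNIVERSE_SIZE is just an environment variable read by libfabric, so it only needs to be exported in the batch script ahead of the jsrun line (jsrun should forward the environment to the launched tasks). The 2048 used here is simply a value comfortably above the total process count:

export FI_UNIVERSE_SIZE=2048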

carns commented 2 years ago

The problem persists with FI_UNIVERSE_SIZE set to 2048, which would be sufficient for all of the server and client processes in this test case.
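Next step is the debug-symbol rebuild mentioned at the top. A sketch of the plan, relying on Spack's generic per-spec compiler flags; the exact workflow for swapping the rebuilt library into the mochi-quintain environment's view may need adjusting:

spack install mochi-ssg cflags="-g -O0"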