mochi-hpc / mochi-ssg

Scalable Service Groups (SSG), a group membership service for Mochi

"SWIM dping req recv error" errors with 80 servers and 1 second period #61

Open carns opened 2 years ago

carns commented 2 years ago

On Summit (OLCF), when I have 80 servers create a group with a 1 second period, I get errors like these:

SWIM dping req recv error -- group 7706329477877535173 not found
SWIM dping req recv error -- group 7706329477877535173 not found
SWIM dping req recv error -- group 7706329477877535173 not found
SWIM dping req recv error -- group 7706329477877535173 not found
SWIM dping req recv error -- group 7706329477877535173 not found
SWIM dping req recv error -- group 7706329477877535173 not found
SWIM iping req recv error -- group 7706329477877535173 not found
SWIM iping req recv error -- group 7706329477877535173 not found
SWIM dping req recv error -- group 7706329477877535173 not found
SWIM iping req recv error -- group 7706329477877535173 not found
SWIM iping req recv error -- group 7706329477877535173 not found
SWIM dping req recv error -- group 7706329477877535173 not found
SWIM iping req recv error -- group 7706329477877535173 not found
SWIM dping req recv error -- group 7706329477877535173 not found
SWIM dping req recv error -- group 7706329477877535173 not found
SWIM iping req recv error -- group 7706329477877535173 not found
SWIM iping req recv error -- group 7706329477877535173 not found

I don't know with certainty whether these indicate critical errors or just excessively noisy output, because I had a bug in the script that was meant to use the group afterward. Based on the execution time, though, I think the group exited.
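One way to get a rough sense of how widespread the messages are is to tally them by ping type and group ID. The following is a minimal sketch, not part of SSG: the function name `tally_swim_errors` and the regular expression are assumptions based only on the message format shown above, and the counts alone do not say whether the errors are harmful.

```python
import re
import sys
from collections import Counter

# Assumed message format, taken from the output above, e.g.:
#   SWIM dping req recv error -- group 7706329477877535173 not found
# search() is used so that any per-rank prefix added by the launcher is tolerated.
LINE_RE = re.compile(
    r"SWIM (dping|iping) req recv error -- group (\d+) not found"
)


def tally_swim_errors(lines):
    """Count 'group not found' errors per (ping kind, group id)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


if __name__ == "__main__":
    # Read captured job output from stdin and print one summary line per key.
    for (kind, group_id), n in sorted(tally_swim_errors(sys.stdin).items()):
        print(f"{kind} group={group_id} count={n}")
```

Assuming the script is saved as `tally_swim_errors.py` and the job output was captured in a file such as `job.out` (both names hypothetical), it could be run as `python3 tally_swim_errors.py < job.out`. Comparing the totals between a 1 second and a 3 second period run would show whether the messages disappear entirely or only become less frequent.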

The problem was resolved by increasing the period length from 1 second to 3 seconds.

The command that triggered it looked like this:

jsrun -l cpu-cpu -a 1 -b none -r 20 -n 80 -c 2 ${BIN_DIR}/ssg-launch-group -s $((300)) -f ssg-bench.out -n scale-test verbs://mlx5_0 mpi 

And the example code can be found at https://github.com/mochi-hpc-experiments/ssg-benchmarking/blob/carns/dev-1rank-dump/ssg-launch-group.c.