mochi-hpc / mochi-ssg

Scalable Service Groups (SSG), a group membership service for Mochi

too many observers can crash ssg #22

Open shanedsnyder opened 3 years ago

shanedsnyder commented 3 years ago

In GitLab by @roblatham00 on Aug 12, 2020, 16:49

If one starts up a lot of ssg members on one node like this:

mpiexec -np 32 ./tests/ssg-launch-group -s 360 -f group.ssg sockets mpi &

and tries to observe that group with a small number of processes, things are ok:

./ssg-observe-group sockets build/group.ssg

If I try to observe with 64 processes, I get some errors:

SWIM dping ack recv error -- group 15324806640328145610 not found
SWIM dping req recv error -- group 15324806640328145610 not found
# HG -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-2.0.0rc1-qeng5ccan7pe4mgpwopu4cpaw6ftfcbz/spack-src/src/mercury_core.c:3748
 # HG_Core_registered_data(): Could not find RPC ID in function map
# HG -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-2.0.0rc1-qeng5ccan7pe4mgpwopu4cpaw6ftfcbz/spack-src/src/mercury.c:1368
 # HG_Registered_data(): Could not get registered data

On Slack, @shanedsnyder mentioned that the 'recv error' messages are spam, but they sure seem to indicate something bad is about to happen.

shanedsnyder commented 3 years ago

In GitLab by @roblatham00 on Aug 21, 2020, 12:16

This script often reproduces an error, but sometimes it just hangs:

#!/bin/bash 
#BSUB -P csc332
#BSUB -W 0:15
#BSUB -nnodes 4
#BSUB -step_cgroup n
#BSUB -alloc_flags "smt4 maximizegpfs"
#BSUB -J ssg-simple 

# home is read-only on compute nodes (but not batch?) so be sure to put output
# -- like this configuration file -- on  /gpfs/alpine
SSG_STATE=/gpfs/alpine/scratch/robl/csc332/ssg.out

echo " launching group"
# experiment: how many ssg members can we launch?
# in ten minutes we could iterate up to 64 providers per node. 
jsrun -n 2 -r 1 -a 64 -g ALL_GPUS -c ALL_CPUS \
    ./ssg-launch-group -s 600  -f ${SSG_STATE} -n scale-test verbs:// mpi >/dev/null &

# would like a better way to know if group has launched
sleep 60

jsrun -n 2 -r 1 -a 256 -g ALL_GPUS -c ALL_CPUS  ./ssg-observe-group  verbs:// ${SSG_STATE}
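
One possible alternative to the fixed `sleep 60` in the script above is to poll for the group file that `ssg-launch-group` is asked to write via `-f`. This is only a sketch: `wait_for_file` is a hypothetical helper, not part of SSG, and it assumes the file becomes non-empty only once the group is usable, which this thread does not confirm. The timeout and interval must be whole seconds.

```shell
#!/bin/bash
# Hypothetical helper (not part of SSG): poll until a file exists and is
# non-empty, or give up after a timeout.
# Usage: wait_for_file PATH [TIMEOUT_SECONDS] [POLL_INTERVAL_SECONDS]
wait_for_file() {
    local path=$1
    local timeout=${2:-120}   # whole seconds
    local interval=${3:-1}    # whole seconds
    local waited=0

    # -s is true only when the file exists and has a size greater than zero
    while [ ! -s "$path" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "timed out after ${waited}s waiting for $path" >&2
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}
```

In the batch script this would stand in for the sleep, e.g. `wait_for_file "${SSG_STATE}" 120 2 || exit 1`. A state file left over from an earlier run would satisfy the check immediately, so it is probably worth removing `${SSG_STATE}` before launching the group.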

Failure: no route to host (?)

SWIM dping req recv error -- group 7706329477877535173 not found
SWIM iping req recv error -- group 7706329477877535173 not found
# NA -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-2.0.0a1-6i7sgobp4cydd6teochj72hbb3s2iwki/spack-src/src/na/na_ofi.c:4041
 # na_ofi_msg_send_unexpected(): fi_tsend(unexpected) failed, rc: -113(No route to host)
# HG -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-2.0.0a1-6i7sgobp4cydd6teochj72hbb3s2iwki/spack-src/src/mercury_core.c:2057
 # hg_core_forward_na(): Could not post send for input buffer
# HG -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-2.0.0a1-6i7sgobp4cydd6teochj72hbb3s2iwki/spack-src/src/mercury_core.c:4718
 # HG_Core_forward(): Could not forward buffer
# HG -- Error -- /tmp/robl/spack-stage/spack-stage-mercury-2.0.0a1-6i7sgobp4cydd6teochj72hbb3s2iwki/spack-src/src/mercury.c:2092
 # HG_Forward(): Could not forward call

shanedsnyder commented 3 years ago

In GitLab by @shanedsnyder on Aug 24, 2020, 16:16

Thanks for the additional details. I can trigger both the hangs and the error messages that you shared in the issue. I can actually trigger both without even worrying about observers, just by trying to launch a 64-member group on a single node. I'll keep looking into what's leading to the issues and keep you posted.