mochi-hpc / mochi-bedrock

Mochi bootstrapping service.
https://mochi.readthedocs.io

Repeated query of bedrock fails with ssg errors #9

Closed roblatham00 closed 2 years ago

roblatham00 commented 2 years ago

Consider this excessively minimal JSON file:

{
        "__skeleton_comment__":"The only item in this file is a directive telling SSG to save its state to a known file so we can attach to it later",
        "ssg":[
                {"group_file":"skeleton.ssg"
                }
        ]
}

I can launch a bedrock server with it, relying on all the default pools and xstreams:

bedrock -c skeleton.json tcp &
[2022-04-19 16:28:43.920] [info] Bedrock daemon now running at ofi+tcp;ofi_rxm://172.21.105.239:46827

I can get back the fully populated json:

bedrock-query -p -s skeleton.ssg tcp > bedrock-skeleton.json

... but only once. The second time:

% bedrock-query -p -s skeleton.ssg tcp > bedrock-skeleton.json
Error: SSG unable to send group refresh request to target [HG_CHECKSUM_ERROR]
Error: SSG unable to send group refresh request to target [HG_CHECKSUM_ERROR]
Error: SSG unable to send group refresh request to target [HG_CHECKSUM_ERROR]
Error: SSG unable to send group refresh request to target [HG_CHECKSUM_ERROR]
Error: SSG unable to send group refresh request to target [HG_CHECKSUM_ERROR]
Error: SSG unable to send group refresh request to target [HG_CHECKSUM_ERROR]
Error: SSG unable to send group refresh request to target [HG_CHECKSUM_ERROR]
Error: SSG unable to send group refresh request to target [HG_CHECKSUM_ERROR]
Error: SSG unable to send group refresh request to target [HG_CHECKSUM_ERROR]
Error: SSG unable to send group refresh request to target [HG_CHECKSUM_ERROR]
Error: SSG exceeded max retries for refreshing group
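
To pin down which attempt first fails, the repeated query could be wrapped in a small loop. This is only a sketch: `first_failure` is a hypothetical helper name, and it assumes that `bedrock-query` exits with a non-zero status when the SSG refresh gives up, which nothing in this report confirms.

```shell
# first_failure N CMD [ARGS...]
# Hypothetical helper: runs CMD up to N times, discarding its output.
# Prints the 1-based index of the first attempt that exits non-zero,
# or "none" if all N attempts succeed.
first_failure() {
    n=$1
    shift
    i=1
    while [ "$i" -le "$n" ]; do
        if ! "$@" > /dev/null 2>&1; then
            echo "$i"
            return 1
        fi
        i=$((i + 1))
    done
    echo "none"
    return 0
}

# Example (assumes the bedrock server started above is still running):
# first_failure 5 bedrock-query -p -s skeleton.ssg tcp
```

With the behaviour described above, and if the exit-status assumption holds, the example call would be expected to print `2`.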

Environment:

% spack find --loaded
==> 15 loaded packages
-- linux-ubuntu21.10-icelake / gcc@11.2.0 -----------------------
argobots@1.1  libfabric@1.14.0    mochi-bedrock@main  mochi-thallium@0.9.1  openssl@1.1.1l
cereal@1.3.0  mercury@2.1.0       mochi-margo@0.9.7   mpich@master          spdlog@1.9.2
json-c@0.15   mochi-abt-io@0.5.1  mochi-ssg@0.5.2     nlohmann-json@3.10.4  tclap@1.2.2

mochi-bedrock is from 3 March... maybe I need to update it?

mdorier commented 2 years ago

I couldn't reproduce the issue. Could you try with bedrock 0.4.1 (the latest version)? There haven't been many changes in bedrock since March (at least nothing that should affect SSG).

The only differences in my environment are margo 0.9.8, thallium 0.10.1.

roblatham00 commented 2 years ago

Environment -- new bedrock, margo, thallium:

==> 20 installed packages
-- linux-ubuntu21.10-icelake / gcc@11.2.0 -----------------------
argobots@1.1     benvolio@main  json-c@0.15       mercury@2.1.0        mochi-margo@0.9.8      mpich@master          spdlog@1.9.2
autoconf@2.69    cereal@1.3.2   libfabric@1.14.0  mochi-abt-io@0.5.1   mochi-ssg@0.5.2        nlohmann-json@3.10.4  tclap@1.2.2
automake@1.16.3  fmt@8.0.1      libtool@2.4.6     mochi-bedrock@0.4.1  mochi-thallium@0.10.1  openssl@1.1.1l

No problems repeatedly querying bedrock if I use the 'sm' protocol.

Using tcp, first query works, second gives errors as in earlier report.

gdb tells me everyone is stuck in epoll_wait.

Re-ran with 'trace' debugging:

A working query emits these lines:

% bedrock-query -v trace -s skeleton.ssg tcp
[2022-04-20 14:44:45.725] [trace] Spawning ULT for ssg_group_refresh_recv_ult RPC (handle = 0x55eec57b1ff0)
[2022-04-20 14:44:45.725] [trace] Starting RPC ssg_group_refresh_recv_ult (handle = 0x55eec57b1ff0)
[2022-04-20 14:44:45.728] [trace] RPC ssg_group_refresh_recv_ult completed (handle = 0x55eec57b1ff0)
[2022-04-20 14:44:45.728] [trace] Spawning ULT for thallium_generic_rpc RPC (handle = 0x55eec5916f40)
[2022-04-20 14:44:45.728] [trace] Starting RPC thallium_generic_rpc (handle = 0x55eec5916f40)
[2022-04-20 14:44:45.729] [trace] RPC thallium_generic_rpc completed (handle = 0x55eec5916f40)

A second, non-working query gives me no bedrock traces at all. Perhaps it is an SSG bug?

mdorier commented 2 years ago

I would suggest you try a pure SSG code (there are examples on the Mochi readthedocs that should be pretty close to what you want to do).

mdorier commented 2 years ago

What variants did you use with mercury, libfabric, etc.?

roblatham00 commented 2 years ago

% spack find -fvl
==> 20 installed packages
-- linux-ubuntu21.10-icelake / gcc@11.2.0 -----------------------
wjx266f argobots@1.1%gcc ~affinity~debug+perf~stackunwind~tool~valgrind stackguard=none
tomvopl autoconf@2.69%gcc  patches=7793209b33013dc0f81208718c68440c5aae80e7a1c4b8d336e382525af791a7
uhrsewg automake@1.16.3%gcc
i3n4zab benvolio@main%gcc ~cray-drc+mpi~pmix
og4yenx cereal@1.3.2%gcc ~ipo build_type=RelWithDebInfo patches=2dfa0bff9816d0ebd8a1bcc70ced4483b3cda83a982ea5027f1aaadceaa15aac
ksvfp2i fmt@8.0.1%gcc ~ipo+pic~shared build_type=RelWithDebInfo cxxstd=11
h5xvhhs json-c@0.15%gcc ~ipo build_type=RelWithDebInfo
6amqpqz libfabric@1.14.0%gcc ~debug~disable-spinlocks~kdreg fabrics=sockets,tcp,udp
bij56oa libtool@2.4.6%gcc
q7vr7ut mercury@2.1.0%gcc ~bmi~boostsys~cci+checksum~debug~ipo~mpi+ofi+shared+sm~ucx~udreg build_type=RelWithDebInfo
fknn46s mochi-abt-io@0.5.1%gcc
tpaxgeg mochi-bedrock@0.4.1%gcc ~ipo~mpi build_type=RelWithDebInfo
tcd4otg mochi-margo@0.9.8%gcc ~pvar
3gdsjzx mochi-ssg@0.5.2%gcc ~drc+mpi~pmix~valgrind patches=f23321ff82fad59b2abb6bab5e6b017cabfcf5ef3a9910b93133799fe963c18a
m2ueydy mochi-thallium@0.10.1%gcc +cereal~ipo build_type=RelWithDebInfo
42gmcmk mpich@master%gcc ~argobots~benvolio+fortran+hwloc+hydra+libxml2+pci+romio~slurm~verbs+wrapperrpath device=ch4 netmod=ofi pmi=pmi
f4jugop nlohmann-json@3.10.4%gcc ~ipo~multiple_headers build_type=RelWithDebInfo
sxna2ke openssl@1.1.1l%gcc ~docs certs=system
imb7x3y spdlog@1.9.2%gcc ~ipo+shared build_type=RelWithDebInfo
quizsne tclap@1.2.2%gcc

mdorier commented 2 years ago

I usually have tcp,rxm as fabrics enabled for libfabric. Can you try with that?
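
One way to compare an existing install against this suggestion is to read the `fabrics=` variant from a `spack find -fvl` line. The snippet below is a sketch that treats the spec line as plain text; `has_fabric` is a hypothetical helper name, not part of spack.

```shell
# has_fabric SPEC_LINE FABRIC
# Hypothetical helper: succeeds if the comma-separated fabrics= variant
# in SPEC_LINE lists FABRIC, fails otherwise (including when no
# fabrics= variant is present).
has_fabric() {
    fabrics=$(printf '%s\n' "$1" | sed -n 's/.*fabrics=\([^ ]*\).*/\1/p')
    case ",$fabrics," in
        *,"$2",*) return 0 ;;
        *) return 1 ;;
    esac
}

# Spec line copied from the `spack find -fvl` output earlier in this thread.
spec='6amqpqz libfabric@1.14.0%gcc ~debug~disable-spinlocks~kdreg fabrics=sockets,tcp,udp'
if has_fabric "$spec" rxm; then
    echo "rxm enabled"
else
    echo "rxm missing"
fi
```

For the libfabric build listed in this thread the check prints `rxm missing`. Rebuilding with something like `spack install libfabric fabrics=tcp,rxm` would then match the configuration suggested here.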

roblatham00 commented 2 years ago

A minimal SSG program also has problems, so this appears to be something specific to my setup, not bedrock's fault. Closing.

mdorier commented 2 years ago

You might want to try just a simple margo program next. Maybe it's not even SSG...