mercury-hpc / mercury

Mercury is a C library for implementing RPC, optimized for HPC.
http://www.mcs.anl.gov/projects/mercury/
BSD 3-Clause "New" or "Revised" License

RPC client-server does not work between macos and linux #616

Open jspanchu opened 2 years ago

jspanchu commented 2 years ago

Describe the bug

The hello-world Thallium RPC example does not work in a heterogeneous environment (macOS + Linux). See hello-world. I modified the source to use the 'sockets' provider instead of TCP. I am posting this here because the error messages come from Mercury and possibly libfabric.

Run the server on mac:

~/hello-thallium  $ ./server
Server running at address ofi+sockets://10.50.58.248:39517
# [80739.928023] mercury->addr: [error] /var/folders/9p/hrppv1m97xs53_jcddyysq4x6t32tv/T/jaswant.panchumarti/spack-stage/spack-stage-mercury-master-nocsov6z3xrlrmlvlisfupujclikc2hu/spack-src/src/na/na_ofi.c:2431
 # na_ofi_addr_map_insert(): fi_av_insert() failed, inserted: 0
# [80739.928109] mercury->addr: [error] /var/folders/9p/hrppv1m97xs53_jcddyysq4x6t32tv/T/jaswant.panchumarti/spack-stage/spack-stage-mercury-master-nocsov6z3xrlrmlvlisfupujclikc2hu/spack-src/src/na/na_ofi.c:2320
 # na_ofi_addr_key_lookup(): Could not insert new address
# [80739.928120] mercury->addr: [error] /var/folders/9p/hrppv1m97xs53_jcddyysq4x6t32tv/T/jaswant.panchumarti/spack-stage/spack-stage-mercury-master-nocsov6z3xrlrmlvlisfupujclikc2hu/spack-src/src/na/na_ofi.c:4756
 # na_ofi_cq_process_recv_unexpected_event(): Could not lookup address
# [80739.928128] mercury->msg: [error] /var/folders/9p/hrppv1m97xs53_jcddyysq4x6t32tv/T/jaswant.panchumarti/spack-stage/spack-stage-mercury-master-nocsov6z3xrlrmlvlisfupujclikc2hu/spack-src/src/na/na_ofi.c:4680
 # na_ofi_cq_process_event(): Could not process unexpected recv event
# [80739.928156] mercury->hg: [error] /var/folders/9p/hrppv1m97xs53_jcddyysq4x6t32tv/T/jaswant.panchumarti/spack-stage/spack-stage-mercury-master-nocsov6z3xrlrmlvlisfupujclikc2hu/spack-src/src/mercury_core.c:3917
 # hg_core_progress_na(): Could not make progress on NA (NA_PROTOCOL_ERROR)
# [80739.928167] mercury->hg: [error] /var/folders/9p/hrppv1m97xs53_jcddyysq4x6t32tv/T/jaswant.panchumarti/spack-stage/spack-stage-mercury-master-nocsov6z3xrlrmlvlisfupujclikc2hu/spack-src/src/mercury_core.c:3809
 # hg_core_poll_wait(): hg_core_progress_na() failed
# [80739.928173] mercury->hg: [error] /var/folders/9p/hrppv1m97xs53_jcddyysq4x6t32tv/T/jaswant.panchumarti/spack-stage/spack-stage-mercury-master-nocsov6z3xrlrmlvlisfupujclikc2hu/spack-src/src/mercury_core.c:3708
 # hg_core_progress(): Could not make blocking progress on context
# [80739.928180] mercury->hg: [error] /var/folders/9p/hrppv1m97xs53_jcddyysq4x6t32tv/T/jaswant.panchumarti/spack-stage/spack-stage-mercury-master-nocsov6z3xrlrmlvlisfupujclikc2hu/spack-src/src/mercury_core.c:5077
 # HG_Core_progress(): Could not make progress
# [80739.928208] mercury->hg: [error] /var/folders/9p/hrppv1m97xs53_jcddyysq4x6t32tv/T/jaswant.panchumarti/spack-stage/spack-stage-mercury-master-nocsov6z3xrlrmlvlisfupujclikc2hu/spack-src/src/mercury.c:2074
 # HG_Progress(): Could not make progress on context (HG_PROTOCOL_ERROR)
[critical] unexpected return code (12: HG_PROTOCOL_ERROR) from HG_Progress()
Assertion failed: (0), function __margo_hg_progress_fn, file margo-core.c, line 1659.
zsh: abort      ./server

and client on Linux:

$ ./client ofi+sockets://10.50.58.248:39517

I get the same output with the client on macOS and the server on Linux.
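To rule out a mistyped address before blaming the transport, the string passed to the client can be checked first. The following is only a minimal sketch: `parse_address` and `ParsedAddress` are hypothetical names, not part of Thallium or Mercury, and the code assumes the `<protocol>://<host>:<port>` shape seen in the server output above. Other Mercury plugins may format addresses differently.

```cpp
// Hypothetical helper (not a Thallium or Mercury API): splits an address of
// the assumed form "<protocol>://<host>:<port>" into its three parts.
#include <cstddef>
#include <string>

struct ParsedAddress {
    std::string protocol;  // e.g. "ofi+sockets"
    std::string host;      // e.g. "10.50.58.248"
    int port = 0;          // e.g. 39517
};

// Returns true and fills `out` if `addr` matches the assumed shape,
// otherwise returns false and leaves `out` untouched.
bool parse_address(const std::string& addr, ParsedAddress& out) {
    const std::string sep = "://";
    const std::size_t sep_pos = addr.find(sep);
    if (sep_pos == std::string::npos || sep_pos == 0) return false;

    // The port follows the last ':'; it must come after a non-empty host.
    const std::size_t host_start = sep_pos + sep.size();
    const std::size_t colon = addr.rfind(':');
    if (colon == std::string::npos || colon <= host_start) return false;

    // Accept only 1 to 5 decimal digits, with a value no larger than 65535.
    const std::string port_str = addr.substr(colon + 1);
    if (port_str.empty() || port_str.size() > 5) return false;
    int port = 0;
    for (char c : port_str) {
        if (c < '0' || c > '9') return false;
        port = port * 10 + (c - '0');
    }
    if (port > 65535) return false;

    out.protocol = addr.substr(0, sep_pos);
    out.host = addr.substr(host_start, colon - host_start);
    out.port = port;
    return true;
}
```

Called on `ofi+sockets://10.50.58.248:39517`, this yields protocol `ofi+sockets`, host `10.50.58.248` and port `39517`. It only checks the string's shape; it says nothing about whether the two libfabric builds can actually talk to each other.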

To Reproduce

Steps to reproduce the behavior: on macOS, spack installs argobots@1.1, which crashes the server with a segmentation fault, so install argobots@main on both Linux and macOS with this command:

$ spack install mochi-thallium@develop^libfabric fabrics=tcp,rxm,sockets ^argobots@main
$ spack load mochi-thallium@develop

Compile

1. server.cpp

    // c++ --std=c++14 -o server server.cpp `pkg-config --cflags --libs thallium`
    #include <iostream>
    #include <thallium.hpp>

    namespace tl = thallium;

    void hello(const tl::request& req) {
        std::cout << "Hello World!" << std::endl;
    }

    int main(int argc, char** argv) {
        HG_Set_log_level("debug");
        tl::engine myEngine("sockets", THALLIUM_SERVER_MODE);
        myEngine.define("hello", hello).disable_response();
        std::cout << "Server running at address " << myEngine.self() << std::endl;

        return 0;
    }

2. client.cpp

```cpp
// c++ --std=c++14 -o client client.cpp `pkg-config --cflags --libs thallium`
#include <cstdlib>
#include <iostream>
#include <thallium.hpp>

namespace tl = thallium;

int main(int argc, char** argv) {

    if(argc != 2) {
        std::cerr << "Usage: " << argv[0] << " <address>" << std::endl;
        exit(0);
    }

    tl::engine myEngine("sockets", THALLIUM_CLIENT_MODE);
    tl::remote_procedure hello = myEngine.define("hello").disable_response();
    tl::endpoint server = myEngine.lookup(argv[1]);
    hello.on(server)();

    return 0;
}
```

Platforms:

- macOS: Monterey 12.5.1 on M1 with clang-13.1.6
- Linux: Ubuntu 22.04 with GCC 11.2.0

Here is the output of `spack spec mochi-thallium` on each platform.

# macOS
$ spack spec mochi-thallium 
Input spec
--------------------------------
mochi-thallium

Concretized
--------------------------------
mochi-thallium@develop%apple-clang@13.1.6+cereal~ipo build_type=RelWithDebInfo arch=darwin-monterey-m1
    ^cereal@1.3.2%apple-clang@13.1.6~ipo build_type=RelWithDebInfo patches=2dfa0bf arch=darwin-monterey-m1
        ^cmake@3.23.3%apple-clang@13.1.6~doc+ncurses+ownlibs~qt build_type=Release arch=darwin-monterey-m1
            ^ncurses@6.2%apple-clang@13.1.6~symlinks+termlib abi=none arch=darwin-monterey-m1
                ^gnuconfig@2021-08-14%apple-clang@13.1.6 arch=darwin-monterey-m1
                ^pkgconf@1.8.0%apple-clang@13.1.6 arch=darwin-monterey-m1
            ^openssl@1.1.1q%apple-clang@13.1.6~docs~shared certs=mozilla patches=3fdcf2d arch=darwin-monterey-m1
                ^ca-certificates-mozilla@2022-07-19%apple-clang@13.1.6 arch=darwin-monterey-m1
                ^perl@5.34.1%apple-clang@13.1.6+cpanm+shared+threads arch=darwin-monterey-m1
                    ^berkeley-db@18.1.40%apple-clang@13.1.6+cxx~docs+stl patches=b231fcc arch=darwin-monterey-m1
                    ^bzip2@1.0.8%apple-clang@13.1.6~debug~pic+shared arch=darwin-monterey-m1
                        ^diffutils@3.8%apple-clang@13.1.6 arch=darwin-monterey-m1
                            ^libiconv@1.16%apple-clang@13.1.6 libs=shared,static arch=darwin-monterey-m1
                    ^gdbm@1.19%apple-clang@13.1.6 arch=darwin-monterey-m1
                        ^readline@8.1.2%apple-clang@13.1.6 arch=darwin-monterey-m1
                    ^zlib@1.2.12%apple-clang@13.1.6+optimize+pic+shared patches=0d38234 arch=darwin-monterey-m1
    ^mochi-margo@develop%apple-clang@13.1.6~debug~pvar arch=darwin-monterey-m1
        ^argobots@main%apple-clang@13.1.6~affinity~debug~lazy_stack_alloc+perf~stackunwind~tool~valgrind stackguard=none arch=darwin-monterey-m1
            ^autoconf@2.69%apple-clang@13.1.6 patches=35c4492,7793209,a49dd5b arch=darwin-monterey-m1
                ^m4@1.4.19%apple-clang@13.1.6+sigsegv patches=9dc5fbd,bfdffa7 arch=darwin-monterey-m1
                    ^libsigsegv@2.13%apple-clang@13.1.6 arch=darwin-monterey-m1
            ^automake@1.16.5%apple-clang@13.1.6 arch=darwin-monterey-m1
            ^libtool@2.4.7%apple-clang@13.1.6 arch=darwin-monterey-m1
        ^json-c@0.16%apple-clang@13.1.6~ipo build_type=RelWithDebInfo arch=darwin-monterey-m1
        ^mercury@master%apple-clang@13.1.6~bmi+boostsys+checksum~debug~hwloc~ipo~mpi+ofi~psm~psm2+shared+sm~ucx~udreg build_type=RelWithDebInfo arch=darwin-monterey-m1
            ^boost@1.79.0%apple-clang@13.1.6+atomic+chrono~clanglibcpp~container~context~contract~coroutine+date_time~debug+exception~fiber+filesystem+graph~graph_parallel~icu+iostreams~json+locale+log+math~mpi+multithreaded~nowide~numpy~pic+program_options~python+random+regex+serialization+shared+signals~singlethreaded~stacktrace+system~taggedlayout+test+thread+timer~type_erasure~versionedlayout+wave cxxstd=98 patches=a440f96 visibility=hidden arch=darwin-monterey-m1
            ^libfabric@1.15.1%apple-clang@13.1.6~debug~disable-spinlocks~kdreg fabrics=rxm,sockets,tcp arch=darwin-monterey-m1

# Linux
$ spack spec mochi-thallium
Input spec
--------------------------------
mochi-thallium

Concretized
--------------------------------
mochi-thallium@develop%gcc@11.2.0+cereal~ipo build_type=RelWithDebInfo arch=linux-ubuntu22.04-icelake
    ^cereal@1.3.2%gcc@11.2.0~ipo build_type=RelWithDebInfo patches=2dfa0bf arch=linux-ubuntu22.04-icelake
        ^cmake@3.23.2%gcc@11.2.0~doc+ncurses+ownlibs~qt build_type=Release arch=linux-ubuntu22.04-icelake
            ^ncurses@6.2%gcc@11.2.0~symlinks+termlib abi=none arch=linux-ubuntu22.04-icelake
                ^pkgconf@1.8.0%gcc@11.2.0 arch=linux-ubuntu22.04-icelake
            ^openssl@1.1.1q%gcc@11.2.0~docs~shared certs=mozilla patches=3fdcf2d arch=linux-ubuntu22.04-icelake
                ^ca-certificates-mozilla@2022-03-29%gcc@11.2.0 arch=linux-ubuntu22.04-icelake
                ^perl@5.34.1%gcc@11.2.0+cpanm+shared+threads arch=linux-ubuntu22.04-icelake
                    ^berkeley-db@18.1.40%gcc@11.2.0+cxx~docs+stl patches=b231fcc arch=linux-ubuntu22.04-icelake
                    ^bzip2@1.0.8%gcc@11.2.0~debug~pic+shared arch=linux-ubuntu22.04-icelake
                        ^diffutils@3.8%gcc@11.2.0 arch=linux-ubuntu22.04-icelake
                            ^libiconv@1.16%gcc@11.2.0 libs=shared,static arch=linux-ubuntu22.04-icelake
                    ^gdbm@1.19%gcc@11.2.0 arch=linux-ubuntu22.04-icelake
                        ^readline@8.1.2%gcc@11.2.0 arch=linux-ubuntu22.04-icelake
                    ^zlib@1.2.12%gcc@11.2.0+optimize+pic+shared patches=0d38234 arch=linux-ubuntu22.04-icelake
    ^mochi-margo@develop%gcc@11.2.0~pvar arch=linux-ubuntu22.04-icelake
        ^argobots@main%gcc@11.2.0~affinity~debug~lazy_stack_alloc+perf~stackunwind~tool~valgrind stackguard=none arch=linux-ubuntu22.04-icelake
            ^autoconf@2.69%gcc@11.2.0 patches=35c4492,7793209,a49dd5b arch=linux-ubuntu22.04-icelake
                ^m4@1.4.19%gcc@11.2.0+sigsegv patches=9dc5fbd,bfdffa7 arch=linux-ubuntu22.04-icelake
                    ^libsigsegv@2.13%gcc@11.2.0 arch=linux-ubuntu22.04-icelake
            ^automake@1.16.5%gcc@11.2.0 arch=linux-ubuntu22.04-icelake
            ^libtool@2.4.7%gcc@11.2.0 arch=linux-ubuntu22.04-icelake
        ^json-c@0.15%gcc@11.2.0~ipo build_type=RelWithDebInfo arch=linux-ubuntu22.04-icelake
        ^mercury@master%gcc@11.2.0~bmi+boostsys+checksum~debug~hwloc~ipo~mpi+ofi~psm~psm2+shared+sm~ucx~udreg build_type=RelWithDebInfo arch=linux-ubuntu22.04-icelake
            ^boost@1.79.0%gcc@11.2.0+atomic+chrono~clanglibcpp~container~context~contract~coroutine+date_time~debug+exception~fiber+filesystem+graph~graph_parallel~icu+iostreams~json+locale+log+math~mpi+multithreaded~nowide~numpy~pic+program_options~python+random+regex+serialization+shared+signals~singlethreaded~stacktrace+system~taggedlayout+test+thread+timer~type_erasure~versionedlayout+wave cxxstd=98 patches=a440f96 visibility=hidden arch=linux-ubuntu22.04-icelake
            ^libfabric@1.15.1%gcc@11.2.0~debug~disable-spinlocks~kdreg fabrics=rxm,sockets,tcp arch=linux-ubuntu22.04-icelake

soumagne commented 1 year ago

We should investigate what the right solution is for this now, as anything that uses OFI's sockets provider will be unsupported.