mercury-hpc / mercury

Mercury is a C library for implementing RPC, optimized for HPC.
http://www.mcs.anl.gov/projects/mercury/
BSD 3-Clause "New" or "Revised" License

Getting `HG_FAULT` when performing RDMA on data living in CUDA memory #664

Closed thomas-bouvier closed 2 months ago

thomas-bouvier commented 1 year ago

Hello :)

Describe the bug

I'm trying to use RDMA to transfer a remote CPU variable to a local variable living in CUDA memory. First of all, is that use case supported? More generally, are the following scenarios supported:

If the latter scenario is not supported, then this issue is irrelevant.

This is the error I'm getting:

[1,0]<stderr>:libfabric:57702:1678406492::verbs:mr:ofi_mr_cache_search():334<debug> search 0x40fe8e00000 (len: 602112)
[1,0]<stderr>:libfabric:57702:1678406492::verbs:mr:util_mr_cache_create():266<debug> create 0x40fe8e00000 (len: 602112)
[1,0]<stderr>:libfabric:57702:1678406492::verbs:mr:util_mr_free_entry():107<debug> free 0x40fe8e00000 (len: 602112)
[1,0]<stderr>:libfabric:57702:1678406492::ofi_rxm:domain:rxm_mr_regattr():445<warn> Unable to register MSG MR
[1,0]<stderr>:Function returned HG_FAULT
[1,0]<stderr>:terminate called after throwing an instance of 'thallium::margo_exception'
[1,0]<stderr>:  what():  [/mnt/spack/linux-debian11-broadwell/gcc-10.2.1/mochi-thallium-main-wppyqz2fdrp24omimgaom4fiau7fotop/include/thallium/engine.hpp:1132][margo_bulk_create] Function returned HG_FAULT

I initialized Mercury with device memory support, and MOFED is installed on the machines I'm using. I've tested on a DGX-1 cluster (part of the Grid'5000 testbed) and on a Cooley node: both experiments yield the same error.

To Reproduce

This example uses the Thallium API. I can try to rewrite it with the raw Mercury API if needed.

The remote variable is an array of increasing integers stored on the CPU. The local variable is an array of the same size containing zeros and stored in CUDA memory. At the end of the program, I expect devArray to contain {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}.

I eventually move devArray back to the CPU in order to print it (the hostArray variable). The program never reaches that line, though, throwing HG_FAULT before that.

#include <cstring>      // memset
#include <functional>
#include <iostream>
#include <utility>
#include <vector>

#include <cuda_runtime.h>
#include <thallium.hpp>

namespace tl = thallium;

int main(int argc, char** argv) {
    // Initialize Mercury/Thallium with device memory support requested.
    struct hg_init_info hii;
    memset(&hii, 0, sizeof(hii));
    hii.na_init_info.request_mem_device = true;
    tl::engine myEngine("verbs", THALLIUM_SERVER_MODE, true, 1, &hii);

    std::function<void(const tl::request&, tl::bulk&)> f =
        [&myEngine](const tl::request& req, tl::bulk& b) {
            int myArray[10] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
            std::vector<std::pair<void*, std::size_t>> segments;
            segments.emplace_back(&myArray[0], 10 * sizeof(int));

            tl::bulk bulk = myEngine.expose(segments, tl::bulk_mode::read_only);
            bulk >> b.on(req.get_endpoint());

            req.respond();
            myEngine.finalize();
        };
    myEngine.define("do_rdma", f);

    tl::remote_procedure remote_do_rdma = myEngine.define("do_rdma");
    tl::endpoint server_endpoint = myEngine.lookup(myEngine.self());

    int myArray[10] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
    int* devArray;
    cudaMalloc((void**) &devArray, 10 * sizeof(int));
    cudaMemcpy(devArray, myArray, 10 * sizeof(int), cudaMemcpyHostToDevice);
    std::vector<std::pair<void*, std::size_t>> segments;
    segments.emplace_back(&devArray[0], 10 * sizeof(int));

    // Tell Mercury that the exposed segment lives in CUDA device memory.
    struct hg_bulk_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.mem_type = (hg_mem_type_t) HG_MEM_TYPE_CUDA;
    attr.device = 0;

    tl::bulk local_bulk = myEngine.expose(segments, tl::bulk_mode::write_only, attr);
    remote_do_rdma.on(server_endpoint)(local_bulk);

    // Displaying the array which should have been modified
    int hostArray[10];
    cudaMemcpy(hostArray, devArray, 10 * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 10; ++i) {
        if (i != hostArray[i])
            std::cout << "Not working!" << std::endl;
    }
    std::cout << "done" << std::endl;

    cudaFree(devArray);
    return 0;
}
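As an aside: the reproducer above does not check CUDA return codes. A minimal standalone helper like the following (the macro name is mine, not from the thread) can rule out a silent allocation failure before blaming the registration/RDMA path:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper, not part of the reproducer: abort with a message
// if a CUDA runtime call fails, so allocation errors are not mistaken
// for memory-registration errors.
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            std::fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                         cudaGetErrorString(err_), __FILE__, __LINE__);    \
            std::exit(EXIT_FAILURE);                                       \
        }                                                                  \
    } while (0)

int main() {
    int* devArray = nullptr;
    CUDA_CHECK(cudaMalloc((void**) &devArray, 10 * sizeof(int)));
    CUDA_CHECK(cudaFree(devArray));
    std::printf("CUDA allocation ok\n");
    return 0;
}
```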

Platform (please complete the following information):

 Input spec
--------------------------------
mochi-thallium@main
    ^argobots
    ^libfabric+cuda fabrics=rxm,tcp,verbs
    ^mercury@git.d34cec49a18621c64b2be2bfce69d5b3e988ceab~boostsys~checksum+ofi

Additional context

Here are some additional logs, collected with FI_LOG_LEVEL=debug HG_LOG_LEVEL=debug HG_SUBSYS_LOG=na:

DGX-1 cluster gemini.txt

Cooley cooley.txt

Please note that I also noticed these lines on Cooley:

libfabric:12942:1678470176:ofi_rxm:verbs:core:vrb_check_hints():241<info> Supported: FI_MSG, FI_RMA, FI_ATOMIC, FI_READ, FI_WRITE, FI_RECV, FI_SEND, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_LOCAL_COMM, FI_REMOTE_COMM
libfabric:12942:1678470176:ofi_rxm:verbs:core:vrb_check_hints():241<info> Requested: FI_MSG, FI_RMA, FI_READ, FI_RECV, FI_SEND, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_HMEM

Thank you!

soumagne commented 1 year ago

All 3 use cases should be supported as far as I know. The error indicates a memory registration issue, not a transfer issue. Can you try again with the MR cache monitor turned off (FI_MR_CACHE_MONITOR=disabled), or with the CUDA cache monitor turned on (FI_MR_CUDA_CACHE_MONITOR_ENABLED=1)? Also, have you verified that your CUDA device ID is 0?
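For anyone trying the same knobs, a minimal way to apply them before launching the reproducer (./repro is a placeholder binary name, not from the thread):

```shell
# Environment variables suggested above; ./repro is a hypothetical binary name.
export FI_MR_CACHE_MONITOR=disabled          # turn off the MR cache monitor
export FI_MR_CUDA_CACHE_MONITOR_ENABLED=1    # or: turn on the CUDA cache monitor
env | grep '^FI_MR'                          # confirm the variables are set
# ./repro
```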

thomas-bouvier commented 1 year ago

Thank you for your answer!

Unfortunately, disabling the MR cache with FI_MR_CACHE_MAX_COUNT=0 didn't change anything. The issue seems to be caused by something else. gemini_mr_disabled.txt

I didn't spot any major difference with FI_MR_CUDA_CACHE_MONITOR_ENABLED=1 either. gemini_cuda_monitor_enabled.txt

I don't quite understand what the device ID refers to. There are 8 GPUs on the DGX-1 cluster I'm using, with IDs [0-7], so I guess 0 should work?
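If it helps, the device ID in hg_bulk_attr should be the CUDA runtime device index; a quick standalone check (my own sketch, not part of the reproducer) to confirm which index the calling thread is bound to:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Standalone check: list the number of visible CUDA devices and the index of
// the device the calling host thread is currently bound to, to confirm the
// value that should go into hg_bulk_attr::device.
int main() {
    int count = 0, current = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::fprintf(stderr, "no CUDA devices visible\n");
        return 1;
    }
    cudaGetDevice(&current);
    std::printf("visible devices: %d, current device: %d\n", count, current);
    return 0;
}
```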

I ran my reproducer on another machine where it works (Theta). I'm attaching the logs below. The first line that is different from the gemini logs above is L2043, where cuda_mm_subscribe() is called. This results in the following: Assigned CUDA buffer ID 26632 to buffer 0x7ff144400000. The corresponding line on gemini (my non-working setup) is L721, and I don't see any buffer ID being assigned anywhere from there. theta.txt

The mystery remains...

soumagne commented 2 months ago

Closing for now, please re-open the libfabric issue if needed.