ofiwg / libfabric

Open Fabric Interfaces
http://libfabric.org/

prov/verbs: not able to use CUDA+verbs, CUDA buffer not assigned? #8672

Closed thomas-bouvier closed 6 months ago

thomas-bouvier commented 1 year ago

Hello :)

Describe the bug

For additional context, please see that issue.

I'm using libfabric through mercury. I'm trying to transfer a remote CPU variable into a local CUDA variable using the verbs provider. This fails on one of the systems I'm using.

To Reproduce

My reproducer (leveraging the Thallium API) triggers the issue on one machine and works perfectly on another. The code itself should not be the problem; the logs below are probably more insightful.

#include <cstring>     // memset
#include <functional>
#include <iostream>
#include <utility>
#include <vector>

#include <cuda_runtime.h>
#include <thallium.hpp>

namespace tl = thallium;

int main(int argc, char** argv) {
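    // Ask Mercury/NA to enable device (CUDA) memory registration.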
    struct hg_init_info hii;
    memset(&hii, 0, sizeof(hii));
    hii.na_init_info.request_mem_device = true;
    tl::engine myEngine("verbs", THALLIUM_SERVER_MODE, true, 1, &hii);

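    // RPC handler: expose a host-side array and push it into the caller's bulk.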
    std::function<void(const tl::request&, tl::bulk&)> f =
        [&myEngine](const tl::request& req, tl::bulk& b) {
            int myArray[10] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
            std::vector<std::pair<void*, std::size_t>> segments;
            segments.emplace_back(&myArray[0], 10 * sizeof(int));

            tl::bulk bulk = myEngine.expose(segments, tl::bulk_mode::read_only);
            bulk >> b.on(req.get_endpoint());

            req.respond();
            myEngine.finalize();
        };
    myEngine.define("do_rdma", f);

    tl::remote_procedure remote_do_rdma = myEngine.define("do_rdma");
    tl::endpoint server_endpoint = myEngine.lookup(myEngine.self());

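    // Destination buffer: CUDA device memory, initialized to zeros.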
    int myArray[10] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
    int* devArray;
    cudaMalloc((void**) &devArray, 10 * sizeof(int));
    cudaMemcpy(devArray, myArray, 10 * sizeof(int), cudaMemcpyHostToDevice);
    std::vector<std::pair<void*, std::size_t>> segments;
    segments.emplace_back(&devArray[0], 10 * sizeof(int));

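    // Tell Mercury that the exposed segment is CUDA device memory on device 0.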
    struct hg_bulk_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.mem_type = (hg_mem_type_t) HG_MEM_TYPE_CUDA;
    attr.device = 0;

    tl::bulk local_bulk = myEngine.expose(segments, tl::bulk_mode::write_only, attr);
    remote_do_rdma.on(server_endpoint)(local_bulk);

    // Displaying the array which should have been modified
    int hostArray[10];
    cudaMemcpy(hostArray, devArray, 10 * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 10; ++i) {
        if (i != hostArray[i])
            std::cout << "Not working!" << std::endl;
    }
    std::cout << "done" << std::endl;

    cudaFree(devArray);
    return 0;
}

I suspect the issue is caused by something misconfigured elsewhere on my system, or by a missing driver. MOFED is installed.

Output

I ran my reproducer on two machines: the output on gemini is incorrect (it has the issue), while the output on theta is correct (the reproducer runs fine). Full outputs: gemini.txt, theta.txt

gemini, around line 721 (has the issue):

libfabric:30632:1678465850::ofi_rxm:core:rxm_ep_setopt():502<info> FI_OPT_MIN_MULTI_RECV set to 4096
libfabric:30632:1678465850::ofi_rxm:ep_ctrl:ofi_wait_add_fid():836<debug> Given fid (0x558b63aa7560) already added to wait list - 0x558b63ad02b0
libfabric:30632:1678465850::ofi_rxm:ep_ctrl:ofi_wait_add_fid():836<debug> Given fid (0x558b63accdb0) already added to wait list - 0x558b63ad02b0
libfabric:30632:1678465850::verbs:ep_ctrl:vrb_pep_listen():524<info> listening on: fi_sockaddr_in://10.20.0.34:38570
libfabric:30632:1678465850::verbs:mr:ofi_mr_cache_search():334<debug> search 0x7f5b6c348000 (len: 1050664)
libfabric:30632:1678465850::verbs:mr:util_mr_cache_create():266<debug> create 0x7f5b6c348000 (len: 1050664)
libfabric:30632:1678465850::core:mr:ofi_monitor_subscribe():443<debug> subscribing addr=0x7f5b6c348000 len=1050664
libfabric:30632:1678465850::verbs:mr:ofi_mr_cache_search():334<debug> search 0x7f5b6c246000 (len: 1050664)
libfabric:30632:1678465850::verbs:mr:util_mr_cache_create():266<debug> create 0x7f5b6c246000 (len: 1050664)
libfabric:30632:1678465850::core:mr:ofi_monitor_subscribe():443<debug> subscribing addr=0x7f5b6c246000 len=1050664
libfabric:30632:1678465850::verbs:mr:ofi_mr_cache_search():334<debug> search 0x7f5b6c144000 (len: 1050664)
libfabric:30632:1678465850::verbs:mr:util_mr_cache_create():266<debug> create 0x7f5b6c144000 (len: 1050664)
libfabric:30632:1678465850::core:mr:ofi_monitor_subscribe():443<debug> subscribing addr=0x7f5b6c144000 len=1050664
libfabric:30632:1678465850::verbs:mr:ofi_mr_cache_search():334<debug> search 0x7f5b6c042000 (len: 1050664)
libfabric:30632:1678465850::verbs:mr:util_mr_cache_create():266<debug> create 0x7f5b6c042000 (len: 1050664)
libfabric:30632:1678465850::core:mr:ofi_monitor_subscribe():443<debug> subscribing addr=0x7f5b6c042000 len=1050664
libfabric:30632:1678465850::ofi_rxm:av:ofi_ip_av_insertv():654<debug> inserting 1 addresses
libfabric:30632:1678465850::ofi_rxm:av:ofi_av_insert_addr():291<info> inserting addr
: fi_sockaddr_in://10.20.0.34:38570
libfabric:30632:1678465850::ofi_rxm:av:ofi_av_insert_addr():314<info> fi_addr: 0
libfabric:30632:1678465850::ofi_rxm:av:ip_av_insert_addr():624<debug> av_insert addr: fi_sockaddr_in://10.20.0.34:38570
libfabric:30632:1678465850::ofi_rxm:av:ip_av_insert_addr():626<debug> av_insert fi_addr: 0
libfabric:30632:1678465850::ofi_rxm:av:ofi_ip_av_insertv():672<debug> 1 addresses successful
libfabric:30632:1678465850::verbs:mr:ofi_mr_cache_search():334<debug> search 0x7f5b651fc000 (len: 2097152)
libfabric:30632:1678465850::verbs:mr:util_mr_cache_create():266<debug> create 0x7f5b651fc000 (len: 2097152)
libfabric:30632:1678465850::core:mr:ofi_monitor_subscribe():443<debug> subscribing addr=0x7f5b651fc000 len=2097152
libfabric:30632:1678465850::verbs:mr:ofi_mr_cache_search():334<debug> search 0x7f5b64ffa000 (len: 2097152)
libfabric:30632:1678465850::verbs:mr:util_mr_cache_create():266<debug> create 0x7f5b64ffa000 (len: 2097152)
libfabric:30632:1678465850::core:mr:ofi_monitor_subscribe():443<debug> subscribing addr=0x7f5b64ffa000 len=2097152
libfabric:30632:1678465850::verbs:mr:ofi_mr_cache_search():334<debug> search 0x7f5b64df8000 (len: 2097152)
libfabric:30632:1678465850::verbs:mr:util_mr_cache_create():266<debug> create 0x7f5b64df8000 (len: 2097152)
libfabric:30632:1678465850::core:mr:ofi_monitor_subscribe():443<debug> subscribing addr=0x7f5b64df8000 len=2097152
libfabric:30632:1678465850::verbs:mr:ofi_mr_cache_search():334<debug> search 0x7f5b64bf6000 (len: 2097152)
libfabric:30632:1678465850::verbs:mr:util_mr_cache_create():266<debug> create 0x7f5b64bf6000 (len: 2097152)
libfabric:30632:1678465850::core:mr:ofi_monitor_subscribe():443<debug> subscribing addr=0x7f5b64bf6000 len=2097152
libfabric:30632:1678465850::verbs:mr:ofi_mr_cache_search():334<debug> search 0x7f5b600ef000 (len: 1050664)
libfabric:30632:1678465850::verbs:mr:util_mr_cache_create():266<debug> create 0x7f5b600ef000 (len: 1050664)
libfabric:30632:1678465850::core:mr:ofi_monitor_subscribe():443<debug> subscribing addr=0x7f5b600ef000 len=1050664
libfabric:30632:1678465851::verbs:mr:ofi_mr_cache_search():334<debug> search 0x7f5b17400000 (len: 40)
libfabric:30632:1678465851::verbs:mr:util_mr_cache_create():266<debug> create 0x7f5b17400000 (len: 40)
libfabric:30632:1678465851::verbs:mr:util_mr_free_entry():107<debug> free 0x7f5b17400000 (len: 40)
libfabric:30632:1678465851::ofi_rxm:domain:rxm_mr_regattr():445<warn> Unable to register MSG MR

theta, around line 2043 (working fine):

libfabric:3173748:1679090443::ofi_rxm:ep_ctrl:ofi_wait_add_fid():836<debug> Given fid (0x55b077097910) already added to wait list - 0x55b07713bb30 
libfabric:3173748:1679090443::ofi_rxm:ep_ctrl:ofi_wait_add_fid():836<debug> Given fid (0x55b07713dab0) already added to wait list - 0x55b07713bb30 
libfabric:3173748:1679090443::verbs:ep_ctrl:vrb_pep_listen():524<info> listening on: fi_sockaddr_in://172.23.2.194:40450
libfabric:3173748:1679090443::ofi_rxm:av:ofi_ip_av_insertv():654<debug> inserting 1 addresses
libfabric:3173748:1679090443::ofi_rxm:av:ofi_av_insert_addr():291<info> inserting addr
: fi_sockaddr_in://172.23.2.194:40450
libfabric:3173748:1679090443::ofi_rxm:av:ofi_av_insert_addr():314<info> fi_addr: 0
libfabric:3173748:1679090443::ofi_rxm:av:ip_av_insert_addr():624<debug> av_insert addr: fi_sockaddr_in://172.23.2.194:40450
libfabric:3173748:1679090443::ofi_rxm:av:ip_av_insert_addr():626<debug> av_insert fi_addr: 0
libfabric:3173748:1679090443::ofi_rxm:av:ofi_ip_av_insertv():672<debug> 1 addresses successful
libfabric:3173748:1679090443::verbs:mr:ofi_mr_cache_search():334<debug> search 0x7ff15defd000 (len: 1050664)
libfabric:3173748:1679090443::verbs:mr:util_mr_cache_create():266<debug> create 0x7ff15defd000 (len: 1050664)
libfabric:3173748:1679090443::core:mr:ofi_monitor_subscribe():443<debug> subscribing addr=0x7ff15defd000 len=1050664
libfabric:3173748:1679090444::verbs:mr:ofi_mr_cache_search():334<debug> search 0x7ff144400000 (len: 40)
libfabric:3173748:1679090444::verbs:mr:util_mr_cache_create():266<debug> create 0x7ff144400000 (len: 40)
libfabric:3173748:1679090444::core:mr:ofi_monitor_subscribe():443<debug> subscribing addr=0x7ff144400000 len=40
libfabric:3173748:1679090444::core:mr:cuda_mm_subscribe():48<debug> Assigned CUDA buffer ID 26632 to buffer 0x7ff144400000
libfabric:3173748:1679090444::ofi_rxm:ep_ctrl:rxm_alloc_conn():422<debug> allocated conn 0x55b0a44e7db8
libfabric:3173748:1679090444::ofi_rxm:ep_ctrl:rxm_send_connect():286<debug> connecting 0x55b0a44e7db8
libfabric:3173748:1679090444::ofi_rxm:ep_ctrl:rxm_open_conn():184<debug> open msg ep 0x55b0a44e7db8
libfabric:3173748:1679090444::verbs:fabric:vrb_open_ep():1131<debug> open_ep src addr: fi_sockaddr_in://172.23.2.194:0
libfabric:3173748:1679090444::verbs:fabric:vrb_open_ep():1134<debug> open_ep dest addr: fi_sockaddr_in://172.23.2.194:40450

What could cause cuda_mm_subscribe() not to assign a buffer ID in the gemini output?
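To rule out a problem with the device pointer itself, a minimal check with cudaPointerGetAttributes (a sketch assuming CUDA 10 or newer, where cudaPointerAttributes exposes a type field; the helper check_device_pointer() is not part of the reproducer) can confirm that devArray really is device memory before it is exposed:

#include <cstdio>
#include <cuda_runtime.h>

// Diagnostic helper: report whether a pointer is CUDA device memory.
void check_device_pointer(const void* ptr) {
    cudaPointerAttributes attr;
    cudaError_t err = cudaPointerGetAttributes(&attr, ptr);
    if (err != cudaSuccess) {
        std::printf("cudaPointerGetAttributes failed: %s\n", cudaGetErrorString(err));
        return;
    }
    // cudaMemoryTypeDevice == 2
    std::printf("memory type = %d, device = %d\n", (int) attr.type, attr.device);
}

If this did not report device memory for devArray, the problem would be on the CUDA side rather than in libfabric.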

Disabling the MR cache with FI_MR_CACHE_MAX_COUNT=0 didn't change anything, so the issue seems to be caused by something else. gemini_mr_disabled.txt

Environment:

MOFED drivers are installed on both systems.

Input spec
--------------------------------
mochi-thallium@main
   ^argobots
   ^libfabric@1.16.1 +cuda fabrics=rxm,tcp,verbs
   ^mercury@2.2.0 ~boostsys~checksum+ofi

Thanks!

github-actions[bot] commented 6 months ago

This issue is stale because it has been open 360 days with no activity. Remove stale label or comment, otherwise it will be closed in 7 days.