openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

ucp active messages take an excessive number of ucp_worker_progress calls #4036

Open tjcw opened 4 years ago

tjcw commented 4 years ago

I have a test case which calls MPI_Ibcast to do a nonblocking broadcast of a 32MB buffer among 4 ranks on 4 nodes, over InfiniBand. I wait for completion of the MPI_Ibcast with a loop which does some floating-point work and then calls MPI_Testall. I am using an IBM proprietary collectives library which works by sending active messages and which can run over either IBM PAMI or UCP. Compared with the IBM PAMI implementation, the UCP implementation requires about 10x as many calls to 'advance' (ucp_worker_progress), which results in a 10x longer elapsed time to complete the broadcasts. I am running with a 'release' build of UCX from the current master branch. Is UCP breaking the buffer into short segments, and requiring a call to ucp_worker_progress for each segment? The attached run_abbrev.log is a trace of the run; if you search for 'advancecount' you will see a count of the number of calls to ucp_worker_progress at each stage.

snyjm-18 commented 4 years ago

I agree that seems like an excessive number of calls. Are you checking the return value of ucp_worker_progress?

tjcw commented 4 years ago

I'm currently ignoring the return value. What would you like me to do with the return value ?

snyjm-18 commented 4 years ago

The return value indicates how much work it did. I believe if it returns 0, no work was done, so the send has completed and there is no reason to keep calling it. @yosefe @shamisp please confirm.

tjcw commented 4 years ago

I am now reporting how many times ucp_worker_progress comes back with a non-zero return code; see 'progressedCount' in the new trace file run_abbrev_2.log. The 'progressedCount' values are much lower than the 'advancecount' values. The application keeps calling ucp_worker_progress (each time after doing its floating-point work) until the MPI_Ibcast completes; even if one call to ucp_worker_progress returns zero, a transfer can complete on the InfiniBand (incoming or outgoing) which will enable a subsequent ucp_worker_progress call to make progress. So I still don't know why IBM PAMI requires so many fewer calls to 'advance' to complete the MPI_Ibcast. Some of this may be because IBM PAMI establishes all communication endpoints at startup, whereas my implementation on UCX defers establishing communication endpoints until the first time a rank needs to send an active message to another rank; but this is not the whole story.

snyjm-18 commented 4 years ago

I'm interested in where the time is going. It doesn't seem like it's on the send side, so it may be on the receive side. If it's on the receive side, it may be related to your other open issue.

tjcw commented 4 years ago

Here is the trace from the same test but using IBM PAMI: run_pami_abbrev.log. I haven't annotated it with 'advance' counts because in this case progress is done by a different method which makes it difficult to count. The trace shows that transfers over the IB are being done in 32MB units, i.e. the application broadcast buffer is being transferred as a single chunk. In the UCX run, are the transfers done with the 'iovec' as a single chunk, or is it split into smaller blocks with a ucp_worker_progress call needed for each block?

snyjm-18 commented 4 years ago

The iovec will not be transferred as one chunk because 32MB won't fit in a single packet. It should not require a ucp_worker_progress call for each packet; if it did, the number of non-zero ucp_worker_progress calls would be much higher than those in run_abbrev_2.log. I am not 100% sure about this, so @yosefe or @shamisp should confirm.

yosefe commented 4 years ago

> By comparison with the IBM PAMI implementation, the UCP implementation requires about 10x as many calls to 'advance' (ucp_worker_progress). This results in a 10x longer elapsed time to complete the broadcasts. I am running with a 'release' build of UCX from the current master branch.

Each ucp_worker_progress call processes a single packet, and 32MB would be split into ~4000 packets of 8k over a typical IB transport. However, the mere overhead of the progress function call is very small compared to the time it takes to receive an 8k message, so I highly doubt this is the root cause. If the frequency of calling progress/advance is too low (that is, a better overlap between communication and computation is required), I'd suggest implementing large UCP active messages over RDMA, or using the tagged API for the collectives library broadcast. Or, call ucp_worker_progress() multiple times from the equivalent of PAMI_Context_Advance.
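
For illustration, a minimal sketch of that last suggestion, assuming the collectives library owns a ucp_worker_h (the wrapper name here is made up): keep calling ucp_worker_progress() inside the advance equivalent until it reports no more work.

#include <ucp/api/ucp.h>

/* Hypothetical advance routine for the collectives library: drain as many
 * completed events as possible per call, rather than one packet per call.
 * ucp_worker_progress() returns non-zero while it found work to do. */
static void libcoll_context_advance(ucp_worker_h worker)
{
    while (ucp_worker_progress(worker) != 0) {
        /* keep polling until the worker reports no progress */
    }
}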

shamisp commented 4 years ago

@yosefe on the sender side you don't have to generate a completion for every send.

tjcw commented 4 years ago

As @yosefe suggested, I think I need to implement large UCP active messages over RDMA. I think the way to do this is to take a copy of src/ucp/tag/rndv.c and revise it so that it works for active messages. I have started a branch here https://github.com/tjcw/ucx/tree/tjcw-am-rdma for this work. I am interested in implementing for the case where the client passes an 'iovec' to the active message call. I think I will end up implementing another function with the same signature as 'ucp_am_send_nb' for the client to call if it wants to issue an active message with the data transferred by RDMA.
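
For concreteness, a sketch of what that new entry point might look like if it mirrors ucp_am_send_nb; the signature below is an assumption of this sketch (only the name ucp_am_rdma_send_nb appears later in the traces), not an existing UCX API.

#include <ucp/api/ucp.h>

/* Hypothetical AM-over-RDMA send with the same shape as ucp_am_send_nb():
 * 'payload'/'count'/'datatype' would typically describe an iovec
 * (UCP_DATATYPE_IOV), and the data would be moved by RDMA instead of being
 * segmented into eager packets. */
ucs_status_ptr_t ucp_am_rdma_send_nb(ucp_ep_h ep, uint16_t id,
                                     const void *payload, size_t count,
                                     ucp_datatype_t datatype,
                                     ucp_send_callback_t cb, unsigned flags);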

snyjm-18 commented 4 years ago

I am assuming that more frequent ucp_worker_progress() calls did not solve the performance issue.

If you need any help, please let me know. I would try to keep the API the same, and only add a flag if possible. Although it seems you need a new function call to specify the RDMA region.

tjcw commented 4 years ago

For many applications, it will not be practical to call ucp_worker_progress() more frequently; the application will be coded to do a certain amount of work, then call MPI_Test*() which will call opal_progress() which will call ucp_worker_progress() amongst other things.

tjcw commented 4 years ago

I have started work on active messages via RDMA. Code is in https://github.com/tjcw/ucx/tree/tjcw-am-rdma . I am intending to implement UCP AM as 3 UCT active messages: a 'request' flowing from the client to the server; a 'response' flowing from the server to the client; and a 'completion' flowing from the client to the server. The client will perform an RDMA write between the 'response' and 'completion' messages. Most of my revisions so far are in ucp_am.h; the revised section looks like

typedef struct {
  size_t            total_size; /* length of buffer needed for all data */
  uint64_t          msg_id;     /* method to match parts of the same AM */
  uintptr_t         ep;         /* endpoint ptr, used for maintaining list
                                   of arrivals */
  uint16_t          am_id;      /* index into callback array */
} UCS_S_PACKED ucp_am_rdma_header_t;

enum {
  packed_rkey_max_size = 32    /* Max supported size for a packed rkey */
};

typedef struct {
  uint64_t          msg_id;     /* method to match parts of the same AM */
  uintptr_t         ep;         /* endpoint ptr, used for maintaining list
                                   of arrivals */
  uint16_t          am_id;      /* index into callback array */
  uintptr_t         address;    /* Address for RDMA */
  char              rkey_buffer[packed_rkey_max_size]; /* Packed remote key */
} UCS_S_PACKED ucp_am_rdma_reply_header_t;

typedef struct {
  uint64_t          msg_id;     /* method to match parts of the same AM */
  uintptr_t         ep;         /* endpoint ptr, used for maintaining list
                                   of arrivals */
  uint16_t          am_id;      /* index into callback array */
} UCS_S_PACKED ucp_am_rdma_completion_header_t;

typedef struct {
  ucs_list_link_t   list;       /* entry into list of unfinished AMs */
  ucp_request_t    *req;        /* active message request */
  uint64_t          msg_id;     /* way to match up all parts of AM */
} ucp_am_rdma_client_unfinished_t;

typedef struct {
  ucs_list_link_t   list;       /* entry into list of unfinished AMs */
  ucp_recv_desc_t  *all_data;   /* buffer for all parts of the AM */
  uint64_t          msg_id;     /* way to match up all parts of AM */
} ucp_am_rdma_server_unfinished_t;

@snyjm-18 Does this look reasonable ? Also is there a ucp function call to do the RDMA ?

snyjm-18 commented 4 years ago

Looks mostly fine. A couple of small things: packed_rkey_max_size -> UCP_PACKED_RKEY_MAX_SIZE. There are some other style things that need to be changed: https://github.com/openucx/ucx/blob/master/doc/CodeStyle.md

For ucp_am_rdma_client_unfinished_t, is this to maintain a list of requests in between the request and response messages on the client side? If so, it all makes sense to me.

Yossi is probably the best person to ask about ucp functions to do the RDMA. You may want to look at ucp/rma/rma_send.c. It looks like ucp_rma_nonblocking(...) may be what you are looking for.
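
For reference, the public UCP API exposes the same path through ucp_ep_rkey_unpack() plus ucp_put_nb(); a minimal sketch follows (the helper name, callback and error handling are assumptions of this sketch).

#include <ucp/api/ucp.h>

/* Sketch only: unpack the peer's packed rkey and issue a non-blocking RDMA
 * write. 'send_cb' is a user-supplied ucp_send_callback_t. The rkey must
 * stay valid until the put completes; the caller releases it afterwards with
 * ucp_rkey_destroy(). */
static ucs_status_ptr_t rdma_write_sketch(ucp_ep_h ep, const void *buffer,
                                          size_t length, uint64_t remote_addr,
                                          const void *rkey_buffer,
                                          ucp_rkey_h *rkey_p,
                                          ucp_send_callback_t send_cb)
{
    ucs_status_t status = ucp_ep_rkey_unpack(ep, rkey_buffer, rkey_p);

    if (status != UCS_OK) {
        return UCS_STATUS_PTR(status);
    }

    /* Returns NULL if the put completed immediately, a request pointer if it
     * is still in progress, or an error pointer. */
    return ucp_put_nb(ep, buffer, length, remote_addr, *rkey_p, send_cb);
}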

tjcw commented 4 years ago

I have a first pass at the code, but it fails an assert in the AM server where it tries to find the endpoint of the client, in order to save context and reply later. My code is here https://github.com/tjcw/ucx/blob/tjcw-am-rdma/src/ucp/core/ucp_am.c ; the assert message is

[c712f6n02:60277:0:60277]  ucp_worker.h:308  Assertion `ep->worker == worker' failed: worker=0x100012220010 ep=0x10001bff0000 ep->worker=(nil)

and the piece of my code which trips the assert is

static ucs_status_t
ucp_am_rdma_handler(void *am_arg, void *am_data, size_t am_length,
                    unsigned am_flags)
{
    ucp_worker_h worker            = (ucp_worker_h)am_arg;
    ucp_am_hdr_t *hdr              = (ucp_am_hdr_t *)am_data;
    ucp_am_rdma_header_t *rdma_hdr = (ucp_am_rdma_header_t *)(hdr+1);
    ucp_ep_h ep                    = ucp_worker_get_ep_by_ptr(worker,
                                                           rdma_hdr->ep);

@yosefe @snyjm-18 I could do with some help figuring out what I am doing wrong. I'm assuming the UCT active message buffer on the server will contain the 8-byte item passed as 'header' to uct_ep_am_short, followed by the payload as passed to uct_ep_am_short, but I haven't seen this documented so I may be wrong. Also, where does the output from 'ucs_print' go ?

tjcw commented 4 years ago

It looks like the problem is that I didn't understand what ucp_worker_get_ep_by_ptr does. So @yosefe @snyjm-18 how does the server of an AM find the endpoint of the client, if it needs to reply ?

snyjm-18 commented 4 years ago

@tjcw that is what ucp_worker_get_ep_by_ptr does. There are examples to use it in the current ucp_am implementation, and also in eager_send.c for implementing ssend. I looked at your branch, where does rdma_hdr->ep get set before sending?

tjcw commented 4 years ago

@snyjm-18 I have this code

unfinished->rdma_header.ep         = (uintptr_t) ep ;

at line 652 of ucp_am.c, but this sets the 'ep' field to the server's endpoint in the client address space --- which is wrong. It needs to be the client's endpoint as seen in the server's address space. How do I find this value?

snyjm-18 commented 4 years ago

reply_hdr->ep_ptr = ucp_request_get_dest_ep_ptr(req); on the client and then reply_ep = ucp_worker_get_ep_by_ptr(worker, hdr->ep_ptr); on the server.

tjcw commented 4 years ago

I've made the change for

reply_hdr->ep_ptr = ucp_request_get_dest_ep_ptr(req);

but ucp_request_get_dest_ep_ptr is coming back with '0', and so in the AM server I am failing the assert at line 306. What should be setting up the value in the 'req' ?

snyjm-18 commented 4 years ago

Just try status = ucp_ep_resolve_dest_ep_ptr(ep, ep->am_lane); on the client side before calling ucp_am_send_req. It's used at line 455 in ucp_am.c. It looks like you are already doing that. I can take a closer look tonight.

tjcw commented 4 years ago

My code looks like

    ucp_am_send_req_init(req, ep, &(unfinished->rdma_header), UCP_DATATYPE_CONTIG, sizeof(ucp_am_rdma_header_t), flags, id);
    status = ucp_ep_resolve_dest_ep_ptr(ep, ep->am_lane);
    if (ucs_unlikely(status != UCS_OK)) {
        ret = UCS_STATUS_PTR(status);
        goto out;
    }

    length = ucp_dt_iov_length(iovec, count);
    unfinished->rdma_header.total_size = length ;
    unfinished->rdma_header.msg_id     = req->send.am.message_id ;
    unfinished->rdma_header.ep_ptr      = ucp_request_get_dest_ep_ptr(req) ;

( https://github.com/tjcw/ucx/blob/tjcw-am-rdma/src/ucp/core/ucp_am.c#L642 ), but I'm still getting ep_ptr being 0. What could make ucp_ep_resolve_dest_ep_ptr not do its job but still come back with UCS_OK?

tjcw commented 4 years ago

Does the worker progress function need to be called a few times for the wireup to complete and the destination ep_ptr to become valid ? In my case I'm trying to do the AM-over-RDMA as the first application communication to be done between these communication partners. I'll try making a modification to the active message 'gtest' to flow a test active message over RDMA, to give you a test case.

snyjm-18 commented 4 years ago

That is a question for Yossi, but I think if wireup is not completed, it may not know the ep_ptr yet.

tjcw commented 4 years ago

@snyjm-18 I have revised test_ucp_am::do_send_process_data_iov_test to add an AM-over-RDMA test. I'm running all the gtests at the moment; how can I run just the active message gtest ?

snyjm-18 commented 4 years ago

./gtest --gtest_filter=ucp_am*

tjcw commented 4 years ago

I've figured out how to run just the ucp am tests

./gtest --gtest_filter=*ucp_am*

does it. Running this with my modified gtest fails the same assert that I reported above;

[ RUN      ] dcx/test_ucp_am.send_process_iov_am/0
[1567782322.337650] [f8n02:10545:0]         ucp_am.c:611  UCX  WARN  AM RDMA am_id=0
[1567782322.338459] [f8n02:10545:0]         ucp_am.c:232  UCX  WARN  AM RDMA ucp_am_send_rdma_short header=0x000000000000003a
[1567782322.338469] [f8n02:10545:0]         ucp_am.c:234  UCX  WARN  AM RDMA ucp_am_send_rdma_short payload=(total_size=64,msg_id=0x0,ep_ptr=0,am_id=0)
[1567782322.338483] [f8n02:10545:0]         ucp_am.c:663  UCX  WARN  AM RDMA ucp_am_send_rdma_req ret=0x447f6ae8
[1567782322.339274] [f8n02:10545:0]         ucp_am.c:232  UCX  WARN  AM RDMA ucp_am_send_rdma_short header=0x000000000000003a
[1567782322.339282] [f8n02:10545:0]         ucp_am.c:234  UCX  WARN  AM RDMA ucp_am_send_rdma_short payload=(total_size=64,msg_id=0x0,ep_ptr=0,am_id=0)
[1567782322.339290] [f8n02:10545:0]         ucp_am.c:593  UCX  WARN  AM RDMA callback request=0x447f6ae8 status=0
[1567782322.339354] [f8n02:10545:0]         ucp_am.c:932  UCX  WARN  AM RDMA hdr=0x000000000000003a
[1567782322.339360] [f8n02:10545:0]         ucp_am.c:935  UCX  WARN  AM RDMA ucp_am_rdma_handler rdma_hdr=(total_size=64,msg_id=0x0,ep_ptr=0,am_id=0)
[f8n02:10545:0:10545]  ucp_worker.h:306  Assertion `ep != ((void *)0)' failed
==== backtrace ====

@yosefe what do I need to do so that the ep_ptr will be set non-zero when I need it ?

tjcw commented 4 years ago

@snyjm-18 assuming we solve the ep_ptr problem, I'm expecting to run into another problem. At the moment, I issue a uct active message call in the completion callback of the ucp_put_nb call. In previous usage of ucx, when I issued a ucp active message call in the completion callback of another ucp active message call, I got an assertion failure. Will I fail the same assertion again ? I can think of a way to avoid this; it involves a change to the client application so that there are 2 calls to ucp_am, the first one to cause the data to be transferred and the second to issue the active message call in the server. The second call can be scheduled to be issued after the ucp_worker_progress issuing the completion callback completes. I'd prefer not to have to make this change to the client application, as it makes it diverge more from the way that IBM PAMI works.

tjcw commented 4 years ago

In branch tjcw-am-rdma-deferred-test, I have moved the AM-over-RDMA tests after the AM tests. With this branch, I no longer get a zero ep_ptr, but the test fails with

ucp_rkey.c:117  UCX  ERROR feature flags UCP_FEATURE_RMA|UCP_FEATURE_AMO32|UCP_FEATURE_AMO64 were not set for ucp_init()

I'll try and find the place in gtest where I need to add UCP_FEATURE_RMA.
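
For reference, a context-setup sketch that requests the missing feature; in the gtest the features actually come from the test harness's context parameters, so this standalone ucp_init() call is only illustrative of the flags needed.

#include <ucp/api/ucp.h>

/* Sketch: the error above means the context was created without
 * UCP_FEATURE_RMA, so rkey operations are refused. Requesting RMA alongside
 * AM at ucp_init() time should clear it. */
static ucs_status_t init_context_with_rma(ucp_context_h *context_p)
{
    ucp_params_t params;
    ucp_config_t *config;
    ucs_status_t status;

    status = ucp_config_read(NULL, NULL, &config);
    if (status != UCS_OK) {
        return status;
    }

    params.field_mask = UCP_PARAM_FIELD_FEATURES;
    params.features   = UCP_FEATURE_AM | UCP_FEATURE_RMA;

    status = ucp_init(&params, config, context_p);
    ucp_config_release(config);
    return status;
}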

tjcw commented 4 years ago

After a few more revisions, in branch tjcw-am-rdma-deferred-test the gtest is getting as far as trying to register the active message buffer in preparation for RDMA. I'm now getting

ib_md.c:273  UCX  ERROR ibv_exp_reg_mr(address=0x200015dd0000, length=4629771061636907072, access=0xf) failed: Cannot allocate memory

which I'm not expecting; I'm setting up the call to ucp_mem_map with

    map_params.field_mask = UCP_MEM_MAP_PARAM_FIELD_ADDRESS |
                            UCP_MEM_MAP_PARAM_FIELD_LENGTH ;
    map_params.address    = all_data + 1 ;
    map_params.length     = rdma_hdr->total_size ;
    status=ucp_mem_map(worker->context,&map_params,&memh) ;

and I have previously displayed that rdma_hdr->total_size is '64'. So what is the problem here?

Altogether I now have 2 problems: branch tjcw-am-rdma, where "./gtest --gtest_filter=ucp_am" fails with ep_ptr being zero, and branch tjcw-am-rdma-deferred-test, where "./gtest --gtest_filter=ucp_am" hangs while trying to register an impossibly long memory region.
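
For reference, a registration sketch under the assumption that the address and length have been read correctly from the request header and sanity-checked; the helper name is made up. The enormous length reported by ibv_exp_reg_mr above suggests the size is being picked up from the wrong offset or from uninitialised memory rather than from total_size.

#include <ucp/api/ucp.h>

/* Minimal registration sketch: 'base' and 'size' are assumed to have been
 * validated (e.g. size checked against a sane maximum) before this call. */
static ucs_status_t register_am_buffer(ucp_context_h context, void *base,
                                       size_t size, ucp_mem_h *memh_p)
{
    ucp_mem_map_params_t map_params;

    map_params.field_mask = UCP_MEM_MAP_PARAM_FIELD_ADDRESS |
                            UCP_MEM_MAP_PARAM_FIELD_LENGTH;
    map_params.address    = base;
    map_params.length     = size;

    return ucp_mem_map(context, &map_params, memh_p);
}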

tjcw commented 4 years ago

Both the 'gtest' and my application fail the assert with "Assertion `ep != ((void *)0)' failed" in ucp/core/ucp_worker.h:306 . @yosefe , what do I have to do so that the ep_ptr is set in the AM client so that it can be passed to the server ?

yosefe commented 4 years ago

@tjcw please see the comment inside ucp_request_get_dest_ep_ptr; it means you need to call this function again inside the progress function to update rdma_header.

tjcw commented 4 years ago

@yosefe How do I get called again inside the progress function ?

yosefe commented 4 years ago

The request should be added to the pending queue, since the RTS send would return UCS_ERR_NO_RESOURCE, because ucp_request_resolve_dest_ep_ptr() temporarily replaces the uct_ep with a wireup proxy ep. The pending callback should probe ucp_request_get_dest_ep_ptr() before sending the RTS and not cache this on the request. To put it simply, just get rid of the rdma_header field and build it on the fly.

tjcw commented 4 years ago

I made that change; my sending function now looks like

static ucs_status_t ucp_am_rdma_contig_short(uct_pending_req_t *self)
{
    ucp_request_t *req   = ucs_container_of(self, ucp_request_t, send.uct);
    uintptr_t ep_ptr = ucp_request_get_dest_ep_ptr(req) ;
    ucp_am_rdma_header_t *rdma_hdr = (ucp_am_rdma_header_t *)req->send.buffer ;
    ucs_warn("AM RDMA ucp_am_rdma_contig_short ep_ptr now=%lu", ep_ptr) ;
    rdma_hdr->ep_ptr = ep_ptr ;
    ucs_status_t status = ucp_am_send_rdma_short(req->send.ep,req->send.buffer) ;
    ucs_warn("AM RDMA ucp_am_send_rdma_short returns %d", status) ;
    if (ucs_likely(status == UCS_OK)) {
        ucp_request_complete_send(req, UCS_OK);
    }
    return status ;
}

Now I'm getting past the problem of the initial send having no reply endpoint, but the RDMA write is coming back with an InfiniBand error. The error message is

[1,0]<stderr>:[c712f6n01:62585:0:62585] ib_mlx5_log.c:139  Remote access on mlx5_0:1/IB (synd 0x13 vend 0x88 hw_synd 0/0)
[1,0]<stderr>:[c712f6n01:62585:0:62585] ib_mlx5_log.c:139  RC QP 0x3d2 wqe[4]: RDMA_WRITE --- [rva 0x11004d35dfab rkey 0x1400] [inl len 2]

and the backtrace is

#6  0x000010000310fb50 in ucs_log_dispatch (
    file=0x1000120c0d00 "mlx5/ib_mlx5_log.c", line=<optimized out>, 
    function=0x1000120c0c70 <__FUNCTION__.15990> "uct_ib_mlx5_completion_with_err", level=<optimized out>, 
    format=0x1000120c0e88 "%s on %s:%d/%s (synd 0x%x vend 0x%x hw_synd %d/%d)\n%s QP 0x%x wqe[%d]: %s") at debug/log.c:191
#7  0x0000100012062388 in uct_ib_mlx5_completion_with_err (
    iface=<optimized out>, ecqe=<optimized out>, txwq=<optimized out>, 
    log_level=<optimized out>) at mlx5/ib_mlx5_log.c:132
#8  0x00001000120839e8 in uct_rc_mlx5_iface_handle_failure (
    ib_iface=0x10026a3a5e0, arg=0x1000174d0040, status=<optimized out>)
    at rc/accel/rc_mlx5_iface.c:206
#9  0x00001000120638c8 in uct_ib_mlx5_check_completion (iface=0x10026a3a5e0, 
    cq=<optimized out>, cqe=0x1000174d0040) at mlx5/ib_mlx5.c:340
#10 0x0000100012085448 in uct_ib_mlx5_poll_cq (cq=0x10026a42cb8, 
    iface=0x10026a3a5e0)
    at /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/opensrc/myucx/src/uct/ib/mlx5/ib_mlx5.inl:38
#11 uct_rc_mlx5_iface_poll_tx (iface=0x10026a3a5e0)
    at rc/accel/rc_mlx5_iface.c:89
#12 uct_rc_mlx5_iface_progress (arg=0x10026a3a5e0)
    at rc/accel/rc_mlx5_iface.c:124
#13 0x0000100003002c74 in ucs_callbackq_dispatch (cbq=<optimized out>)
    at /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/opensrc/myucx/src/ucs/datastruct/callbackq.h:211
#14 uct_worker_progress (worker=<optimized out>)

@yosefe What is this error message telling me, and how do I make it not happen ?

yosefe commented 4 years ago

@tjcw please check that the va and rkey are correct and valid when the RDMA write arrives. Also, why RDMA write and not RDMA read?

tjcw commented 4 years ago

OK, I will check the va and rkey. On the sending side, I'm not calling ucp_mem_map to map the memory which needs to be sent with RDMA. @yosefe Do I need to do this for the IB adapter to be able to read the memory? I was originally doing an RDMA write because I expected to write 2 elements of the 'iovec' with one call to a function; IB supports an 'iovec' for local memory but not for remote memory. But the ucp function for this doesn't take an iovec parameter, so this reason no longer applies. Now, I'm doing an RDMA write because I will need to contact the 'other side' when the RDMA completes. I don't think I will be able to issue a uct active message from the RDMA completion handler callback (I have previously been unable to issue ucp active messages from AM completion handlers), so I will have to go back to the AM client and issue this call in the next 'progress' call.

yosefe commented 4 years ago

Yes, for rdma_read you will need to do the equivalent of ucp_mem_map on the sender side. The tag matching protocol today uses rdma_read (by default) with a uct active message for notification. rdma_read can be better because it does not require a 2nd progress on the sender side to start the data transfer.
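
For illustration, a get-based target side might look like the following sketch, using the public ucp_ep_rkey_unpack() and ucp_get_nb(); the helper name and callback are assumptions of this sketch, not part of the protocol being discussed.

#include <ucp/api/ucp.h>

/* Sketch: unpack the sender's rkey from the RTS and pull 'length' bytes into
 * 'local_buf'. 'done_cb' is a user-supplied ucp_send_callback_t that could
 * dispatch the AM handler and send the completion notification back. The
 * rkey must stay valid until the get completes; destroy it from done_cb
 * (or once ucp_request_check_status() reports completion). */
static ucs_status_ptr_t pull_am_payload(ucp_ep_h ep, void *local_buf,
                                        size_t length, uint64_t remote_addr,
                                        const void *rkey_buffer,
                                        ucp_rkey_h *rkey_p,
                                        ucp_send_callback_t done_cb)
{
    ucs_status_t status = ucp_ep_rkey_unpack(ep, rkey_buffer, rkey_p);

    if (status != UCS_OK) {
        return UCS_STATUS_PTR(status);
    }

    return ucp_get_nb(ep, local_buf, length, remote_addr, *rkey_p, done_cb);
}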

tjcw commented 4 years ago

I got my application test case working. Now I'm trying to get my 'gtest' working. The first AM-via-RDMA (which connects the endpoints) works, but the second AM-via-RDMA (which runs over already-connected endpoints) fails with

ud_verbs.c:149  UCX  ERROR Invalid am_short length: 170 (expected: <= 124)

This is in the AM server, where it is trying to reply to the AM client with a message which includes the 'rkey' for the memory region in the server where the AM data is to be placed. Presumably I'm supposed to use ucp_am_bcopy_single or ucp_am_zcopy_single, or something similar to these routines, rather than the 'short' function. @snyjm-18 @yosefe Which of these routines do I need to use, and is there any guidance for how to call them ? My code (with 'gtest', run it with 'gtest --gtest_filter=ucp_am') is here https://github.com/tjcw/ucx/tree/tjcw-am-rdma ( revision fc9ad54633dce56edb3434f62f442bd30fad45ff ) if you want to see what I am trying to do and the error that I get.

snyjm-18 commented 4 years ago

There are already some functions to help with this in ucp_am.c: it happens in ucp_am_send_req(), and then the decision is made within ucp_request_send_start().

tjcw commented 4 years ago

I have revised my code to use a version of ucp_am_bcopy_single_reply for the 'reply' active message. Now, the dcx/test_ucp_am.send_process_iov_am/0 test runs, but the ud/test_ucp_am.send_process_iov_am/0 test hangs. I get messages

[1568648885.845669] [f8n02:4888 :0]          mpool.c:43   UCX  WARN  object 0x4f9a6a80 was not returned to mpool ucp_requests
[1568648885.845673] [f8n02:4888 :0]          mpool.c:43   UCX  WARN  object 0x4f9a6b80 was not returned to mpool ucp_requests
common/test.cc:269: Failure
Failed
Got 69 warnings during the test
[  FAILED  ] dcx/test_ucp_am.send_process_iov_am/0, where GetParam() = dc_x (238 ms)

from the first test; I presume 'gtest' thinks the test fails because I am using 'ucs_warn' to issue trace messages. The 'not returned to mpool' messages are related to the 2 'req's that are allocated to support the initial uct AMs from client to server. I think I am returning these to the pool: in one case (when endpoints aren't bound) by calling 'ucp_request_free' from the client-side callback, and in the other case (when endpoints are bound and the client completes inline) by calling ucp_request_put from ucp_am_rdma_send_req. @snyjm-18 I'd appreciate an opinion as to why UCX thinks I am not freeing these. Here is the run log of the 'gtest': gtest.txt. My code is here https://github.com/tjcw/ucx/tree/tjcw-am-rdma commit 1ff359ae23189469ea2efa4a53c62363b0ca5199 if you want to run the 'gtest' yourself.

snyjm-18 commented 4 years ago

I can look later this week.

tjcw commented 4 years ago

I have plugged some leaks, but I still get messages about request structures not being freed. I think the problem is that I don't understand the 'flags' in a request; what is supposed to set them, and which functions take what action when they are set. @yosefe how do I turn on trace so that 'ucs_trace_req' (UCS_LOG_LEVEL_TRACE_REQ) messages will be displayed ? I think with that trace on I might understand better how the req flags work. Also, I'm currently sending a packed rkey in an AM from the server to the client. I do this by copying the result of ucp_rkey_pack into a fixed-size array in the AM data. One of my gtest tests sets the packed rkey size to 159; @yosefe is there a limit to the packed rkey size, and if so what is it ? My current commit is 14d4cb9f2a3c8489850a2130afe7d3ebbbe697bb .
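
For illustration, a sketch of packing the rkey into a fixed-size header field with an explicit size check; the helper name is an assumption. Note that a 159-byte packed rkey would not fit in the 32-byte packed_rkey_max_size field declared earlier, so the size returned by ucp_rkey_pack has to be checked rather than assumed.

#include <string.h>
#include <ucp/api/ucp.h>

/* Sketch: pack the rkey for 'memh' and copy it into the reply header's
 * fixed-size buffer, failing cleanly if the packed rkey is too large. */
static ucs_status_t pack_rkey_into_reply(ucp_context_h context, ucp_mem_h memh,
                                         char *dest, size_t dest_size)
{
    void        *rkey_buffer;
    size_t       rkey_size;
    ucs_status_t status;

    status = ucp_rkey_pack(context, memh, &rkey_buffer, &rkey_size);
    if (status != UCS_OK) {
        return status;
    }

    if (rkey_size > dest_size) {
        status = UCS_ERR_BUFFER_TOO_SMALL;  /* header field is too small */
    } else {
        memcpy(dest, rkey_buffer, rkey_size);
    }

    ucp_rkey_buffer_release(rkey_buffer);
    return status;
}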

tjcw commented 4 years ago

I turned trace on with "export UCX_LOG_LEVEL=req", and got the following trace from running gtest: gtest.txt. Hunting through this for a leaked 'req' gives

[f8n02][/smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/opensrc/myucx/test/gtest]> fgrep 0x1b396f80 gtest.log.3
[1568723281.072152] [f8n02:77178:0]         ucp_am.c:702  UCX  REQ   allocated request 0x1b396f80
[1568723281.072154] [f8n02:77178:0]         ucp_am.c:707  UCX  WARN  ucp_am_rdma_send_nb req=0x1b396f80
[1568723281.072175] [f8n02:77178:0]  ucp_request.inl:95   UCX  REQ   completing send request 0x1b396f80 (0x1b397068) ------- Success
[1568723281.072178] [f8n02:77178:0]         ucp_am.c:579  UCX  REQ   releasing send request 0x1b396f80, returning status Success
[1568723281.072180] [f8n02:77178:0]  ucp_request.inl:85   UCX  REQ   put request 0x1b396f80
[1568723281.122085] [f8n02:77178:0]          mpool.c:43   UCX  WARN  object 0x1b396f80 was not returned to mpool ucp_requests
[f8n02][/smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/opensrc/myucx/test/gtest]>

It looks to me like this 'req' is being allocated and then freed; so @yosefe why is UCX indicating that the 'req' was not returned to the mpool?

tjcw commented 4 years ago

Running 'gtest' with "export UCX_LOG_LEVEL=req" seems to show that the reqs are leaked when they are used for ucp_put_nb. Searching for a couple of leaked addresses in a trace shows

[f8n02][/smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/opensrc/myucx/test/gtest]> egrep "0x20708f00|0x20709000" gtest.log.5
[1568724951.562096] [f8n02:81196:0]         ucp_am.c:700  UCX  REQ   allocated request 0x20709000
[1568724951.562098] [f8n02:81196:0]         ucp_am.c:705  UCX  TRACE ucp_am_rdma_send_nb req=0x20709000
[1568724951.562541] [f8n02:81196:0]  ucp_request.inl:95   UCX  REQ   completing send request 0x20709000 (0x207090e8) ----c-- Success
[1568724951.562549] [f8n02:81196:0]    ucp_request.c:77   UCX  REQ   free request 0x20709000 (0x207090e8) ----c--
[1568724951.562551] [f8n02:81196:0]  ucp_request.inl:85   UCX  REQ   put request 0x20709000
[1568724951.562707] [f8n02:81196:0]       rma_send.c:184  UCX  REQ   allocated request 0x20709000
[1568724951.562721] [f8n02:81196:0]    ucp_request.c:195  UCX  REQ   req 0x20709000: mem dereg buffer 0/1 md_map 0x0
[1568724951.562725] [f8n02:81196:0]  ucp_request.inl:95   UCX  REQ   completing send request 0x20709000 (0x207090e8) ------- Success
[1568724951.562730] [f8n02:81196:0]          rma.inl:24   UCX  REQ   releasing send request 0x20709000, returning status Success
[1568724951.563222] [f8n02:81196:0]         ucp_am.c:700  UCX  REQ   allocated request 0x20708f00
[1568724951.563225] [f8n02:81196:0]         ucp_am.c:705  UCX  TRACE ucp_am_rdma_send_nb req=0x20708f00
[1568724951.563266] [f8n02:81196:0]  ucp_request.inl:95   UCX  REQ   completing send request 0x20708f00 (0x20708fe8) ------- Success
[1568724951.563271] [f8n02:81196:0]         ucp_am.c:577  UCX  REQ   releasing send request 0x20708f00, returning status Success
[1568724951.563274] [f8n02:81196:0]  ucp_request.inl:85   UCX  REQ   put request 0x20708f00
[1568724951.563472] [f8n02:81196:0]       rma_send.c:184  UCX  REQ   allocated request 0x20708f00
[1568724951.563484] [f8n02:81196:0]    ucp_request.c:195  UCX  REQ   req 0x20708f00: mem dereg buffer 0/1 md_map 0x0
[1568724951.563489] [f8n02:81196:0]  ucp_request.inl:95   UCX  REQ   completing send request 0x20708f00 (0x20708fe8) ------- Success
[1568724951.563493] [f8n02:81196:0]          rma.inl:24   UCX  REQ   releasing send request 0x20708f00, returning status Success
[1568724951.580135] [f8n02:81196:0]          mpool.c:43   UCX  WARN  object 0x20708f00 was not returned to mpool ucp_requests
[1568724951.580137] [f8n02:81196:0]          mpool.c:43   UCX  WARN  object 0x20709000 was not returned to mpool ucp_requests
[f8n02][/smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/opensrc/myucx/test/gtest]>

Looking at file rma.inl shows

static UCS_F_ALWAYS_INLINE ucs_status_ptr_t
ucp_rma_send_request_cb(ucp_request_t *req, ucp_send_callback_t cb)
{
    ucs_status_t status = ucp_request_send(req, 0);

    if (req->flags & UCP_REQUEST_FLAG_COMPLETED) {
        ucs_trace_req("releasing send request %p, returning status %s", req,
                      ucs_status_string(status));
        ucs_mpool_put(req);
        return UCS_STATUS_PTR(status);
    }

which looks correct: the 'req' is returned to the pool with ucs_mpool_put. So why is the 'req' leaking?

tjcw commented 4 years ago

I found a problem; I was reusing a 'req', and ended up freeing it twice. I'm working on fixing that; the current problem is a hang. But I think the problem is in my code at the moment.

tjcw commented 4 years ago

I have my code functional now, both for the application and the 'gtest'. As an aside, the 'gtest' for active messages doesn't actually test that the callback works; it relies on ucp_request_check_status to determine whether the AM is still in progress. Next steps are to explore whether the code performs the way I want for nonblocking MPI collective calls, and to clean the code up so that it conforms to the UCX coding guidelines. I'm currently at commit 6d80414d071d85501828472023dd1d2691cb66eb.

tjcw commented 4 years ago

I'm doing some more testing. Trying active message over RDMA and using ucp_put_nb for a 1-byte memory region, I find that often the target byte is not written with the correct value. @yosefe are there any restrictions on the lengths that can be carried correctly by ucp_put_nb? On doing more testing, I'm finding that the target address of the ucp_put_nb is always being left unmodified. I also frequently get messages from 'glibc' indicating that the heap has been corrupted, suggesting that the data is being written to a different address from that specified in the ucp_put_nb (or possibly being deferred in such a way that the 'completion' uct am is overtaking the RDMA put).

tjcw commented 4 years ago

After more testing, and some fixing of my code, I think there is a problem with the in-memory transport when doing an RDMA 'put'. I have added a checker in the server to look at the first byte and the last byte of the memory block transferred; this checker is telling me that the first byte is not what it should be. The assert message is

[1,2]<stderr>:[c712f6n01:45008:0:45008]      ucp_am.c:1368 Assertion `payload_data_first == unfinished->iovec_1_first_byte' failed

and the backtrace is

(gdb) where
#0  0x000010000062f568 in pause () from /lib64/libc.so.6
#1  0x00001000030faae4 in ucs_debug_freeze () at debug/debug.c:710
#2  0x00001000030fe5f0 in ucs_error_freeze (message=0x3fffc617d2a0 "Assertion `payload_data_first == unfinished->iovec_1_first_byte' failed") at debug/debug.c:829
#3  ucs_handle_error (message=0x3fffc617d2a0 "Assertion `payload_data_first == unfinished->iovec_1_first_byte' failed") at debug/debug.c:992
#4  0x00001000030fa898 in ucs_fatal_error_message (file=0x10000304a118 "core/ucp_am.c", line=<optimized out>, function=<optimized out>, 
    message_buf=0x3fffc617d2a0 "Assertion `payload_data_first == unfinished->iovec_1_first_byte' failed") at debug/assert.c:33
#5  0x00001000030faa64 in ucs_fatal_error_format (file=0x10000304a118 "core/ucp_am.c", line=<optimized out>, 
    function=0x100003049db0 <__FUNCTION__.14370> "ucp_am_rdma_completion_handler", format=<optimized out>) at debug/assert.c:49
#6  0x0000100002fd8998 in ucp_am_rdma_completion_handler (am_arg=0x100021390010, am_data=<optimized out>, am_length=<optimized out>, am_flags=<optimized out>)
    at core/ucp_am.c:1368
#7  0x0000100003095edc in uct_iface_invoke_am (flags=0, length=24, data=<optimized out>, id=<optimized out>, iface=0x1000213f0680)
    at /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/opensrc/myucx/src/uct/base/uct_iface.h:628
#8  uct_mm_iface_invoke_am (flags=0, length=24, data=<optimized out>, am_id=<optimized out>, iface=0x1000213f0680) at sm/mm/base/mm_iface.h:151
#9  uct_mm_iface_process_recv (iface=0x1001e653c10, elem=0x1000213f0680) at sm/mm/base/mm_iface.c:208
#10 0x00001000030961dc in uct_mm_iface_poll_fifo (iface=0x1001e653c10) at sm/mm/base/mm_iface.c:254
#11 uct_mm_iface_progress (arg=0x1001e653c10) at sm/mm/base/mm_iface.c:278
#12 0x0000100002ff5904 in ucs_callbackq_dispatch (cbq=<optimized out>) at /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/opensrc/myucx/src/ucs/datastruct/callbackq.h:211
#13 uct_worker_progress (worker=<optimized out>) at /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/opensrc/myucx/src/uct/api/uct.h:2203
#14 ucp_worker_progress (worker=0x100021390010) at core/ucp_worker.c:1892
#15 0x0000100002ccfa9c in LibColl::Adapter::UCXContext::advance (this=0x1001e9ad580) at ../../adapter/ucx/UCXContext.h:334
#16 0x0000100002cb9318 in LIBCOLL_Advance (ctxt=<optimized out>) at libcoll.cc:161
#17 0x0000100002ee2748 in blocking_coll (context=0x1001e9ad580, coll=0x3fffc618f590, active=0x3fffc618f74c) at ../collsel/init_util.h:130
#18 0x0000100002eefbe4 in LibColl::AdvisorTable::generate (this=0x1001ebce8b0, filename=0x0, params=0x3fffc618fbd0, ops=0x100002bf17b0 <external_geometry_ops>, mode=<optimized out>) at ../collsel/AdvisorTable.h:598
#19 0x0000100002ee59f0 in LibColl::CollselExtension::Collsel_table_generate (advisor=<optimized out>, filename=0x0, params=<optimized out>, ops=<optimized out>, 
    mode=<optimized out>) at ../collsel/CollselExtension.cc:95
#20 0x0000100002ee5a48 in Collsel_table_generate (advisor=<optimized out>, filename=<optimized out>, params=<optimized out>, ops=<optimized out>, mode=<optimized out>)
    at CollselExtension.cc:26
#21 0x0000100002bcc4f0 in mca_coll_ibm_tune () from /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/exports/optimized/lib/spectrum_mpi/mca_coll_ibm.so
#22 0x0000100002ba4550 in mca_coll_ibm_hook_at_mpi_init_bottom () from /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/exports/optimized/lib/spectrum_mpi/mca_coll_ibm.so
#23 0x00001000001cb120 in ompi_hook_base_mpi_init_bottom () from /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/exports/optimized/lib/libmpi_ibm.so.3
#24 0x0000100000149ebc in ompi_mpi_init () from /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/exports/optimized/lib/libmpi_ibm.so.3
#25 0x000010000017f8f8 in PMPI_Init () from /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/exports/optimized/lib/libmpi_ibm.so.3
#26 0x0000000010000888 in main ()

The test is failing close to the beginning of the application run, so it is likely that it is failing on a 1-byte ucp_put_nb. Tomorrow I will turn on trace and report more about what is happening. @yosefe are there any known issues with in-memory transport ? When I configured the test to run with 2 processes on different nodes (over IB), it ran cleanly.

tjcw commented 4 years ago

I turned on UCX logging to investigate. The failing transfer was a 1-byte 'ucp_put_nb' over the in-memory (knem) transport. The data was either not written, or written to a different address than that specified by the ucp_put_nb. The ucp_put_nb returned with result UCS_OK, indicating that the 'put' had been done synchronously and ucp_put_nb thought it had succeeded. The receiving side had a memory registration for 33 bytes starting 32 bytes below the RDMA target address, and the sending side had a memory registration for 1 byte at the RDMA source address. This is using 'knem' on IBM POWER8. I will adjust my test case to find out whether in-memory transfers larger than 1 byte are working.