Open tjcw opened 4 years ago
I agree that seems like an excessive number of calls. Are you checking the return value of ucp_worker_progress?
I'm currently ignoring the return value. What would you like me to do with the return value ?
The return value indicates how much work it did. I believe that if it returns 0, no work was done, so the send has completed and there is no reason to keep calling it. @yosefe @shamisp please confirm.
I am now reporting how many times ucp_worker_progress comes back with a non-zero return code; see 'progressedCount' in the new trace file here run_abbrev_2.log . The 'progressedCount' values are much lower than the 'advancecount' values. The application keeps calling ucp_worker_progress (each time after doing its floating-point work) until the MPI_Ibcast completes; even if one call to ucp_worker_progress comes back zero, a transfer can complete on the Infiniband (incoming or outgoing) which will enable a subsequent ucp_worker_progress call to make progress. So I still don't know why IBM PAMI requires so many fewer calls to 'advance' to complete the MPI_Ibcast. Some of this may be that IBM PAMI establishes all communication endpoints at startup, whereas my implementation on UCX defers establishing communication endpoints until the first time a rank needs to send an active message to another rank; but this is not the whole story.
I'm interested in where the time is going. It doesn't seem like it's on the send side, and may be on the receive side. If it's on the receive side, it may be related to your other open issue.
Here is the trace from the same test but using IBM PAMI run_pami_abbrev.log . I haven't annotated it with 'advance' counts because in this case progress is done by a different method which makes it difficult to count. The trace shows that transfers over the IB are being done in 32MB units, i.e. the application broadcast buffer is being transferred as a single chunk. In the UCX run, are the transfers done with the 'iovec' as a single chunk, or is it split down into smaller blocks with a ucp_worker_progress call needed for each block ?
The iovec will not be transferred as one chunk because 32MB won't fit in a single packet. It should not require a ucp_worker_progress call for each packet. If it required a call for each packet, the number of non-zero ucp_worker_progress calls would be much higher than those in run_abbrev_2.log. I am not 100% on this and @yosefe or @shamisp should confirm.
> By comparison with the IBM PAMI implementation, the UCP implementation requires about 10x as many calls to 'advance' (ucp_worker_progress). This results in a 10x longer elapsed time to complete the broadcasts. I am running with a 'release' build of UCX from the current master branch.
Each ucp_worker_progress call processes a single packet. 32MB would be split into ~4000 packets of 8k over a typical IB transport. However, the mere overhead of a progress function call is very small compared to the time it takes to receive an 8k message, so I highly doubt this is the root cause. If the frequency of calling progress/advance is too low (that is, a better overlap between communication and computation is required), I'd suggest implementing large UCP active messages over RDMA, or using the tagged API for the collectives library broadcast. Or, call ucp_worker_progress() multiple times from the equivalent of PAMI_Context_Advance.
@yosefe on the sender side you don't have to generate a completion for every send
As @yosefe suggested, I think I need to implement large UCP active messages over RDMA. I think the way to do this is to take a copy of src/ucp/tag/rndv.c and revise it so that it works for active messages. I have started a branch here https://github.com/tjcw/ucx/tree/tjcw-am-rdma for this work. I am interested in implementing for the case where the client passes an 'iovec' to the active message call. I think I will end up implementing another function with the same signature as 'ucp_am_send_nb' for the client to call if it wants to issue an active message with the data transferred by RDMA.
I am assuming that more frequent ucp_worker_progress() calls did not solve the performance issue.
If you need any help, please let me know. I would try to keep the API the same, and only add a flag if possible. Although it seems you need a new function call to specify the RDMA region.
For many applications, it will not be practical to call ucp_worker_progress() more frequently; the application will be coded to do a certain amount of work, then call MPI_Test*() which will call opal_progress() which will call ucp_worker_progress() amongst other things.
I have started work on active messages via RDMA. Code is in https://github.com/tjcw/ucx/tree/tjcw-am-rdma . I am intending to implement UCP AM as 3 UCT active messages: a 'request' flowing from the client to the server, a 'response' flowing from the server to the client, and a 'completion' flowing from the client to the server. The client will perform an RDMA write between the 'response' and 'completion' messages. Most of my revisions so far are in ucp_am.h ; the revised section looks like
typedef struct {
    size_t    total_size;   /* length of buffer needed for all data */
    uint64_t  msg_id;       /* method to match parts of the same AM */
    uintptr_t ep;           /* end point ptr, used for maintaining
                               list of arrivals */
    uint16_t  am_id;        /* index into callback array */
} UCS_S_PACKED ucp_am_rdma_header_t;

enum {
    packed_rkey_max_size = 32 /* Max supported size for a packed rkey */
};

typedef struct {
    uint64_t  msg_id;       /* method to match parts of the same AM */
    uintptr_t ep;           /* end point ptr, used for maintaining
                               list of arrivals */
    uint16_t  am_id;        /* index into callback array */
    uintptr_t address;      /* Address for RDMA */
    char      rkey_buffer[packed_rkey_max_size]; /* Packed remote key */
} UCS_S_PACKED ucp_am_rdma_reply_header_t;

typedef struct {
    uint64_t  msg_id;       /* method to match parts of the same AM */
    uintptr_t ep;           /* end point ptr, used for maintaining
                               list of arrivals */
    uint16_t  am_id;        /* index into callback array */
} UCS_S_PACKED ucp_am_rdma_completion_header_t;

typedef struct {
    ucs_list_link_t list;   /* entry in list of unfinished AMs */
    ucp_request_t  *req;    /* active message request */
    uint64_t        msg_id; /* way to match up all parts of the AM */
} ucp_am_rdma_client_unfinished_t;

typedef struct {
    ucs_list_link_t  list;     /* entry in list of unfinished AMs */
    ucp_recv_desc_t *all_data; /* buffer for all parts of the AM */
    uint64_t         msg_id;   /* way to match up all parts of the AM */
} ucp_am_rdma_server_unfinished_t;
@snyjm-18 Does this look reasonable ? Also is there a ucp function call to do the RDMA ?
Looks mostly fine. A couple of small things: packed_rkey_max_size -> UCP_PACKED_RKEY_MAX_SIZE. There are some other style things that need to be changed: https://github.com/openucx/ucx/blob/master/doc/CodeStyle.md
For ucp_am_rdma_client_unfinished_t, is this to maintain a list of requests in between the request and response messages on the client side? If so, it all makes sense to me.
Yossi is probably the best person to ask about ucp functions to do the RDMA. You may want to look at ucp/rma/rma_send.c . It looks like ucp_rma_nonblocking(...) may be what you are looking for.
I have a first pass at the code, but it fails an assert in the AM server where it tries to find the endpoint of the client, in order to save context and reply. My code is here https://github.com/tjcw/ucx/blob/tjcw-am-rdma/src/ucp/core/ucp_am.c ; the assert message is
[c712f6n02:60277:0:60277] ucp_worker.h:308 Assertion `ep->worker == worker' failed: worker=0x100012220010 ep=0x10001bff0000 ep->worker=(nil)
and the piece of my code which trips the assert is
static ucs_status_t
ucp_am_rdma_handler(void *am_arg, void *am_data, size_t am_length,
                    unsigned am_flags)
{
    ucp_worker_h          worker   = (ucp_worker_h)am_arg;
    ucp_am_hdr_t         *hdr      = (ucp_am_hdr_t *)am_data;
    ucp_am_rdma_header_t *rdma_hdr = (ucp_am_rdma_header_t *)(hdr + 1);
    ucp_ep_h              ep       = ucp_worker_get_ep_by_ptr(worker,
                                                              rdma_hdr->ep);
@yosefe @snyjm-18 I could do with some help figuring out what I am doing wrong. I'm assuming the UCT active message buffer on the server will contain the 8-byte item passed as 'header' to uct_ep_am_short, followed by the payload as passed to uct_ep_am_short, but I haven't seen this documented so I may be wrong. Also, where does the output from 'ucs_print' go ?
It looks like the problem is that I didn't understand what ucp_worker_get_ep_by_ptr does. So @yosefe @snyjm-18 how does the server of an AM find the endpoint of the client, if it needs to reply ?
@tjcw that is what ucp_worker_get_ep_by_ptr does. There are examples of its use in the current ucp_am implementation, and also in eager_send.c for implementing ssend. I looked at your branch; where does rdma_hdr->ep get set before sending?
@snyjm-18 I have this code
unfinished->rdma_header.ep = (uintptr_t) ep ;
at line 652 of ucp_am.c, but this sets the 'ep' field to the server's endpoint in the client address space, which is wrong. It needs to be the client's endpoint pointer in the server address space. How do I find this value ?
reply_hdr->ep_ptr = ucp_request_get_dest_ep_ptr(req);
on the client and then
reply_ep = ucp_worker_get_ep_by_ptr(worker, hdr->ep_ptr);
on the server.
I've made the change for
reply_hdr->ep_ptr = ucp_request_get_dest_ep_ptr(req);
but ucp_request_get_dest_ep_ptr is coming back with '0', and so in the AM server I am failing the assert at line 306. What should be setting up the value in the 'req' ?
Just try
status = ucp_ep_resolve_dest_ep_ptr(ep, ep->am_lane);
on the client side before calling ucp_am_send_req. It's used line 455 in ucp_am.c
It looks like you are already doing that. I can take a closer look tonight.
My code looks like
ucp_am_send_req_init(req, ep, &(unfinished->rdma_header),
                     UCP_DATATYPE_CONTIG, sizeof(ucp_am_rdma_header_t),
                     flags, id);
status = ucp_ep_resolve_dest_ep_ptr(ep, ep->am_lane);
if (ucs_unlikely(status != UCS_OK)) {
    ret = UCS_STATUS_PTR(status);
    goto out;
}
length                             = ucp_dt_iov_length(iovec, count);
unfinished->rdma_header.total_size = length;
unfinished->rdma_header.msg_id     = req->send.am.message_id;
unfinished->rdma_header.ep_ptr     = ucp_request_get_dest_ep_ptr(req);
( https://github.com/tjcw/ucx/blob/tjcw-am-rdma/src/ucp/core/ucp_am.c#L642 ), but I'm still getting ep_ptr being 0. What could make ucp_ep_resolve_dest_ep_ptr not do its job but still come back with UCS_OK ?
Does the worker progress function need to be called a few times for the wireup to complete and the destination ep_ptr to become valid ? In my case I'm trying to do the AM-over-RDMA as the first application communication to be done between these communication partners. I'll try making a modification to the active message 'gtest' to flow a test active message over RDMA, to give you a test case.
That is a question for Yossi, but I think if wireup is not completed, it may not know the ep_ptr yet.
@snyjm-18 I have revised test_ucp_am::do_send_process_data_iov_test to add an AM-over-RDMA test. I'm running all the gtests at the moment; how can I run just the active message gtest ?
./gtest --gtest_filter=ucp_am*
I've figured out how to run just the ucp am tests
./gtest --gtest_filter=*ucp_am*
does it. Running this with my modified gtest fails the same assert that I reported above;
[ RUN ] dcx/test_ucp_am.send_process_iov_am/0
[1567782322.337650] [f8n02:10545:0] ucp_am.c:611 UCX WARN AM RDMA am_id=0
[1567782322.338459] [f8n02:10545:0] ucp_am.c:232 UCX WARN AM RDMA ucp_am_send_rdma_short header=0x000000000000003a
[1567782322.338469] [f8n02:10545:0] ucp_am.c:234 UCX WARN AM RDMA ucp_am_send_rdma_short payload=(total_size=64,msg_id=0x0,ep_ptr=0,am_id=0)
[1567782322.338483] [f8n02:10545:0] ucp_am.c:663 UCX WARN AM RDMA ucp_am_send_rdma_req ret=0x447f6ae8
[1567782322.339274] [f8n02:10545:0] ucp_am.c:232 UCX WARN AM RDMA ucp_am_send_rdma_short header=0x000000000000003a
[1567782322.339282] [f8n02:10545:0] ucp_am.c:234 UCX WARN AM RDMA ucp_am_send_rdma_short payload=(total_size=64,msg_id=0x0,ep_ptr=0,am_id=0)
[1567782322.339290] [f8n02:10545:0] ucp_am.c:593 UCX WARN AM RDMA callback request=0x447f6ae8 status=0
[1567782322.339354] [f8n02:10545:0] ucp_am.c:932 UCX WARN AM RDMA hdr=0x000000000000003a
[1567782322.339360] [f8n02:10545:0] ucp_am.c:935 UCX WARN AM RDMA ucp_am_rdma_handler rdma_hdr=(total_size=64,msg_id=0x0,ep_ptr=0,am_id=0)
[f8n02:10545:0:10545] ucp_worker.h:306 Assertion `ep != ((void *)0)' failed
==== backtrace ====
@yosefe what do I need to do so that the ep_ptr will be set non-zero when I need it ?
@snyjm-18 assuming we solve the ep_ptr problem, I'm expecting to run into another problem. At the moment, I issue a UCT active message call in the completion callback of the ucp_put_nb call. Previously, when I issued a UCP active message call in the completion callback of another UCP active message call, I got an assertion failure. Will I fail the same assertion again ? I can think of a way to avoid this; it involves changing the client application so that there are 2 calls to ucp_am: the first to cause the data to be transferred, and the second to issue the active message call in the server. The second call can be scheduled to be issued after the ucp_worker_progress call that delivers the completion callback returns. I'd prefer not to make this change to the client application, as it makes it diverge more from the way that IBM PAMI works.
In branch tjcw-am-rdma-deferred-test, I have moved the AM-over-RDMA tests after the AM tests. With this branch, I no longer get ep_ptr zero, but the test fails with
ucp_rkey.c:117 UCX ERROR feature flags UCP_FEATURE_RMA|UCP_FEATURE_AMO32|UCP_FEATURE_AMO64 were not set for ucp_init()
I'll try and find the place in gtest where I need to add UCP_FEATURE_RMA.
After a few more revisions, in branch tjcw-am-rdma-deferred-test the gtest is getting as far as trying to register the active message buffer in preparation for RDMA. I'm now getting
ib_md.c:273 UCX ERROR ibv_exp_reg_mr(address=0x200015dd0000, length=4629771061636907072, access=0xf) failed: Cannot allocate memory
which I'm not expecting; I'm setting up the call to ucp_mem_map with
map_params.field_mask = UCP_MEM_MAP_PARAM_FIELD_ADDRESS |
                        UCP_MEM_MAP_PARAM_FIELD_LENGTH;
map_params.address    = all_data + 1;
map_params.length     = rdma_hdr->total_size;
status = ucp_mem_map(worker->context, &map_params, &memh);
and I have previously displayed that rdma_hdr->total_size is '64'. So what is the problem here ?
Altogether I now have 2 problems: branch tjcw-am-rdma, where "./gtest --gtest_filter=ucp_am" fails with ep_ptr being zero, and branch tjcw-am-rdma-deferred-test, where "./gtest --gtest_filter=ucp_am" hangs after trying to register an impossibly long memory region.
Both the 'gtest' and my application fail the assert with "Assertion `ep != ((void *)0)' failed" in ucp/core/ucp_worker.h:306 . @yosefe , what do I have to do so that the ep_ptr is set in the AM client so that it can be passed to the server ?
@tjcw please see the comment inside ucp_request_get_dest_ep_ptr
it means you need to call this function again inside progress function to update rdma_header
@yosefe How do I get called again inside the progress function ?
The request should be added to the pending queue, since the RTS send would return UCS_ERR_NO_RESOURCE because ucp_request_resolve_dest_ep_ptr() temporarily replaces the uct_ep with a wireup proxy ep. The pending callback should probe ucp_request_get_dest_ep_ptr() before sending the RTS and not cache it on the request. To put it simply, just get rid of the rdma_header field and build it on the fly.
I made that change; my sending function now looks like
static ucs_status_t ucp_am_rdma_contig_short(uct_pending_req_t *self)
{
    ucp_request_t *req    = ucs_container_of(self, ucp_request_t, send.uct);
    uintptr_t      ep_ptr = ucp_request_get_dest_ep_ptr(req);
    ucp_am_rdma_header_t *rdma_hdr =
        (ucp_am_rdma_header_t *)req->send.buffer;
    ucs_status_t status;

    ucs_warn("AM RDMA ucp_am_rdma_contig_short ep_ptr now=%lu", ep_ptr);
    rdma_hdr->ep_ptr = ep_ptr;
    status = ucp_am_send_rdma_short(req->send.ep, req->send.buffer);
    ucs_warn("AM RDMA ucp_am_send_rdma_short returns %d", status);
    if (ucs_likely(status == UCS_OK)) {
        ucp_request_complete_send(req, UCS_OK);
    }
    return status;
}
Now I'm getting past the problem of the initial send having no reply endpoint; but the RDMA write is coming back with an Infiniband error. The error message is
[1,0]<stderr>:[c712f6n01:62585:0:62585] ib_mlx5_log.c:139 Remote access on mlx5_0:1/IB (synd 0x13 vend 0x88 hw_synd 0/0)
[1,0]<stderr>:[c712f6n01:62585:0:62585] ib_mlx5_log.c:139 RC QP 0x3d2 wqe[4]: RDMA_WRITE --- [rva 0x11004d35dfab rkey 0x1400] [inl len 2]
and the backtrace is
#6 0x000010000310fb50 in ucs_log_dispatch (
file=0x1000120c0d00 "mlx5/ib_mlx5_log.c", line=<optimized out>,
function=0x1000120c0c70 <__FUNCTION__.15990> "uct_ib_mlx5_completion_with_err", level=<optimized out>,
format=0x1000120c0e88 "%s on %s:%d/%s (synd 0x%x vend 0x%x hw_synd %d/%d)\n%s QP 0x%x wqe[%d]: %s") at debug/log.c:191
#7 0x0000100012062388 in uct_ib_mlx5_completion_with_err (
iface=<optimized out>, ecqe=<optimized out>, txwq=<optimized out>,
log_level=<optimized out>) at mlx5/ib_mlx5_log.c:132
#8 0x00001000120839e8 in uct_rc_mlx5_iface_handle_failure (
ib_iface=0x10026a3a5e0, arg=0x1000174d0040, status=<optimized out>)
at rc/accel/rc_mlx5_iface.c:206
#9 0x00001000120638c8 in uct_ib_mlx5_check_completion (iface=0x10026a3a5e0,
cq=<optimized out>, cqe=0x1000174d0040) at mlx5/ib_mlx5.c:340
#10 0x0000100012085448 in uct_ib_mlx5_poll_cq (cq=0x10026a42cb8,
iface=0x10026a3a5e0)
at /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/opensrc/myucx/src/uct/ib/mlx5/ib_mlx5.inl:38
#11 uct_rc_mlx5_iface_poll_tx (iface=0x10026a3a5e0)
at rc/accel/rc_mlx5_iface.c:89
#12 uct_rc_mlx5_iface_progress (arg=0x10026a3a5e0)
at rc/accel/rc_mlx5_iface.c:124
#13 0x0000100003002c74 in ucs_callbackq_dispatch (cbq=<optimized out>)
at /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/opensrc/myucx/src/ucs/datastruct/callbackq.h:211
#14 uct_worker_progress (worker=<optimized out>)
@yosefe What is this error message telling me, and how do I make it not happen ?
@tjcw pls check that the va and rkey are correct and valid when the rdma write arrives. Also, why rdma write and not rdma read?
OK, I will check the va and rkey. On the sending side, I'm not calling ucp_mem_map to map the memory which needs to be sent with RDMA. @yosefe Do I need to do this for the IB adapter to be able to read the memory ? I was originally doing RDMA write because I expected to write 2 elements of the 'iovec' with one call to a function; IB supports an 'iovec' for local memory but not for remote memory. But the ucp function for this doesn't take an iovec parameter so this reason doesn't apply. Now, I'm doing an RDMA write because I will need to contact the 'other side' when the RDMA completes. I don't think I will be able to issue a uct active message from the RDMA completion handler callback (I have previously been unable to issue ucp active messages from AM completion handlers) so I will have to go back to the AM client and issue this call in the next 'progress' call.
Yes, for rdma_read you will need to do the equivalent of ucp_mem_map on the sender side. The tag matching protocol today uses rdma_read (by default) with a uct active message for notification. rdma_read can be better because it does not require a 2nd progress on the sender side to start the data transfer.
I got my application test case working. Now I'm trying to get my 'gtest' working. The first AM-via-RDMA (which connects the endpoints) works, but the second AM-via-RDMA (which runs over already-connected endpoints) fails with
ud_verbs.c:149 UCX ERROR Invalid am_short length: 170 (expected: <= 124)
This is in the AM server, where it is trying to reply to the AM client with a message which includes the 'rkey' for the memory region in the server where the AM data is to be placed. Presumably I'm supposed to use ucp_am_bcopy_single or ucp_am_zcopy_single, or something similar to these routines, rather than the 'short' function. @snyjm-18 @yosefe Which of these routines do I need to use, and is there any guidance for how to call them ? My code (with 'gtest', run it with 'gtest --gtest_filter=ucp_am') is here https://github.com/tjcw/ucx/tree/tjcw-am-rdma ( revision fc9ad54633dce56edb3434f62f442bd30fad45ff ) if you want to see what I am trying to do and the error that I get.
There are already some functions to help with this in ucp_am.c: it happens in ucp_am_send_req(), and then the decision is made within ucp_request_send_start().
I have revised my code to use a version of ucp_am_bcopy_single_reply for the 'reply' active message. Now, the dcx/test_ucp_am.send_process_iov_am/0 test runs, but the ud/test_ucp_am.send_process_iov_am/0 test hangs. I get messages
[1568648885.845669] [f8n02:4888 :0] mpool.c:43 UCX WARN object 0x4f9a6a80 was not returned to mpool ucp_requests
[1568648885.845673] [f8n02:4888 :0] mpool.c:43 UCX WARN object 0x4f9a6b80 was not returned to mpool ucp_requests
common/test.cc:269: Failure
Failed
Got 69 warnings during the test
[ FAILED ] dcx/test_ucp_am.send_process_iov_am/0, where GetParam() = dc_x (238 ms)
from the first test; I presume 'gtest' thinks the test fails because I am using 'ucs_warn' to issue trace messages. The 'not returned to mpool' messages are related to the 2 'req's that are allocated to support the initial uct AMs from client to server. I think I am returning these to the pool; in one case (when endpoints aren't bound) by calling 'ucp_request_free' from the client-side callback, and in the other case (when endpoints are bound and the client completes inline) by calling ucp_request_put from ucp_am_rdma_send_req . @snyjm-18 I'd appreciate an opinion as to why UCX thinks I am not freeing these. Here is the run log of the 'gtest'. gtest.txt My code is here https://github.com/tjcw/ucx/tree/tjcw-am-rdma commit 1ff359ae23189469ea2efa4a53c62363b0ca5199 if you want to run the 'gtest' yourself.
I can look later this week.
I have plugged some leaks, but I still get messages about request structures not being freed. I think the problem is that I don't understand the 'flags' in a request; what is supposed to set them, and which functions take what action when they are set. @yosefe how do I turn on trace so that 'ucs_trace_req' (UCS_LOG_LEVEL_TRACE_REQ) messages will be displayed ? I think with that trace on I might understand better how the req flags work. Also, I'm currently sending a packed rkey in an AM from the server to the client. I do this by copying the result of ucp_rkey_pack into a fixed-size array in the AM data. One of my gtest tests sets the packed rkey size to 159; @yosefe is there a limit to the packed rkey size, and if so what is it ? My current commit is 14d4cb9f2a3c8489850a2130afe7d3ebbbe697bb .
I turned trace on with "export UCX_LOG_LEVEL=req", and got the following trace from running gtest. gtest.txt Hunting through this for a leaked 'req' gives
[f8n02][/smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/opensrc/myucx/test/gtest]> fgrep 0x1b396f80 gtest.log.3
[1568723281.072152] [f8n02:77178:0] ucp_am.c:702 UCX REQ allocated request 0x1b396f80
[1568723281.072154] [f8n02:77178:0] ucp_am.c:707 UCX WARN ucp_am_rdma_send_nb req=0x1b396f80
[1568723281.072175] [f8n02:77178:0] ucp_request.inl:95 UCX REQ completing send request 0x1b396f80 (0x1b397068) ------- Success
[1568723281.072178] [f8n02:77178:0] ucp_am.c:579 UCX REQ releasing send request 0x1b396f80, returning status Success
[1568723281.072180] [f8n02:77178:0] ucp_request.inl:85 UCX REQ put request 0x1b396f80
[1568723281.122085] [f8n02:77178:0] mpool.c:43 UCX WARN object 0x1b396f80 was not returned to mpool ucp_requests
[f8n02][/smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/opensrc/myucx/test/gtest]>
It looks to me as though this 'req' is being allocated and then freed; so @yosefe, why is UCX indicating that the 'req' was not returned to the mpool ?
Running 'gtest' with "export UCX_LOG_LEVEL=req" seems to show that the reqs are leaked when they are used for ucp_put_nb. Searching for a couple of leaked addresses in a trace shows
[f8n02][/smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/opensrc/myucx/test/gtest]> egrep "0x20708f00|0x20709000" gtest.log.5
[1568724951.562096] [f8n02:81196:0] ucp_am.c:700 UCX REQ allocated request 0x20709000
[1568724951.562098] [f8n02:81196:0] ucp_am.c:705 UCX TRACE ucp_am_rdma_send_nb req=0x20709000
[1568724951.562541] [f8n02:81196:0] ucp_request.inl:95 UCX REQ completing send request 0x20709000 (0x207090e8) ----c-- Success
[1568724951.562549] [f8n02:81196:0] ucp_request.c:77 UCX REQ free request 0x20709000 (0x207090e8) ----c--
[1568724951.562551] [f8n02:81196:0] ucp_request.inl:85 UCX REQ put request 0x20709000
[1568724951.562707] [f8n02:81196:0] rma_send.c:184 UCX REQ allocated request 0x20709000
[1568724951.562721] [f8n02:81196:0] ucp_request.c:195 UCX REQ req 0x20709000: mem dereg buffer 0/1 md_map 0x0
[1568724951.562725] [f8n02:81196:0] ucp_request.inl:95 UCX REQ completing send request 0x20709000 (0x207090e8) ------- Success
[1568724951.562730] [f8n02:81196:0] rma.inl:24 UCX REQ releasing send request 0x20709000, returning status Success
[1568724951.563222] [f8n02:81196:0] ucp_am.c:700 UCX REQ allocated request 0x20708f00
[1568724951.563225] [f8n02:81196:0] ucp_am.c:705 UCX TRACE ucp_am_rdma_send_nb req=0x20708f00
[1568724951.563266] [f8n02:81196:0] ucp_request.inl:95 UCX REQ completing send request 0x20708f00 (0x20708fe8) ------- Success
[1568724951.563271] [f8n02:81196:0] ucp_am.c:577 UCX REQ releasing send request 0x20708f00, returning status Success
[1568724951.563274] [f8n02:81196:0] ucp_request.inl:85 UCX REQ put request 0x20708f00
[1568724951.563472] [f8n02:81196:0] rma_send.c:184 UCX REQ allocated request 0x20708f00
[1568724951.563484] [f8n02:81196:0] ucp_request.c:195 UCX REQ req 0x20708f00: mem dereg buffer 0/1 md_map 0x0
[1568724951.563489] [f8n02:81196:0] ucp_request.inl:95 UCX REQ completing send request 0x20708f00 (0x20708fe8) ------- Success
[1568724951.563493] [f8n02:81196:0] rma.inl:24 UCX REQ releasing send request 0x20708f00, returning status Success
[1568724951.580135] [f8n02:81196:0] mpool.c:43 UCX WARN object 0x20708f00 was not returned to mpool ucp_requests
[1568724951.580137] [f8n02:81196:0] mpool.c:43 UCX WARN object 0x20709000 was not returned to mpool ucp_requests
[f8n02][/smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/opensrc/myucx/test/gtest]>
Looking at file rma.inl shows
static UCS_F_ALWAYS_INLINE ucs_status_ptr_t
ucp_rma_send_request_cb(ucp_request_t *req, ucp_send_callback_t cb)
{
    ucs_status_t status = ucp_request_send(req, 0);

    if (req->flags & UCP_REQUEST_FLAG_COMPLETED) {
        ucs_trace_req("releasing send request %p, returning status %s", req,
                      ucs_status_string(status));
        ucs_mpool_put(req);
        return UCS_STATUS_PTR(status);
    }
which looks correct: the 'req' is returned to the pool with ucs_mpool_put. So why is the 'req' leaking ?
I found a problem; I was reusing a 'req', and ended up freeing it twice. I'm working on fixing that; the current problem is a hang. But I think the problem is in my code at the moment.
I have my code functional now, both for the application and the 'gtest'. As an aside, the 'gtest' for active messages doesn't actually test that the callback works; it relies on ucp_request_check_status to determine whether the AM is still in progress. Next to explore whether the code performs the way I want for nonblocking MPI collective calls, and to clean the code up so that it conforms with UCX coding guidelines. I'm currently at commit 6d80414d071d85501828472023dd1d2691cb66eb .
I'm doing some more testing. Trying active message over RDMA and using ucp_put_nb for a 1-byte memory region, I find that often the target byte is not written with the correct value. @yosefe are there any restrictions on the lengths that can be carried correctly by ucp_put_nb ? On doing more testing, I'm finding that the target address of the ucp_put_nb is always being left unmodified. I also frequently get messages from 'glibc' indicating that the heap has been corrupted, suggesting that the data is being written to a different address from that specified in the ucp_put_nb (or possibly being deferred in such a way that the 'completion' uct AM overtakes the RDMA put).
After more testing, and some fixing of my code, I think there is a problem with the in-memory transport when doing an RDMA 'put'. I have added a checker in the server to look at the first byte and the last byte of the memory block transferred; this checker is telling me that the first byte is not what it should be. The assert message is
[1,2]<stderr>:[c712f6n01:45008:0:45008] ucp_am.c:1368 Assertion `payload_data_first == unfinished->iovec_1_first_byte' failed
and the backtrace is
(gdb) where
#0 0x000010000062f568 in pause () from /lib64/libc.so.6
#1 0x00001000030faae4 in ucs_debug_freeze () at debug/debug.c:710
#2 0x00001000030fe5f0 in ucs_error_freeze (message=0x3fffc617d2a0 "Assertion `payload_data_first == unfinished->iovec_1_first_byte' failed") at debug/debug.c:829
#3 ucs_handle_error (message=0x3fffc617d2a0 "Assertion `payload_data_first == unfinished->iovec_1_first_byte' failed") at debug/debug.c:992
#4 0x00001000030fa898 in ucs_fatal_error_message (file=0x10000304a118 "core/ucp_am.c", line=<optimized out>, function=<optimized out>,
message_buf=0x3fffc617d2a0 "Assertion `payload_data_first == unfinished->iovec_1_first_byte' failed") at debug/assert.c:33
#5 0x00001000030faa64 in ucs_fatal_error_format (file=0x10000304a118 "core/ucp_am.c", line=<optimized out>,
function=0x100003049db0 <__FUNCTION__.14370> "ucp_am_rdma_completion_handler", format=<optimized out>) at debug/assert.c:49
#6 0x0000100002fd8998 in ucp_am_rdma_completion_handler (am_arg=0x100021390010, am_data=<optimized out>, am_length=<optimized out>, am_flags=<optimized out>)
at core/ucp_am.c:1368
#7 0x0000100003095edc in uct_iface_invoke_am (flags=0, length=24, data=<optimized out>, id=<optimized out>, iface=0x1000213f0680)
at /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/opensrc/myucx/src/uct/base/uct_iface.h:628
#8 uct_mm_iface_invoke_am (flags=0, length=24, data=<optimized out>, am_id=<optimized out>, iface=0x1000213f0680) at sm/mm/base/mm_iface.h:151
#9 uct_mm_iface_process_recv (iface=0x1001e653c10, elem=0x1000213f0680) at sm/mm/base/mm_iface.c:208
#10 0x00001000030961dc in uct_mm_iface_poll_fifo (iface=0x1001e653c10) at sm/mm/base/mm_iface.c:254
#11 uct_mm_iface_progress (arg=0x1001e653c10) at sm/mm/base/mm_iface.c:278
#12 0x0000100002ff5904 in ucs_callbackq_dispatch (cbq=<optimized out>) at /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/opensrc/myucx/src/ucs/datastruct/callbackq.h:211
#13 uct_worker_progress (worker=<optimized out>) at /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/opensrc/myucx/src/uct/api/uct.h:2203
#14 ucp_worker_progress (worker=0x100021390010) at core/ucp_worker.c:1892
#15 0x0000100002ccfa9c in LibColl::Adapter::UCXContext::advance (this=0x1001e9ad580) at ../../adapter/ucx/UCXContext.h:334
#16 0x0000100002cb9318 in LIBCOLL_Advance (ctxt=<optimized out>) at libcoll.cc:161
#17 0x0000100002ee2748 in blocking_coll (context=0x1001e9ad580, coll=0x3fffc618f590, active=0x3fffc618f74c) at ../collsel/init_util.h:130
#18 0x0000100002eefbe4 in LibColl::AdvisorTable::generate (this=0x1001ebce8b0, filename=0x0, params=0x3fffc618fbd0, ops=0x100002bf17b0 <external_geometry_ops>, mode=<optimized out>) at ../collsel/AdvisorTable.h:598
#19 0x0000100002ee59f0 in LibColl::CollselExtension::Collsel_table_generate (advisor=<optimized out>, filename=0x0, params=<optimized out>, ops=<optimized out>,
mode=<optimized out>) at ../collsel/CollselExtension.cc:95
#20 0x0000100002ee5a48 in Collsel_table_generate (advisor=<optimized out>, filename=<optimized out>, params=<optimized out>, ops=<optimized out>, mode=<optimized out>)
at CollselExtension.cc:26
#21 0x0000100002bcc4f0 in mca_coll_ibm_tune () from /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/exports/optimized/lib/spectrum_mpi/mca_coll_ibm.so
#22 0x0000100002ba4550 in mca_coll_ibm_hook_at_mpi_init_bottom () from /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/exports/optimized/lib/spectrum_mpi/mca_coll_ibm.so
#23 0x00001000001cb120 in ompi_hook_base_mpi_init_bottom () from /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/exports/optimized/lib/libmpi_ibm.so.3
#24 0x0000100000149ebc in ompi_mpi_init () from /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/exports/optimized/lib/libmpi_ibm.so.3
#25 0x000010000017f8f8 in PMPI_Init () from /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/exports/optimized/lib/libmpi_ibm.so.3
#26 0x0000000010000888 in main ()
The test is failing close to the beginning of the application run, so it is likely that it is failing on a 1-byte ucp_put_nb. Tomorrow I will turn on trace and report more about what is happening. @yosefe are there any known issues with in-memory transport ? When I configured the test to run with 2 processes on different nodes (over IB), it ran cleanly.
I turned on UCX logging to investigate. The failing transfer was a 1-byte 'ucp_put_nb' over the in-memory (knem) transport. The data was either not written, or written to a different address than that specified by the ucp_put_nb. The ucp_put_nb returned with result UCS_OK, indicating that the 'put' had been done synchronously and ucp_put_nb thought it had succeeded. The receiving side had a memory registration for 33 bytes based 32 bytes below the RDMA target address, and the sending side had a memory registration for 1 byte at the RDMA source address. This is using 'knem' on IBM POWER8. I will adjust my test case to find whether in-memory transfers for larger than 1 byte are working.
I have a test case which calls MPI_Ibcast to do a nonblocking broadcast of a 32MB buffer around 4 ranks on 4 nodes, over Infiniband. I wait for completion of the MPI_Ibcast with a loop which does some floating-point work and then calls MPI_Testall. I am using an IBM proprietary collectives library which works by sending active messages, which can run over either IBM PAMI or UCP. By comparison with the IBM PAMI implementation, the UCP implementation requires about 10x as many calls to 'advance' (ucp_worker_progress). This results in a 10x longer elapsed time to complete the broadcasts. I am running with a 'release' build of UCX from the current master branch. Is UCP breaking the buffer into short segments, and requiring a call to ucp_worker_progress for each segment ? Here run_abbrev.log is a trace of the run; if you search for 'advancecount' you will see a count of the number of calls to ucp_worker_progress at each stage.