openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

ucp active messages #4018

Open tjcw opened 4 years ago

tjcw commented 4 years ago

In an active message call such as ucp_am_send_nb, the dispatch is triggered on receiving the entire message. Does this mean that if I am sending 16MB, the dispatch is triggered only once all 16MB has been received? Also, internally, does the implementation switch from FIFO mode to RDMA mode based on message size? @sssharka wants to know ... We are porting an application from IBM PAMI to UCX. PAMI has a feature where the server side of an active message can be driven before the complete data stream has been received at the server; the active message handler returns an indication of where the remaining data should be placed, plus another function to call when the complete message has been received. This feature can be used to place the data directly in the receiving application's data buffer.

snyjm-18 commented 4 years ago

Yes, the callback will only run once all the data has arrived, in this case 16MB. We discussed letting the callback be invoked on the arrival of each packet, but that has not been implemented. If you would like that feature as well as the informed data placement, we could discuss adding it, although I'm not entirely sure how to do so without changing the definition of ucp_am_callback_t.
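To make the "callback only after the whole message" behavior concrete, here is a minimal, self-contained C model of it. This is not UCX code: `reassembly_t`, `reassembly_on_fragment`, and `count_cb` are hypothetical names invented for illustration, and the staging buffer stands in for whatever the library allocates internally.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type: runs once, with the complete message. */
typedef void (*whole_msg_cb_t)(void *arg, const void *data, size_t length);

/* Hypothetical per-message reassembly state (not a UCX structure). */
typedef struct {
    unsigned char *staging;   /* library-owned reassembly buffer */
    size_t         total;     /* full message length             */
    size_t         received;  /* bytes received so far           */
    whole_msg_cb_t cb;        /* user callback                   */
    void          *cb_arg;    /* user argument for the callback  */
} reassembly_t;

/* Feed one fragment. Returns 1 if this fragment completed the message
 * and triggered the callback, 0 if more data is still expected. */
int reassembly_on_fragment(reassembly_t *r, size_t offset,
                           const void *frag, size_t frag_len)
{
    assert(offset + frag_len <= r->total);
    memcpy(r->staging + offset, frag, frag_len);
    r->received += frag_len;
    if (r->received < r->total) {
        return 0;             /* no dispatch yet */
    }
    r->cb(r->cb_arg, r->staging, r->total);
    return 1;
}

/* Example callback: counts invocations and records the length seen. */
typedef struct {
    int    calls;
    size_t last_length;
} cb_stats_t;

void count_cb(void *arg, const void *data, size_t length)
{
    cb_stats_t *stats = arg;
    (void)data;
    stats->calls++;
    stats->last_length = length;
}
```

With a 16MB message, the same logic would hold the callback back until the final fragment lands, and the data would sit in the staging buffer rather than in an application buffer.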

I don't entirely understand "FIFO mode" vs "RDMA mode".

yosefe commented 4 years ago

@tjcw If I understand the question correctly: there is no RDMA support in UCP AM today, only send/receive.

shamisp commented 4 years ago

@snyjm-18 I think with the stream API the callback is triggered on the arrival of each packet.

snyjm-18 commented 4 years ago

I believe so as well. That change is easy. The harder change is having all subsequent messages with that same ID placed into an application-specific buffer once the first callback has run.

sssharka commented 4 years ago

Let me first explain what we mean by FIFO vs RDMA. In PAMI, even a send can be implemented using RDMA: the header is sent first, the user callback sets the final destination in the dispatch, and then the whole message is delivered using RDMA. This is not the default, though. The default is to use packets (FIFO mode). On delivery of the first packet (which is header + data), the user callback can set the location of the final destination (the user buffer), so PAMI avoids intermediate copies and all data is delivered to the user buffer.
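The FIFO-mode flow described above can be sketched as a small self-contained C model. It is neither the PAMI nor the UCX API: every name here (`recv_target_t`, `am_stream_t`, `am_first_packet`, `am_place`, `app_dispatch`) is hypothetical, packets are assumed to arrive in order, and `memcpy` stands in for the transport placing payload bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void (*done_fn_t)(void *cookie);

/* Filled in by the dispatch callback when the first packet arrives. */
typedef struct {
    void     *dest;    /* where the payload should land        */
    done_fn_t done;    /* called once all bytes have been placed */
    void     *cookie;  /* passed back to done()                */
} recv_target_t;

typedef void (*dispatch_fn_t)(const void *header, size_t header_len,
                              size_t total_len, recv_target_t *target);

/* Hypothetical per-message state; assumes in-order packet delivery. */
typedef struct {
    dispatch_fn_t dispatch;
    recv_target_t target;
    size_t        total_len;
    size_t        placed;
} am_stream_t;

/* Place payload bytes at the current offset; fire done() on the last byte. */
void am_place(am_stream_t *s, const void *data, size_t len)
{
    assert(s->target.dest != NULL);
    assert(s->placed + len <= s->total_len);
    memcpy((unsigned char *)s->target.dest + s->placed, data, len);
    s->placed += len;
    if (s->placed == s->total_len) {
        s->target.done(s->target.cookie);
    }
}

/* First packet carries header + data: run the dispatch to learn the
 * destination, then place this packet's data straight into it. */
void am_first_packet(am_stream_t *s, const void *header, size_t header_len,
                     size_t total_len, const void *data, size_t len)
{
    s->total_len = total_len;
    s->placed    = 0;
    s->dispatch(header, header_len, total_len, &s->target);
    am_place(s, data, len);
}

/* Hypothetical application side. */
typedef struct {
    unsigned char buf[16];
    int           done_calls;
} app_t;

app_t g_app;

void app_done(void *cookie)
{
    ((app_t *)cookie)->done_calls++;
}

void app_dispatch(const void *header, size_t header_len,
                  size_t total_len, recv_target_t *target)
{
    (void)header;
    (void)header_len;
    assert(total_len <= sizeof(g_app.buf));
    target->dest   = g_app.buf;   /* final destination: user buffer */
    target->done   = app_done;
    target->cookie = &g_app;
}
```

The dispatch runs exactly once per message, and every later packet goes directly to the buffer it chose, which is the property that avoids the intermediate copy.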

snyjm-18 commented 4 years ago

That makes sense, thank you for the explanation. I think that would be difficult to implement currently. However, if it helps, you can keep the buffer allocated by UCP by returning UCS_INPROGRESS, so you don't necessarily have to copy out the contents of the message.
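As a rough illustration of that ownership hand-off, here is a self-contained C model. It deliberately does not use the UCX headers: `CB_OK` and `CB_INPROGRESS` are stand-ins for the real status codes, and `lib_deliver`, `lib_release`, and `keep_cb` are hypothetical names. The real release call and callback signature should be taken from the UCP headers.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-ins for the library's status codes (assumption, not UCX's enum). */
typedef enum { CB_OK = 0, CB_INPROGRESS = 1 } cb_status_t;

typedef cb_status_t (*msg_cb_t)(void *arg, void *data, size_t length);

int live_buffers = 0;   /* library buffers allocated and not yet freed */

/* Release a library-owned buffer; the application calls this itself
 * after a callback that returned CB_INPROGRESS. */
void lib_release(void *data)
{
    free(data);
    live_buffers--;
}

/* Deliver a fully received message to the callback. If the callback
 * returns CB_OK the library frees the buffer immediately; if it returns
 * CB_INPROGRESS the application keeps the buffer until lib_release(). */
void lib_deliver(msg_cb_t cb, void *arg, const void *payload, size_t length)
{
    void *data = malloc(length);
    assert(data != NULL);
    memcpy(data, payload, length);
    live_buffers++;
    if (cb(arg, data, length) == CB_OK) {
        lib_release(data);
    }
}

/* Example callback that keeps the buffer instead of copying it out. */
typedef struct {
    void  *kept;
    size_t length;
} holder_t;

cb_status_t keep_cb(void *arg, void *data, size_t length)
{
    holder_t *h = arg;
    h->kept   = data;
    h->length = length;
    return CB_INPROGRESS;
}
```

This saves the copy out of the library buffer, but the data still lands in a library-chosen location rather than in a buffer the application picked.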

shamisp commented 4 years ago

@sssharka So, as you dispatch the first fragment in software, the second fragment is cached somewhere in NIC memory, since you cannot really dispatch it yet? With PAMI over IB, how exactly do you avoid the intermediate copy? For the second packet to be processed on the NIC, there must be a pre-posted receive for both header and payload.

sssharka commented 4 years ago

There is no pre-posting in active messages; you only have the dispatch. Since the progress loop will not continue until the dispatch returns, the dispatch sets the destination. Any application layer, MPI in this example, will always have an early arrival buffer in case the receive has not been posted yet. In that case, the final destination will be the early arrival buffer.
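A minimal C sketch of the early arrival idea, under stated assumptions: one outstanding message per endpoint, a fixed-size early arrival buffer, and hypothetical names throughout (`endpoint_t`, `choose_destination`, `post_receive`). It models only the destination choice made inside the dispatch, not any transport.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define EA_CAPACITY 64   /* assumed size of the early arrival buffer */

typedef struct {
    void         *posted_buf;         /* NULL if no receive is posted */
    size_t        posted_len;
    unsigned char early[EA_CAPACITY]; /* early arrival buffer         */
    size_t        early_len;          /* bytes parked in early[]      */
} endpoint_t;

/* Called from the dispatch: pick the final destination for a message.
 * Use the posted receive buffer if there is one and it is big enough,
 * otherwise fall back to the early arrival buffer. */
void *choose_destination(endpoint_t *ep, size_t msg_len)
{
    if (ep->posted_buf != NULL && msg_len <= ep->posted_len) {
        return ep->posted_buf;
    }
    assert(msg_len <= EA_CAPACITY);
    ep->early_len = msg_len;
    return ep->early;
}

/* The application posts a receive. If a message already arrived early,
 * copy it out and return its length; otherwise remember the buffer for
 * the next dispatch and return 0. */
size_t post_receive(endpoint_t *ep, void *buf, size_t len)
{
    if (ep->early_len > 0) {
        size_t n = ep->early_len;
        assert(n <= len);
        memcpy(buf, ep->early, n);
        ep->early_len = 0;
        return n;
    }
    ep->posted_buf = buf;
    ep->posted_len = len;
    return 0;
}
```

Only the early arrival path costs an extra copy; when the receive is posted first, the dispatch returns the application buffer directly.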

shamisp commented 4 years ago

@sssharka But in this case you have to allocate a separate queue just for this? Also, it cannot be a shared queue, since you assume ordering.

snyjm-18 commented 4 years ago

@shamisp @sssharka @tjcw Should I implement a streaming version, where the callback is invoked for each segment? However, if you want to avoid multiple copies, you can just use the buffer allocated by UCP. The posting of an application buffer is a much larger undertaking.

sssharka commented 4 years ago

Multiple callbacks would actually have a negative effect. @shamisp the application layer needs to allocate the queue. The dispatch should be ready at all times, regardless of whether a receive has been posted.

shamisp commented 4 years ago

@sssharka So it can scale only with UD. Sounds like an RNDV protocol over UD. You have to send a message back to the initiator with the number of the queue? The overhead for this is going to be pretty bad?