Open tjcw opened 4 years ago
Yes, the callback will only run once all the data has arrived, in this case 16MB. We discussed letting the callback be invoked on the arrival of each packet, but that has not been implemented. If you would like that feature as well as the informed data placement, we could discuss adding it, although I'm not entirely sure how to do it without changing the definition of ucp_am_callback_t.
I don't entirely understand "FIFO mode" vs "RDMA mode".
@tjcw If I understand the question correctly, there is no RDMA support in UCP AM today. Only send/receive.
@snyjm-18 I think with the stream API the callback is triggered on arrival of each packet.
I believe so as well. That change is easy. The harder change is having all future messages with that same ID placed into an application-specific buffer once the first callback has run.
Let me first explain what we mean by FIFO vs RDMA. In PAMI, even the send can be implemented using RDMA: the header is sent first, the user callback sets the final destination in the dispatch, and then the whole message is delivered using RDMA. This is not the default, though. The default is using packets (FIFO mode). On delivery of the first packet (which is header+data), the user callback can set the location of the final destination (the user buffer), so PAMI avoids intermediate copies and all data is delivered to the user buffer.
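The FIFO-mode flow described above can be sketched as a small, self-contained simulation. This is not PAMI or UCX code: `recv_state_t`, `dispatch_fn`, `deliver_packet`, `user_dispatch`, and `demo_deliver` are hypothetical names made up for illustration, and the "network" is just a loop that slices a message into packets. The only point it tries to show is that the dispatch runs once, on the first packet, and every packet is then copied straight into the destination it chose.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical receive-side state; not a PAMI or UCX type. */
typedef struct {
    int    started;   /* nonzero once the dispatch has run */
    char  *dest;      /* final destination chosen by the dispatch */
    size_t offset;    /* bytes placed so far */
    size_t total;     /* total payload length announced in the header */
} recv_state_t;

/* User dispatch: invoked once, on the first packet. It sees the header
 * (here just the total length) and returns where the payload should go. */
typedef char *(*dispatch_fn)(size_t total_len, void *arg);

/* Deliver one packet. The first packet triggers the dispatch; every packet,
 * including the first, is copied directly into the chosen destination.
 * Returns 1 when the whole message has been placed, 0 otherwise. */
static int deliver_packet(recv_state_t *st, dispatch_fn dispatch, void *arg,
                          size_t total_len, const char *payload, size_t len)
{
    if (!st->started) {
        st->dest    = dispatch(total_len, arg);
        st->offset  = 0;
        st->total   = total_len;
        st->started = 1;
    }
    assert(st->offset + len <= st->total);
    memcpy(st->dest + st->offset, payload, len);
    st->offset += len;
    return st->offset == st->total;
}

/* Example dispatch: place the data in the application's own buffer. */
static char *user_dispatch(size_t total_len, void *arg)
{
    (void)total_len;
    return (char *)arg;
}

/* Slice msg into packets of at most pkt bytes (pkt must be > 0) and deliver
 * them in order. Returns the number of packets delivered. */
static size_t demo_deliver(const char *msg, size_t msg_len, size_t pkt,
                           char *app_buf)
{
    recv_state_t st = {0};
    size_t sent = 0, packets = 0;
    int done = 0;

    while (!done) {
        size_t left = msg_len - sent;
        size_t len  = left < pkt ? left : pkt;
        done = deliver_packet(&st, user_dispatch, app_buf, msg_len,
                              msg + sent, len);
        sent += len;
        packets++;
    }
    return packets;
}
```

Calling `demo_deliver(msg, len, 5, app_buf)` places the message in `app_buf` with no staging copy, which is the behavior the PAMI description is after. The sketch assumes in-order packet delivery on a single stream.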
That makes sense, thank you for the explanation. I think that would be difficult to implement currently. However, if it helps, you can keep the buffer allocated by UCP by returning UCS_INPROGRESS, so you don't necessarily have to copy out the contents of the message.
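The ownership idea behind returning UCS_INPROGRESS can be illustrated with a mock that does not link against UCX. Everything here is a stand-in: `cb_status_t`, `am_cb_t`, `mock_deliver`, `keep_cb`, and `release_kept` are hypothetical names, and `CB_INPROGRESS` only plays the role of UCS_INPROGRESS. In real UCX the buffer would later be handed back through the library (I believe via `ucp_am_data_release`, but check the headers for your version) rather than with `free`.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-ins for UCS_OK / UCS_INPROGRESS; not the real UCX enum. */
typedef enum { CB_DONE, CB_INPROGRESS } cb_status_t;

/* Simplified callback shape; the real ucp_am_callback_t has more arguments. */
typedef cb_status_t (*am_cb_t)(void *arg, void *data, size_t length);

/* Mock library delivery: allocate the message buffer, invoke the callback,
 * and free the buffer unless the callback asked to keep it.
 * Returns 1 if the library freed the buffer, 0 if the callback kept it. */
static int mock_deliver(am_cb_t cb, void *arg, const char *msg, size_t length)
{
    void *data = malloc(length ? length : 1);
    assert(data != NULL);
    memcpy(data, msg, length);
    if (cb(arg, data, length) == CB_INPROGRESS) {
        return 0;           /* ownership moved to the application */
    }
    free(data);
    return 1;
}

/* Application-side record of a buffer it has kept. */
typedef struct {
    void  *data;
    size_t length;
} kept_msg_t;

/* Callback that keeps the library buffer instead of copying out of it. */
static cb_status_t keep_cb(void *arg, void *data, size_t length)
{
    kept_msg_t *k = arg;
    k->data   = data;
    k->length = length;
    return CB_INPROGRESS;
}

/* Hand the kept buffer back once the application is done with it. */
static void release_kept(kept_msg_t *k)
{
    free(k->data);
    k->data   = NULL;
    k->length = 0;
}
```

The callback records the pointer and returns immediately, so the message is never copied a second time. The cost is that the application has to remember to release every buffer it keeps.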
@sssharka So, as you dispatch the first fragment in software, the second fragment is cached somewhere in the NIC memory, since you cannot really dispatch it? With PAMI over IB, how exactly do you avoid the intermediate copy? For the second packet to be processed on the NIC there must be a pre-posted receive for both header and payload.
There is no pre-post in active messages. You only have the dispatch. But since the progress loop will not continue until the dispatch returns, the dispatch will set the destination. Any application layer, be it MPI in this example, will always have an early arrival buffer in case the receive is not posted yet. In that case, the final destination will be the early arrival buffer.
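A minimal sketch of that early arrival fallback, again as a standalone simulation rather than PAMI, MPI, or UCX code. `endpoint_t`, `choose_destination`, `post_receive`, and `EA_SIZE` are hypothetical names, and a single fixed-size early arrival buffer is an assumption made to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define EA_SIZE 64  /* assumed capacity of the early arrival buffer */

/* Hypothetical per-endpoint state kept by the application layer. */
typedef struct {
    char  *posted;           /* application buffer, NULL if no receive posted */
    char   early[EA_SIZE];   /* early arrival buffer */
    size_t early_len;        /* bytes parked in early[] */
} endpoint_t;

/* What the dispatch would do: pick the posted receive buffer if there is
 * one, otherwise fall back to the early arrival buffer. */
static char *choose_destination(endpoint_t *ep, size_t total_len)
{
    if (ep->posted != NULL) {
        return ep->posted;
    }
    assert(total_len <= EA_SIZE);
    ep->early_len = total_len;
    return ep->early;
}

/* Called when the application finally posts its receive. Any data parked
 * in the early arrival buffer is copied over once.
 * Returns the number of bytes drained from the early arrival buffer. */
static size_t post_receive(endpoint_t *ep, char *buf)
{
    size_t drained = ep->early_len;

    ep->posted = buf;
    if (drained > 0) {
        memcpy(buf, ep->early, drained);
        ep->early_len = 0;
    }
    return drained;
}
```

When the receive is already posted, the data goes straight to the application buffer. The single extra copy is paid only on the early arrival path.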
@sssharka But in this case you have to allocate a separate queue just for this? Also, it cannot be a shared queue since you assume order.
@shamisp @sssharka @tjcw Should I implement a streaming version, where the callback is invoked for each segment? However, if you want to avoid multiple copies, you can just use the buffer allocated by UCP. The posting of an application buffer is a much larger undertaking.
Multiple callbacks will actually have a negative effect. @shamisp the application layer needs to allocate the queue. The dispatch should be ready all the time, regardless of whether a receive has been posted.
@sssharka So it can scale only with UD. Sounds like an RNDV protocol over UD. You have to send a message back to the initiator with the number of the queue? The overhead for this is going to be pretty bad?
With an active message call such as ucp_am_send_nb, the dispatch is triggered on receiving the entire message. Does this mean that if I am sending 16MB, the dispatch will be triggered only when all 16MB has been received? Also, internally in the implementation, is there a switch from FIFO mode to RDMA mode based on message size? @sssharka wants to know ... We are porting an application from IBM PAMI to UCX. PAMI has a feature where the server side of an active message can be driven before the complete datastream is received at the server; the active message returns an indication of where the remaining data should be placed and another function to drive when the complete message is received. This feature can be used to place the data directly in the receiving application's data buffer.