@boeschf @bettiolm The message does not have any concept of alignment, since I think that can be put in the Allocator. We should have a default allocator that is not std::allocator, but one that handles alignment and whatever else is needed. For the MPI allocator, MPI_Alloc_mem may be the default. What else do you need to be able to use it?
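As a sketch of what such a default could look like (the 64-byte alignment is an assumption, not a decided value; an MPI flavour would presumably call MPI_Alloc_mem/MPI_Free_mem instead of operator new):

#include <cstddef>
#include <new>

// minimal sketch of an alignment-aware default allocator (C++17);
// the name and the 64-byte default are illustrative only
template <class T, std::size_t Alignment = 64>
struct aligned_allocator {
    using value_type = T;

    T* allocate(std::size_t n) {
        return static_cast<T*>(
            ::operator new(n * sizeof(T), std::align_val_t{Alignment}));
    }
    void deallocate(T* p, std::size_t) noexcept {
        ::operator delete(p, std::align_val_t{Alignment});
    }
};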
@mbianco Just a quick comment: for my "multicast" messages I would need a reference counter in the message, so that I know when it can be released. In my prototype I increased the ref counter in the send method, and decreased it whenever a corresponding request finished. I also implemented a "multicast" comm object, which sent the message to all destination ranks, but I guess the user can achieve that by simply sending the same message to many destinations?
Another thing with the multicast is that I don't know how you want to handle the waiting-for-completion future. I guess you would then need an array of the communication requests that use the given message?
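One possible shape for that, assuming a future-per-request model (multicast, future_type and all names below are hypothetical, not the current API):

#include <vector>

// sketch: multicast as a loop over destinations; waiting for completion
// then means waiting on all returned request handles
template <class Communicator, class Message>
std::vector<typename Communicator::future_type>
multicast(Communicator& comm, Message& msg, const std::vector<int>& ranks, int tag) {
    std::vector<typename Communicator::future_type> futures;
    futures.reserve(ranks.size());
    for (int r : ranks)
        futures.push_back(comm.send(msg, r, tag)); // each send bumps msg's ref count
    return futures; // caller waits on all of them (or relies on callbacks)
}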
What would the reference count be in the context of this layer? I will try something; I think I have an idea for preserving performance.
@mbianco I see you added a shared message. Thanks! I am still not sure I fully understand what your execution model looks like, and what is the user's responsibility versus ours. But let me write a very short pseudo-code to explain what I was thinking about.
The task scheduler runs the following loop:
// post all recv requests, which are required by the tasks this rank owns
...
while not end of simulation
    // progress the comm and (in the callbacks) mark tasks as ready to be executed
    do
        communicator.progress()
    while no ready tasks
    // process a task that can be executed
    task = pick_a_ready_task()
    task.compute()
    // schedule the task 'send' communication
    task.send_to_vnbors()
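For concreteness, a minimal C++ rendering of that loop might look like this; Communicator and ReadyQueue are placeholder template parameters, not the actual API:

// hedged sketch of the scheduler loop; all names are placeholders
template <class Communicator, class ReadyQueue>
void scheduler_loop(Communicator& comm, ReadyQueue& ready, bool& end_of_simulation) {
    // (recv requests for the tasks this rank owns were posted beforehand)
    while (!end_of_simulation) {
        // progress the comm; completion callbacks mark tasks as ready
        do {
            comm.progress();
        } while (ready.empty());
        auto task = ready.pop();    // process a task that can be executed
        task.compute();
        task.send_to_vnbors(comm);  // schedule the task 'send' communication
    }
}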
Before the loop I already know the <rank,tag> pairs of the current recv operations. Those should be posted, and re-posted after completion (persistent recv requests). Unless the tasks migrate between ranks, or are canceled, they will always be there, and always connected to the same message. In my prototype I re-posted in the progress function, before I called the notification callback. With the persistent MPI requests you've found, this also needs to be done: they need to be 'started' every time a request completes. In principle, the user could do it in the recv completion callback, but IMO it is much easier done inside our communicator layer. Please let me know what you think about this.
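With plain MPI persistent requests, the pattern looks roughly like this (one request per known <rank,tag> pair; the buffer size and callback hook are illustrative only):

#include <mpi.h>

// minimal sketch of persistent-recv handling
void setup_and_progress(int src_rank, int tag, void (*on_recv)(const char*)) {
    static char buf[1024];
    MPI_Request req;
    // set up the persistent recv once, before the scheduler loop
    MPI_Recv_init(buf, (int)sizeof(buf), MPI_CHAR, src_rank, tag,
                  MPI_COMM_WORLD, &req);
    MPI_Start(&req);

    // the progress step: test for completion, notify (unpack), then re-start
    int flag = 0;
    MPI_Test(&req, &flag, MPI_STATUS_IGNORE);
    if (flag) {
        on_recv(buf);    // callback reads/unpacks the buffer first
        MPI_Start(&req); // persistent requests must be re-started each time
    }
}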
The send_to_vnbors function will each time create a new message, fill it with the task data (pack) and send it to multiple ranks:
function send_to_vnbors
    mesg = new message
    pack(mesg, task)
    for nbor in task.nbors
        communicator.send(mesg, nbor, task.id)
In my case the messages are used by several send requests, and must live longer than the scope of the function that submits the comm requests, because I don't wait for their completion. Hence, to de-allocate a message I need to count how many active send requests are using it, and I need a dynamically allocated message object. I see you've already implemented this. Going back to your question on what should be counted by this layer: in my execution model the message ref counter was increased inside each send, and decreased after an associated request was completed (in my progress function).
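A minimal sketch of that counting, assuming an atomic counter inside a dynamically allocated message (shared_message here is a placeholder, not necessarily the class already committed):

#include <atomic>
#include <vector>

// ref-counted message; released when the last comm request completes
struct shared_message {
    std::vector<char> buffer;
    std::atomic<int>  use_count{0};
};

// send(): take a reference before submitting the comm request
void send(shared_message& m /*, rank, tag, ... */) {
    m.use_count.fetch_add(1, std::memory_order_relaxed);
    // ... submit the actual comm request here ...
}

// progress(): on completion, drop the reference; the last one releases
void on_request_completed(shared_message* m) {
    if (m->use_count.fetch_sub(1, std::memory_order_acq_rel) == 1)
        delete m; // usage count reached zero: release the message
}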
To make it simpler to receive the data, it would be good if the recv request completion notification callback took as an argument the message with the payload. The <rank,tag> pair is not necessarily enough to identify a message: you also need the particular comm request id, which will not be available to the user. I would pass the message. Also, in the 'thin' layer we've discussed, I need the message pointer to check its ref count and release the buffers once all comm requests have completed.
Another thing is the user_data, which in such cases is usually a void *. The callback is a global function, and the user_data is used to provide it with a local context, e.g., the task object for which the send operation was performed.
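Putting the two points together, the callback could have roughly this shape; every name below is hypothetical, and the exact signature is what we would need to agree on:

// hypothetical callback shape: the completed message plus an opaque user
// context pointer, in the classic C style
struct shared_message; // the ref-counted message from the earlier sketch
struct my_task;        // whatever local context the user wants back

using completion_cb = void (*)(shared_message* msg, int rank, int tag, void* user_data);

// registration sketch: one callback per <rank,tag>, carrying its context
void register_callback(int rank, int tag, completion_cb cb, void* user_data);

// example: the task that issued the send travels back through user_data
void on_send_done(shared_message* msg, int /*rank*/, int /*tag*/, void* user_data) {
    my_task* t = static_cast<my_task*>(user_data); // recover the local context
    // drop the message reference here; release it when the count hits zero
    (void)msg; (void)t;
}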
Please let me know what you think, and where you imagine the above should be implemented.
@angainor I'm working on it. What is not clear to me is the receiving side. I cannot repost a recv if the message has not been read yet. I think the pick_a_ready_task() function does the unpack, and it is the task that will be used next, not the message. Is that right?
@mbianco
I cannot repost a recv if the message has not been read yet. I think the pick_a_ready_task() function does unpack, and it is the task that will be used next, not the message. Is that right?
In my prototype I do the unpack inside the callback. Hence I can repost in the progress function, after I call the callback. My progress function looks like this:
for each completed request
    call callback (if user registered it)
    if persistent request
        re-post
    else
        decrease usage counter on the message
        if message.usage_count == 0 release message
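In C++ with MPI requests and the ref-counted message and callback sketched earlier, that step could look roughly like this (pending and its fields are placeholders):

#include <mpi.h>
#include <vector>

// one outstanding comm operation: ties the MPI request to its message,
// callback and user context (all names are placeholders)
struct pending {
    MPI_Request     req;
    shared_message* msg;        // ref-counted message from the earlier sketch
    int             rank, tag;
    bool            persistent;
    completion_cb   cb;         // callback shape from the earlier sketch
    void*           user_data;
};

void progress(std::vector<pending>& reqs) {
    for (auto& p : reqs) {
        int flag = 0;
        MPI_Test(&p.req, &flag, MPI_STATUS_IGNORE);
        if (!flag) continue;                                // not completed yet
        if (p.cb) p.cb(p.msg, p.rank, p.tag, p.user_data);  // unpack happens here
        if (p.persistent)
            MPI_Start(&p.req);                  // re-post the persistent recv
        else if (p.msg->use_count.fetch_sub(1) == 1)
            delete p.msg;                       // usage count hit zero: release
        // (a real implementation would remove completed non-persistent
        // entries from 'reqs' so they are not tested again)
    }
}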
In your code you iterate through callbacks, not completed requests. Hence I understood that the release of the message has to be moved to a higher layer, and I will always have to register a callback in which I check the usage counter and release the message. For that I need the corresponding message in the callback.
The natural place to do the unpacking is the callback registered for the recv messages. That's actually the most important purpose of the callback architecture. But to do that, the callback needs the message structure (the buffer, really) to unpack.
The above scenarios show why I believe we should pass the message to the callback: to check the usage counter and release the message/buffer (send op), and to read the buffer (recv op).
@angainor Thanks! That helps a lot! I'll work on this (not sure about this week, I will be in a workshop and it depends on how bored I am in the evenings :) )
@mbianco I'm in the opposite regime here ;) I have the most time in the morning, around 7, when everyone is asleep. My window closes at ~08:30 ;)
I'm glad you're enjoying vacation :)
This PR is intended for you to look at and comment on while it is under development.