@boeschf @bettiolm The message does not have any concept of alignment, since I think that can be put in the Allocator. We should have a default allocator that is not std::allocator, but one that handles alignment and whatever else is needed. For the MPI allocator, MPI_Alloc_mem may be the default. What else do you need to be able to use it?
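As a sketch of what such a default could look like (the 64-byte alignment is an assumption, not a decided value; an MPI flavour would presumably call MPI_Alloc_mem/MPI_Free_mem instead of operator new):

#include <cstddef>
#include <new>

// minimal sketch of an alignment-aware default allocator (C++17);
// the name and the 64-byte default are illustrative only
template <class T, std::size_t Alignment = 64>
struct aligned_allocator {
    using value_type = T;

    T* allocate(std::size_t n) {
        return static_cast<T*>(
            ::operator new(n * sizeof(T), std::align_val_t{Alignment}));
    }
    void deallocate(T* p, std::size_t) noexcept {
        ::operator delete(p, std::align_val_t{Alignment});
    }
};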
@mbianco Just a quick comment: for my "multicast" messages I would need a reference counter in the message, so that I know when it can be released. In my prototype I increased the ref counter in the send method, and decreased it whenever a corresponding request finished. I also implemented a "multicast" comm object, which sent the message to all destination ranks, but I guess the user can achieve that by simply sending the same message to many destinations?
Another thing with the multicast is that I don't know how you want to handle the waiting-for-completion future. I guess you would then need an array of the communication requests that use the given message?
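One possible shape for that, assuming a future-per-request model (multicast, future_type and all names below are hypothetical, not the current API):

#include <vector>

// sketch: multicast as a loop over destinations; waiting for completion
// then means waiting on all returned request handles
template <class Communicator, class Message>
std::vector<typename Communicator::future_type>
multicast(Communicator& comm, Message& msg, const std::vector<int>& ranks, int tag) {
    std::vector<typename Communicator::future_type> futures;
    futures.reserve(ranks.size());
    for (int r : ranks)
        futures.push_back(comm.send(msg, r, tag)); // each send bumps msg's ref count
    return futures; // caller waits on all of them (or relies on callbacks)
}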
What would the reference count be in the context of this layer? I will try something; I think I have an idea for preserving performance.
@mbianco I see you added a shared message. Thanks! I am still not sure I fully understand what your execution model looks like, and what is the user's responsibility versus ours. But let me write a very short pseudo-code to explain what I was thinking about.
The task scheduler runs the following loop:
// post all recv requests, which are required by the tasks this rank owns
...
while not end of simulation
    // progress the comm and (in the callbacks) mark tasks as ready to be executed
    do
        communicator.progress()
    while no ready tasks
    // process a task that can be executed
    task = pick_a_ready_task()
    task.compute()
    // schedule the task 'send' communication
    task.send_to_vnbors()
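For concreteness, a minimal C++ rendering of that loop might look like this; Communicator and ReadyQueue are placeholder template parameters, not the actual API:

// hedged sketch of the scheduler loop; all names are placeholders
template <class Communicator, class ReadyQueue>
void scheduler_loop(Communicator& comm, ReadyQueue& ready, bool& end_of_simulation) {
    // (recv requests for the tasks this rank owns were posted beforehand)
    while (!end_of_simulation) {
        // progress the comm; completion callbacks mark tasks as ready
        do {
            comm.progress();
        } while (ready.empty());
        auto task = ready.pop();    // process a task that can be executed
        task.compute();
        task.send_to_vnbors(comm);  // schedule the task 'send' communication
    }
}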
Before the loop I already know the <rank,tag> pairs of the current recv operations. Those should be posted, and re-posted after completion (persistent recv requests). Unless the tasks migrate between ranks, or are canceled, they will always be there, and always connected to the same message. In my prototype I re-posted in the progress function, before I called the notification callback. With the persistent MPI requests you've found, this also needs to be done: they need to be 'started' every time a request completes. In principle, the user could do it in the recv completion callback, but IMO it is much easier done inside our communicator layer. Please let me know what you think about this.
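With plain MPI persistent requests, the pattern looks roughly like this (one request per known <rank,tag> pair; the buffer size and callback hook are illustrative only):

#include <mpi.h>

// minimal sketch of persistent-recv handling
void setup_and_progress(int src_rank, int tag, void (*on_recv)(const char*)) {
    static char buf[1024];
    MPI_Request req;
    // set up the persistent recv once, before the scheduler loop
    MPI_Recv_init(buf, (int)sizeof(buf), MPI_CHAR, src_rank, tag,
                  MPI_COMM_WORLD, &req);
    MPI_Start(&req);

    // the progress step: test for completion, notify (unpack), then re-start
    int flag = 0;
    MPI_Test(&req, &flag, MPI_STATUS_IGNORE);
    if (flag) {
        on_recv(buf);    // callback reads/unpacks the buffer first
        MPI_Start(&req); // persistent requests must be re-started each time
    }
}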
The send_to_vnbors function will each time create a new message, fill it with the task data (pack) and send it to multiple ranks:
function send_to_vnbors
    mesg = new message
    pack(mesg, task)
    for nbor in task.nbors
        communicator.send(mesg, nbor, task.id)
In my case the messages are used by several send requests, and must live longer than the scope of the function that submits the comm requests, because I don't wait for their completion. Hence, to de-allocate a message I need to count how many active send requests are using it, and I need a dynamically allocated message object. I see you've already implemented this. Going back to your question on what should be counted by this layer: in my execution model the message ref counter was increased inside each send, and decreased after an associated request was completed (in my progress function).
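A minimal sketch of that counting, assuming an atomic counter inside a dynamically allocated message (shared_message here is a placeholder, not necessarily the class already committed):

#include <atomic>
#include <vector>

// ref-counted message; released when the last comm request completes
struct shared_message {
    std::vector<char> buffer;
    std::atomic<int>  use_count{0};
};

// send(): take a reference before submitting the comm request
void send(shared_message& m /*, rank, tag, ... */) {
    m.use_count.fetch_add(1, std::memory_order_relaxed);
    // ... submit the actual comm request here ...
}

// progress(): on completion, drop the reference; the last one releases
void on_request_completed(shared_message* m) {
    if (m->use_count.fetch_sub(1, std::memory_order_acq_rel) == 1)
        delete m; // usage count reached zero: release the message
}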
To make it simpler to receive the data, it would be good if the recv request completion notification callback took as an argument the message with the payload. The <rank,tag> pair is not necessarily enough to identify a message: you also need the particular comm request id, which will not be available to the user. I would pass the message. Also, in the 'thin' layer we've discussed, I need the message pointer to check its ref count and release the buffers once all comm requests have completed.
Another thing is the user_data, which in such cases is usually a void *. The callback is a global function, and the user_data is used to provide it with a local context, e.g., the task object for which the send operation was performed.
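Putting the two points together, the callback could have roughly this shape; every name below is hypothetical, and the exact signature is what we would need to agree on:

// hypothetical callback shape: the completed message plus an opaque user
// context pointer, in the classic C style
struct shared_message; // the ref-counted message from the earlier sketch
struct my_task;        // whatever local context the user wants back

using completion_cb = void (*)(shared_message* msg, int rank, int tag, void* user_data);

// registration sketch: one callback per <rank,tag>, carrying its context
void register_callback(int rank, int tag, completion_cb cb, void* user_data);

// example: the task that issued the send travels back through user_data
void on_send_done(shared_message* msg, int /*rank*/, int /*tag*/, void* user_data) {
    my_task* t = static_cast<my_task*>(user_data); // recover the local context
    // drop the message reference here; release it when the count hits zero
    (void)msg; (void)t;
}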
Please let me know what you think, and where you imagine the above should be implemented.
@angainor I'm working on it. What is not clear to me is the receiving side. I cannot repost a recv if the message has not been read yet. I think the pick_a_ready_task() function does the unpack, and it is the task that will be used next, not the message. Is that right?
@mbianco
I cannot repost a recv if the message has not been read yet. I think the pick_a_ready_task() function does unpack, and it is the task that will be used next, not the message. Is that right?
In my prototype I do the unpack inside the callback. Hence I can repost in the progress function, after I call the callback. My progress function looks like this:
for each completed request
    call callback (if user registered it)
    if persistent request
        re-post
    else
        decrease usage counter on the message
        if message.usage_count == 0 release message
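In C++ with MPI requests and the ref-counted message and callback sketched earlier, that step could look roughly like this (pending and its fields are placeholders):

#include <mpi.h>
#include <vector>

// one outstanding comm operation: ties the MPI request to its message,
// callback and user context (all names are placeholders)
struct pending {
    MPI_Request     req;
    shared_message* msg;        // ref-counted message from the earlier sketch
    int             rank, tag;
    bool            persistent;
    completion_cb   cb;         // callback shape from the earlier sketch
    void*           user_data;
};

void progress(std::vector<pending>& reqs) {
    for (auto& p : reqs) {
        int flag = 0;
        MPI_Test(&p.req, &flag, MPI_STATUS_IGNORE);
        if (!flag) continue;                                // not completed yet
        if (p.cb) p.cb(p.msg, p.rank, p.tag, p.user_data);  // unpack happens here
        if (p.persistent)
            MPI_Start(&p.req);                  // re-post the persistent recv
        else if (p.msg->use_count.fetch_sub(1) == 1)
            delete p.msg;                       // usage count hit zero: release
        // (a real implementation would remove completed non-persistent
        // entries from 'reqs' so they are not tested again)
    }
}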
In your code you iterate through callbacks, not completed requests. Hence I understood that the release of the message has to be moved to a higher layer, and I will always have to register a callback in which I check the usage counter and release the message. For that I need the corresponding message in the callback.
The natural place to do the unpacking is the callback registered for the recv messages. That's actually the most important purpose of the callback architecture. But to do that, the callback needs the message structure (the buffer, really) to unpack.
The above scenarios show why I believe we should pass the message to the callback: to check the usage counter and release the message/buffer (send op), and to read the buffer (recv op).
@angainor Thanks! That helps a lot! I'll work on this (not sure about this week, I will be in a workshop and it depends on how bored I am in the evenings :) )
@mbianco I'm in the opposite regime here ;) I have the most time in the morning, around 7, when everyone is asleep. My window closes at ~08:30 ;)
I'm glad you're enjoying vacation :)
This PR is intended for you to look at and comment on while it is under development.