hpc / quo-vadis

A cross-stack coordination layer to dynamically map runtime components to hardware resources
BSD 3-Clause "New" or "Revised" License

New thread interface #70

Open eleon opened 5 months ago

eleon commented 5 months ago

Fine-grained affinity for threads will use abstractions similar to those used for processes in QV.

Here are a few examples for OpenMP and POSIX threads. These assume the process already has a NUMA scope defined.

OpenMP example 1: Launch one thread per core.

/* Use defaults for thread color and num. threads */
int i;
int *th_color = NULL;
int th_size;
qv_scope_t **th_scope;

/* We want as many threads as cores */
qv_scope_th_split_at(ctx, numa_scope, QV_HW_OBJ_CORE, th_color, &th_scope, &th_size);

omp_set_num_threads(th_size);
#pragma omp parallel
{
  qv_bind_push(ctx, th_scope[omp_get_thread_num()]);

  /* Each thread works */
}

/* When we are done with all parallel regions using this layout */ 
for (i=0; i<th_size; i++)
  free(th_scope[i]);

OpenMP example 2: Launch two threads per core.

/* We want 2 threads per core */
int i, ncores, nchunks, th_size;
qv_scope_t **th_scope;

qv_scope_nobjs(ctx, numa_scope, QV_HW_OBJ_CORE, &ncores);
th_size = 2*ncores;
nchunks = ncores;

int *th_color = malloc(th_size * sizeof(int));
for (i=0; i<th_size; i++)
  th_color[i] = i % ncores;

qv_scope_th_split(ctx, numa_scope, nchunks, th_color, &th_scope, &th_size);

omp_set_num_threads(th_size);
#pragma omp parallel
{
  qv_bind_push(ctx, th_scope[omp_get_thread_num()]);

  /* Each thread works */ 
}

/* When we are done with all parallel regions using this layout */
for (i=0; i<th_size; i++)
  free(th_scope[i]);
free(th_color);

POSIX threads example:

void *thread_work(void *arg)
{
  /* Do the work */
  return NULL;
}

int main(int argc, char *argv[])
{
  /* Get my numa_scope here */

  /* Use defaults for thread color and num. threads */
  int i;
  int *th_color = NULL;
  int th_size;
  qv_scope_t **th_scope;

  /* We want as many pthreads as hardware threads */
  qv_scope_th_split_at(ctx, numa_scope, QV_HW_OBJ_PU, th_color, &th_scope, &th_size);

  pthread_t *thid = malloc(th_size * sizeof(pthread_t));
  pthread_attr_t *attr = NULL;   
  void *args = NULL;

  for (i=0; i<th_size; i++) 
    qv_pthread_create(&thid[i], attr, thread_work, args, ctx, th_scope[i]); 

  void *ret;
  for (i=0; i<th_size; i++)
    pthread_join(thid[i], &ret);

  /* Can launch more parallel regions here before releasing the scopes */ 

  for (i=0; i<th_size; i++)
    free(th_scope[i]);
  free(thid);

  return 0;
}

This issue is related to issues #68 and #69. We want a single qv_scope_split and a single qv_scope_split_at, i.e., no special qv_scope_th_split.

An actual Pthreads example can be found here: https://github.com/hpc/quo-vadis/blob/master/tests/test-progress-thread.c

GuillaumeMercier commented 5 months ago

What is proposed here simplifies the original design while avoiding the coexistence of two different paradigms (original QV + layouts). I think the current examples will need some marginal modifications in both the Pthreads and OpenMP cases. One potential issue is that qv_scope_get only accepts intrinsic scopes as an argument at the moment, and we might want to relax this. See my comment here.

eleon commented 5 months ago

Perhaps, we can discuss this in the next meeting?

I don't follow why qv_scope_get is needed in the example (I also need to understand the need for qv_thread_context_create). If a scope is already defined, e.g., mpi_numa_scope, why not use it directly?

GuillaumeMercier commented 5 months ago

> Perhaps, we can discuss this in the next meeting?

Yes, I think so.

> I don't follow why qv_scope_get is needed in the example (I also need to understand the need for qv_thread_context_create). If a scope is already defined, e.g., mpi_numa_scope, why not use it directly?

Please bear in mind that this example was written with the old "way" in mind. qv_thread_context_create was introduced to cache information and possibly reuse it later instead of recomputing it for each parallel section. IIRC, I needed a scope for each thread, hence the use of qv_scope_get. But maybe my example was too convoluted and could be simplified. Anyway, should all threads use the same context (as you do in your example)? Is that an issue?

eleon commented 5 months ago

Thanks for clarifying, @GuillaumeMercier . Yes, once we make more progress on the new thread interface, it would be nice to simplify the example :) I can't think of any issues with all threads using the same context.

eleon commented 4 months ago

New names should be:

qv_thread_scope_split(...)
qv_thread_scope_split_at(...)

GuillaumeMercier commented 3 months ago

I'm trying to code both routines and I've got a few questions:

  1. Why is th_size passed as a pointer? In which cases are split or split_at going to modify its value?
  2. No resource type is passed as an argument to split. Am I expected to retrieve this info from the scope? If so, how do I achieve this, since in the regular (i.e., non-thread) version of split a type is always passed as an argument?
eleon commented 3 months ago

@GuillaumeMercier , this was the first draft of the interface. The latest one is on Issues #68 and #69 (when we were discussing merging the process and thread interfaces). Please see below and let me know if you still have questions :)

eleon commented 3 months ago

int
qv_thread_scope_split(
    qv_context_t *ctx,
    qv_scope_t *scope,
    int npieces, 
    int *group_id,
    int nthreads, 
    qv_scope_t ***subscope
);

eleon commented 3 months ago

int
qv_thread_scope_split_at(
    qv_context_t *ctx,
    qv_scope_t *scope,
    qv_hw_obj_type_t type,
    int *group_id,
    int nthreads, 
    qv_scope_t ***subscope
);
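
For concreteness, a minimal usage sketch of the two proposed routines (hypothetical: the NULL group_id default and the variable names are assumptions, not settled API):

/* Hypothetical usage of the proposed interface: split a previously
   obtained numa_scope at core granularity among four threads.
   A NULL group_id meaning "use the default mapping" is assumed. */
int nthreads = 4;
int *group_id = NULL;
qv_scope_t **th_scope = NULL;

qv_thread_scope_split_at(ctx, numa_scope, QV_HW_OBJ_CORE,
                         group_id, nthreads, &th_scope);

/* th_scope[0..nthreads-1] now hold one subscope per thread. */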

GuillaumeMercier commented 3 months ago

@eleon: OK, thanks. But my second point still stands.

eleon commented 3 months ago

Indeed, @GuillaumeMercier. For now, I'd suggest starting with qv_thread_scope_split_at, which does have a type.

For qv_thread_scope_split, users do not have to think in terms of resource types, but in terms of how many pieces to split the input scope into. One could potentially implement qv_thread_scope_split on top of qv_thread_scope_split_at. Happy to talk more about this in the next call.

GuillaumeMercier commented 3 months ago

Yes, that's exactly my plan. But if you want to implement split on top of split_at, you need to somehow retrieve the type you're currently "working on", hence my second point.

eleon commented 3 months ago

Understood, @GuillaumeMercier. One possible way to implement split on top of split_at is to start with the NUMA type if there are enough NUMA nodes to satisfy the number of pieces. Otherwise, pick the next level down the memory hierarchy (e.g., L3) and so on, until the L1 cache/core level. We can discuss in detail on the next call :)

GuillaumeMercier commented 3 months ago

Hmm. Not sure about that. To me, the logical thing would be to split from the current level of the context. Otherwise, what is the point of having it as a parameter? Or am I missing something?

eleon commented 3 months ago

That makes sense, @GuillaumeMercier. Then, I would say start from the current level and go down the hierarchy until you match the number of pieces or reach the lowest level (PUs).
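
Sketched in code, that strategy could look like this (a hedged sketch: the level list beyond QV_HW_OBJ_CORE/QV_HW_OBJ_PU, the use of qv_scope_nobjs as in the examples above, and the variable names are assumptions; how to find the starting level is the open question discussed next):

/* Hedged sketch: implement qv_thread_scope_split() on top of
   qv_thread_scope_split_at() by walking down the hierarchy until
   some level has at least npieces objects. Starting from the
   scope's *current* level is elided here; see the discussion below. */
static const qv_hw_obj_type_t levels[] = {
  QV_HW_OBJ_NUMANODE, QV_HW_OBJ_L3CACHE,  /* assumed type names */
  QV_HW_OBJ_L1CACHE, QV_HW_OBJ_CORE, QV_HW_OBJ_PU
};
const int nlevels = sizeof(levels) / sizeof(levels[0]);

for (int l = 0; l < nlevels; l++) {
  int nobjs;
  qv_scope_nobjs(ctx, scope, levels[l], &nobjs);
  if (nobjs >= npieces)
    return qv_thread_scope_split_at(ctx, scope, levels[l],
                                    group_id, nthreads, subscope);
}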

GuillaumeMercier commented 3 months ago

Yes, but how do you get the current level? That's my whole point. I don't see a way to do this currently. @samuelkgutierrez: do you have an idea?

samuelkgutierrez commented 3 months ago

Would something like this work, @GuillaumeMercier?

Get the current level by first getting the input scope's cpuset: qvi_hwpool_cpuset_get(scope->hwpool) (example from https://github.com/hpc/quo-vadis/blob/master/src/qvi-scope.cc#L470-L476).

Then operate on the cpuset as necessary.

If this doesn't work for your use case, then we should have a talk offline.
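
For illustration, a hedged sketch of that suggestion (qvi_hwpool_cpuset_get is from the linked source; the hwloc topology handle topo and the exact return type are assumptions; the counting call is standard hwloc):

/* Inspect the input scope's cpuset to infer what level we are
   working at. Obtaining the hwloc topology handle (topo) is
   assumed here. */
hwloc_const_cpuset_t cpuset = qvi_hwpool_cpuset_get(scope->hwpool);

/* Standard hwloc query: how many cores does this cpuset cover? */
int ncores = hwloc_get_nbobjs_inside_cpuset_by_type(
    topo, cpuset, HWLOC_OBJ_CORE
);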

GuillaumeMercier commented 3 months ago

I'll take a look and let you know. A question, though: would it be relevant to store the type information in the scope structure directly?

samuelkgutierrez commented 3 months ago

Can you expand on what you mean by 'type information?'

GuillaumeMercier commented 3 months ago

Sure. The scope is "associated" (for lack of a better word) with some resources, and those resources have a type (QV_HW_OBJ_XXX of type qv_hw_obj_type_t). So, would it be relevant to have a QV_HW_OBJ_XXX for a scope?

samuelkgutierrez commented 3 months ago

Oh, I see.

A scope is associated with a hardware pool that manages the low-level details of the types of resources that a scope manages. In that way, a scope already has that type information associated with it. More directly to your question: the cpuset associated with a hardware pool more generally describes the CPU resources tied to that scope. We have functions that allow us to query the types of resources covered by that cpuset, but there may not be a perfect one-to-one correspondence, since the cpuset could be irregular, potentially covering multiple simple types that we expose to a user.

If you're interested, let's just have a call; send me an email. That will be much quicker, I think.

GuillaumeMercier commented 3 months ago

I can't make a call today. But as I wrote, I'll try with the information you gave me (very helpful). Once I've tried to code something, it will be more interesting to talk about this point again. As for irregular cpuset management, I recently coded something for that in my Hsplit library, to support an MPI 4.1 feature, which could be applicable here too.

samuelkgutierrez commented 3 months ago

Sounds like a plan, @GuillaumeMercier.

samuelkgutierrez commented 3 months ago

@GuillaumeMercier qvi_global_split() is a nice entry point that shows the different splitting and info functions called from a different context.

GuillaumeMercier commented 3 months ago

> @GuillaumeMercier qvi_global_split() is a nice entry point that shows the different splitting and info functions called from a different context.

What about global_split_user_defined? Also, a pointer to an object of type qvi_hwloc_t can be obtained through either a context or a scope. What is the difference (if any)?

samuelkgutierrez commented 3 months ago

global_split_user_defined() implements splitting of resources based on a user-defined coloring (e.g., from qv_scope_split()). There are other examples, like global_split_affinity_preserving(), that automatically split resources based on other criteria. The eventual goal is for you to be able to use this infrastructure for your mapping. It might work now, but I couldn't say for sure; it may need some work.

To your second question: both should point to the same qvi_hwloc_t instance.

GuillaumeMercier commented 3 months ago

I saw this comment in qvi_global_split:

// If all the values are negative and equal, then auto split. If not, then
// we were called with an invalid request. 

Is it really invalid? Maybe I'm biased by MPI, but a process might not want to be part of a split, which it indicates by giving MPI_UNDEFINED (usually coded as a negative value) as its color. You then have a legitimate mix of positive and negative color values. Do we forbid this kind of behaviour in QV?
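
For reference, the MPI behaviour alluded to here: a rank that passes MPI_UNDEFINED to MPI_Comm_split simply gets MPI_COMM_NULL back while the remaining ranks are split normally, e.g.:

/* MPI analogy: odd ranks opt out of the split by passing
   MPI_UNDEFINED; they receive MPI_COMM_NULL. */
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

int color = (rank % 2 == 0) ? 0 : MPI_UNDEFINED;
MPI_Comm newcomm;
MPI_Comm_split(MPI_COMM_WORLD, color, rank, &newcomm);
/* newcomm == MPI_COMM_NULL for the ranks that passed MPI_UNDEFINED. */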

samuelkgutierrez commented 3 months ago

As of today, the comment is correct because we don't implement those semantics just yet.

GuillaumeMercier commented 3 months ago

I had my epiphany today about this split operation and:

  1. Thread split and regular split are different operations: one is local while the other is collective (global). Therefore, having two different functions makes total sense.
  2. Consequently, I don't think that looking at the other (collective) split operations for ideas on how to implement the thread version is the way to go. For instance, instead of using split_hardware_resources, I was thinking of looking at qvi_hwloc_split_cpuset_by_color, used in global_split_user_defined. @samuelkgutierrez, what is your take on this?

samuelkgutierrez commented 3 months ago

You are correct that split_hardware_resources() is collective, but that function should work for both local and distributed calls because of how the group code is designed. For example, the process interface is able to use the same infrastructure as the MPI interface because of this design.

I would suggest first seeing if the qvi_scope_split() and qvi_scope_split_at() interfaces will work for you. I think they should. If those don't seem to work, digging deeper into the interfaces might be okay initially (to get something implemented). However, the top-level calls should be expressive enough to handle your case as well. If not, we should probably fix them.

samuelkgutierrez commented 3 months ago

I just realized something, @GuillaumeMercier. qvi_scope_split() et al. are limited to returning a single child, so you'll likely have to go another route for now because of this limitation.

GuillaumeMercier commented 3 months ago

Yes, I know, but I was thinking of calling the routine num_colors times (which is a very naïve way of doing things).
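
Sketched out, that naive approach might look like this (the qvi_scope_split() argument list here is an assumption; as noted above, each call returns a single child):

/* Naive sketch: call the single-child split once per color.
   The qvi_scope_split() signature used here is assumed. */
qv_scope_t **children = calloc(num_colors, sizeof(qv_scope_t *));
for (int c = 0; c < num_colors; c++)
  qvi_scope_split(scope, num_colors, c, &children[c]);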

samuelkgutierrez commented 3 months ago

Ah, that might work! I hope it does.

samuelkgutierrez commented 3 months ago

cmake -DCMAKE_BUILD_TYPE=Debug should enable debug symbols. I find that using ccmake's TUI is nice to fine-tune the configuration, but it takes a little getting used to.

GuillaumeMercier commented 3 months ago

Aah. I was looking for this: ccmake.

samuelkgutierrez commented 3 months ago

make VERBOSE=1 is also nice for debugging the build process.
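
Putting those together, a typical out-of-source debug build might look like this (directory layout assumed):

mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Debug ..   # enable debug symbols
make VERBOSE=1                      # print the full compiler command lines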

GuillaumeMercier commented 2 months ago

> I would suggest first seeing if the qvi_scope_split() and qvi_scope_split_at() interfaces will work for you. I think they should. If those don't seem to work, digging deeper into the interfaces might be okay initially (to get something implemented). However, the top-level calls should be expressive enough to handle your case as well. If not, we should probably fix them.

The current issue with qvi_scope_split() and qvi_scope_split_at() is that they call split_hardware_resources, which fails when a color value is larger than the number of requested splits. This limit should be lifted: for instance, I should be able to use colors 123 and 456 and ask for two splits. Also, I don't understand why qvi_scope_split needs a number of colors as an argument: that number should be determined by the color arguments provided by the calling processes. Since this is a collective call, this piece of information could be determined and exchanged. Or maybe I am missing something? I admit that I'm heavily biased by MPI_Comm_split and MPI_Comm_split_type here.

samuelkgutierrez commented 2 months ago

All good points, @GuillaumeMercier. Let's add these issues to our meeting's agenda.

GuillaumeMercier commented 2 months ago

@samuelkgutierrez : I put this comment so that we can think about it before the meeting.

GuillaumeMercier commented 2 months ago

And yes, I'm stuck on the implementation...