eleon opened this issue 10 months ago
What is proposed here simplifies the original design while avoiding the coexistence of two different paradigms (original QV + layouts). I think the current examples will need some marginal modifications in both the Pthreads and OpenMP cases. One potential issue is that `qv_scope_get` only accepts intrinsic scopes as an argument at the moment, and we might want to relax this. See my comment here.
Perhaps, we can discuss this in the next meeting?
I don't follow why `qv_scope_get` is needed in the example (I also need to understand the need for `qv_thread_context_create`). If a scope is already defined, e.g., `mpi_numa_scope`, why not use it directly?
> Perhaps, we can discuss this in the next meeting?

Yes, I think so.

> I don't follow why `qv_scope_get` is needed in the example (I also need to understand the need for `qv_thread_context_create`). If a scope is already defined, e.g., `mpi_numa_scope`, why not use it directly?
Please bear in mind that this example was written with the old "way" in mind. `qv_thread_context_create` was introduced to cache information and possibly reuse it later instead of recomputing it for each parallel section. IIRC, I needed a scope for each thread, hence the use of `qv_scope_get`. But maybe my example was too convoluted and could be simplified. Anyway, should all threads use the same context (as you do in your example)? Is it an issue?
Thanks for clarifying, @GuillaumeMercier. Yes, once we make more progress on the new thread interface, it would be nice to simplify the example :) I can't think of any issues with all threads using the same context.
New names should be:

- `qv_thread_scope_split(...)`
- `qv_thread_scope_split_at(...)`
I'm trying to code both routines and I've got a few questions:

1. Why is `th_size` passed as a pointer? In which cases are `split` or `split_at` going to modify its value?
2. No type is passed to `split`. Am I expected to retrieve this info from the scope? If yes, how do I achieve this, since in the regular (i.e., non-`thread`) version of `split` a type is always passed as an argument?

@GuillaumeMercier, this was the first draft of the interface. The latest one is in issues #68 and #69 (when we were discussing merging the process and thread interfaces). Please see below and let me know if you still have questions :)
```c
int
qv_thread_scope_split(
    qv_context_t *ctx,
    qv_scope_t *scope,
    int npieces,
    int *group_id,
    int nthreads,
    qv_scope_t ***subscope
);

int
qv_thread_scope_split_at(
    qv_context_t *ctx,
    qv_scope_t *scope,
    qv_hw_obj_type_t type,
    int *group_id,
    int nthreads,
    qv_scope_t ***subscope
);
```
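For concreteness, a hypothetical call of each routine might look as follows; the `group_ids` values, the reuse of `subscopes`, and the error handling are illustrative assumptions, not part of the draft (`QV_HW_OBJ_CORE` comes from the existing `qv_hw_obj_type_t` family):

```c
/* Hypothetical usage of the two draft routines above; values and
 * error handling are illustrative only. */
#include "quo-vadis.h"

void split_examples(qv_context_t *ctx, qv_scope_t *scope)
{
    int group_ids[4] = {0, 0, 1, 1}; /* two threads per subscope */
    qv_scope_t **subscopes = NULL;

    /* Split into 2 pieces; QV decides which resources back each piece. */
    qv_thread_scope_split(ctx, scope, 2, group_ids, 4, &subscopes);

    /* Or split at an explicit resource type, e.g., cores. */
    qv_thread_scope_split_at(ctx, scope, QV_HW_OBJ_CORE,
                             group_ids, 4, &subscopes);
}
```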
@eleon: OK, thanks. But my second point is still valid.
Indeed, @GuillaumeMercier. For now, I'd suggest starting with `qv_thread_scope_split_at`, which does have a type.

For `qv_thread_scope_split`, users do not have to think in terms of resource types, but in terms of how many pieces to split the input scope into. One could potentially implement `qv_thread_scope_split` on top of `qv_thread_scope_split_at`. Happy to talk more about this in the next call.
Yes, it's exactly my plan. But if you want to implement `split` on top of `split_at`, you need to retrieve the type you're currently "working on" somehow, hence my second point.
Understood, @GuillaumeMercier. One possible way to implement `split` on top of `split_at` is to start with the NUMA type if there are enough NUMA nodes to satisfy the number of pieces. Otherwise, pick the next level down the memory hierarchy (e.g., L3) and so on, until the L1 cache/core level. We can discuss in detail on the next call :)
Hmm. Not sure about that. To me, the logical thing would be to split from the current level of the context. Otherwise, what is the point of having it as a parameter? Or am I missing something?
That makes sense, @GuillaumeMercier. Then, I would say start from the current level and go down the hierarchy until you match the number of pieces or reach the lowest level (PUs).
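A rough sketch of that descent, assuming `qv_scope_nobjs()` from the existing process-level API and a fixed ordering of the `QV_HW_OBJ_XXX` levels (both assumptions; the walk would really need to begin at the scope's current level, which is the open question below):

```c
/* Sketch: walk down the hierarchy until some level offers at least
 * npieces objects. The level ordering is illustrative; finding the
 * right starting index (the scope's "current level") is still open. */
qv_hw_obj_type_t pick_split_type(qv_context_t *ctx, qv_scope_t *scope,
                                 int npieces)
{
    static const qv_hw_obj_type_t levels[] = {
        QV_HW_OBJ_NUMANODE, QV_HW_OBJ_L3CACHE, QV_HW_OBJ_L2CACHE,
        QV_HW_OBJ_L1CACHE, QV_HW_OBJ_CORE, QV_HW_OBJ_PU
    };
    const int nlevels = (int)(sizeof levels / sizeof levels[0]);

    for (int i = 0; i < nlevels; i++) {
        int n = 0;
        if (qv_scope_nobjs(ctx, scope, levels[i], &n) == QV_SUCCESS &&
            n >= npieces) {
            return levels[i];
        }
    }
    return QV_HW_OBJ_PU; /* fall back to the lowest level (PUs) */
}
```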
Yes, but how do you get the current level? That's my whole point. I don't see a way to do this currently. @samuelkgutierrez: do you have an idea?
Would something like this work, @GuillaumeMercier?

1. Get the current level by first getting the input scope's cpuset: `qvi_hwpool_cpuset_get(scope->hwpool)` (example from https://github.com/hpc/quo-vadis/blob/master/src/qvi-scope.cc#L470-L476).
2. Then operate on the cpuset as necessary.

If this doesn't work for your use case, then we should have a talk offline.
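If it helps, a minimal sketch of step 1, assuming direct access to the underlying `hwloc_topology_t` (the hwloc calls are standard; how you reach the topology from a context or scope is left out):

```c
/* Sketch: find the smallest hwloc object covering the scope's cpuset;
 * its type can serve as the scope's "current level". */
#include <hwloc.h>

hwloc_obj_type_t current_level(hwloc_topology_t topo,
                               hwloc_const_cpuset_t scope_cpuset)
{
    hwloc_obj_t obj = hwloc_get_obj_covering_cpuset(topo, scope_cpuset);
    /* NULL means an empty cpuset (or one outside the topology). */
    return obj ? obj->type : HWLOC_OBJ_MACHINE;
}
```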
I'll take a look and let you know. A question, though: would it be relevant to store the type information in the scope structure directly?
Can you expand on what you mean by 'type information?'
Sure. The scope is "associated" (for lack of a better word) with some resources, which have a type (`QV_HW_OBJ_XXX` of type `qv_hw_obj_type_t`). So, would it be relevant to have a `QV_HW_OBJ_XXX` for a scope?
Oh, I see.
A scope is associated with a hardware pool that manages the low-level details of the types of resources that a scope manages. In that way, a scope already has that type information associated with it. More directly to your question: the cpuset associated with a hardware pool more generally describes the CPU resources tied to that scope. We have functions that allow us to query the types of resources covered by that cpuset, but there may not be a perfect one-to-one correspondence, since the cpuset could be irregular, potentially covering multiple simple types that we expose to a user.
If you're interested, let's just have a call; send me an email. It will be much quicker, I think.
I can't make a call today. But as I wrote, I'll try with the information you gave me (very helpful). Once I've tried to code something, it will be more interesting to talk about this point again. As for irregular cpuset management, I recently coded something for that in my Hsplit library, to support an MPI 4.1 feature, and it could be applicable here too.
Sounds like a plan, @GuillaumeMercier.
@GuillaumeMercier, `qvi_global_split()` is a nice entry point that shows the different splitting and info functions called from a different context.
What about `global_split_user_defined`?

Also, a pointer to an object of type `qvi_hwloc_t` can be obtained through either a context or a scope. What is the difference (if any)?
`global_split_user_defined()` implements splitting of resources based on a user-defined coloring (e.g., from `qv_scope_split()`). There are other examples, like `global_split_affinity_preserving()`, that automatically split resources based on other criteria. The eventual goal is for you to be able to use this infrastructure for your mapping. It might work now, but I couldn't say for sure; it may need some work.

To your second question: both should point to the same `qvi_hwloc_t` instance.
I saw this comment in `qvi_global_split`:

```c
// If all the values are negative and equal, then auto split. If not, then
// we were called with an invalid request.
```

Is it really invalid? Maybe I'm biased by MPI, but a process might not want to be part of a split, and would express that by giving `MPI_UNDEFINED` (usually coded as a negative value) as its color. You then have a mix of positive and negative values for the colors, and it's legit. Do we forbid this kind of behaviour in QV?
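For reference, the standard `MPI_Comm_split` behaviour being described:

```c
/* Standard MPI semantics: ranks passing MPI_UNDEFINED opt out of the
 * split and receive MPI_COMM_NULL; the others are grouped by color. */
#include <mpi.h>

void split_even_ranks(MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    /* Odd ranks opt out; even ranks form one subcommunicator. */
    int color = (rank % 2 == 0) ? 0 : MPI_UNDEFINED;

    MPI_Comm newcomm;
    MPI_Comm_split(comm, color, rank, &newcomm);

    if (newcomm == MPI_COMM_NULL) {
        /* This rank is not part of any resulting communicator. */
    }
}
```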
As of today, the comment is correct because we don't implement those semantics just yet.
I had my epiphany today about this split operation: since `split_hardware_resources` is collective, I was thinking of looking at `qvi_hwloc_split_cpuset_by_color`, which is used in `global_split_user_defined`. @samuelkgutierrez, what is your take on this?

You are correct that `split_hardware_resources()` is collective, but that function should work for both local and distributed calls because of how the group code is designed. For example, the process interface is able to use the same infrastructure as the MPI interface because of this design.
I would suggest first seeing if the `qvi_scope_split()` and `qvi_scope_split_at()` interfaces will work for you. I think they should. If those don't seem to work, digging deeper into the interfaces might be okay initially (to get something implemented). However, the top-level calls should be expressive enough to handle your case as well. If not, we should probably fix them.
I just realized something, @GuillaumeMercier: `qvi_scope_split()` et al. are limited to returning a single child, so you'll likely have to go another route for now because of this limitation.
Yes, I know, but I was thinking about calling the routine `num_colors` times (which is a very naïve way of doing things).
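Something along these lines; the `qvi_scope_split()` signature is assumed here for illustration, not taken from the actual internal code:

```c
/* Naive sketch: call the single-child split once per color. */
int split_all_colors(qv_scope_t *parent, int num_colors,
                     qv_scope_t **children)
{
    for (int color = 0; color < num_colors; color++) {
        /* Each call returns only the child scope for this color. */
        int rc = qvi_scope_split(parent, num_colors, color,
                                 &children[color]);
        if (rc != QV_SUCCESS) return rc;
    }
    return QV_SUCCESS;
}
```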
Ah, that might work! I hope it does.
`cmake -DCMAKE_BUILD_TYPE=Debug` should enable debug symbols. I find that using `ccmake`'s TUI is nice for fine-tuning the configuration, but it takes a little getting used to.

Aah. I was looking for this: `ccmake`.

`make VERBOSE=1` is also nice for debugging the build process.
> I would suggest first seeing if the `qvi_scope_split()` and `qvi_scope_split_at()` interfaces will work for you. I think they should. If those don't seem to work, digging deeper into the interfaces might be okay initially (to get something implemented). However, the top-level calls should be expressive enough to handle your case as well. If not, we should probably fix them.
The current issue with `qvi_scope_split()` and `qvi_scope_split_at()` is that they call `split_hardware_resources`, which fails when a color value is larger than the number of requested splits. This limit should be lifted: for instance, I should be able to use colors 123 and 456 and ask for two splits. Also, I don't understand why `qvi_scope_split` needs a number of colors as an argument: this number should be determined by the color arguments provided by the calling processes. Since this is a collective call, this piece of information could be determined and exchanged. Or maybe I am missing something? I admit that I'm heavily biased by `MPI_Comm_split` and `MPI_Comm_split_type` here.
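For what it's worth, a sketch of how the number of colors could be derived collectively, MPI-style (illustrative only):

```c
/* Sketch: derive the number of distinct colors from one collective
 * exchange instead of requiring it as an argument. Negative colors
 * (MPI_UNDEFINED-style opt-outs) are not counted. */
#include <mpi.h>
#include <stdlib.h>

int count_distinct_colors(MPI_Comm comm, int my_color)
{
    int nprocs;
    MPI_Comm_size(comm, &nprocs);

    int *colors = malloc(nprocs * sizeof *colors);
    MPI_Allgather(&my_color, 1, MPI_INT, colors, 1, MPI_INT, comm);

    int ncolors = 0;
    for (int i = 0; i < nprocs; i++) {
        if (colors[i] < 0) continue;          /* opted out */
        int seen = 0;
        for (int j = 0; j < i; j++) {
            if (colors[j] == colors[i]) { seen = 1; break; }
        }
        if (!seen) ncolors++;                 /* first occurrence */
    }
    free(colors);
    return ncolors;
}
```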
All good points, @GuillaumeMercier. Let's add these issues to our meeting's agenda.
@samuelkgutierrez: I put this comment so that we can think about it before the meeting.

And yes, I'm stuck with the implementation...
Fine-grain affinity for threads will use abstractions similar to those used for processes in QV. Here are a few examples for OpenMP and POSIX threads; these assume the process already has a NUMA scope defined. A sketch of example 1 follows the list.

- OpenMP example 1: launch one thread per core.
- OpenMP example 2: launch two threads per core.
- POSIX threads example: see the test linked below.
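As a rough illustration of example 1, assuming the draft `qv_thread_scope_split_at()` discussed above plus `qv_scope_nobjs()` and `qv_bind_push()` from the process interface (passing NULL group IDs to mean "assign automatically" is also an assumption):

```c
/* Sketch of OpenMP example 1: one thread per core of a NUMA scope. */
#include <omp.h>
#include "quo-vadis.h"

void one_thread_per_core(qv_context_t *ctx, qv_scope_t *numa_scope)
{
    int ncores = 0;
    qv_scope_nobjs(ctx, numa_scope, QV_HW_OBJ_CORE, &ncores);

    qv_scope_t **th_scopes = NULL;
    qv_thread_scope_split_at(ctx, numa_scope, QV_HW_OBJ_CORE,
                             NULL /* auto group IDs */, ncores,
                             &th_scopes);

    #pragma omp parallel num_threads(ncores)
    {
        /* Each thread binds to the resources of its own subscope. */
        qv_bind_push(ctx, th_scopes[omp_get_thread_num()]);
        /* ... per-core computation ... */
    }
}
```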
This issue is related to issues #68 and #69. We want a single `qv_scope_split` and a single `qv_scope_split_at`, i.e., no special `qv_scope_th_split`. An actual Pthreads example can be found here: https://github.com/hpc/quo-vadis/blob/master/tests/test-progress-thread.c