Closed minsii closed 5 months ago
(quotes: emphasis mine)
> In the OpenSHMEM portion initialized by a `shmem_init_thread` call, a subsequent call to `shmem_init_thread` may be unsuccessful and has no subsequent effect if the OpenSHMEM runtime cannot change the internal thread support; a subsequent call to `shmem_init` has no subsequent effect.
I think we were careful in specifying that the success or failure of `shmem_init_thread` is not strictly tied to the thread level that a library provides. That is to say, a subsequent call to `shmem_init_thread` may be considered "successful" even if it does not change the thread-support level. Is it intended to change this?
Is it (expected to be) permitted that `shmem_init_thread` can raise but not lower the thread level?
@nspark Thanks for the comments. I somehow misunderstood the condition of a successful `shmem_init_thread`. I deleted the word "unsuccessful" in Change in specification-4.
For changing the thread level on subsequent calls to `shmem_init_thread` within the same OpenSHMEM portion, I originally considered letting the implementation support whatever it can (i.e., raising or lowering the level). On second thought, however, this is unrelated to the intent of this ticket: we only want to enable subsequent init/finalize calls, not to change the thread level within a portion. Thus, I changed Change in specification-4 to make the thread level unchangeable, for simplicity.
@minsii Clarifying my question from the Threads WG today. Per item 3 in the outlined spec changes, there will only be one instance of the library active. This issue does not capture the use case where two different OpenSHMEM libraries are used concurrently; what we agreed was that this is already a supported use by the spec. Also, as far as how to support `shmem_` symbols from two libraries, I mentioned prefixing of API names. That could be done by the compiler as well. For example, `__device__ shmem_int_p` is only callable from an NVIDIA GPU. There could be similar annotations for other GPUs or FPGAs. Even though communication APIs can exist for different target processors, these libraries may still rely on `shmem_init` and `shmem_finalize` being called by the host process.
> This issue does not capture the use case where two different OpenSHMEM libraries are used concurrently. What we agreed was that this is already a supported use by the spec.
AFAIU, the specification doesn't speak about any interoperability between different SHMEM implementations. To simplify the use case provided above, we could ask whether it is possible to use RMA/AMO from one implementation and just collectives from another. The spec doesn't allow this; or rather, the spec doesn't address this usage model.
@anshumang Thanks for the clarification. If I understand your use case correctly, the SHMEM init/finalize calls are still performed only on the host process, and multiple devices (e.g., GPUs) may share a single SHMEM portion initialized by the host process. Is that correct? So the program might look like:

CPU side:

```c
shmem_init();
/* wait until GPUs finish work */
shmem_finalize();
```
GPU side:

```c
/* computation */
shmem_int_p();
shmem_quiet();
...
```
I think both of us are OK that the second init call in `init, init, finalize, finalize` has no effect. The confusing part is the definition of a "single OpenSHMEM portion" (i.e., you think the above program initializes one portion on the CPU and a different portion on the GPU). Am I correct?
I think the "portion" word is not well defined in this proposal. If we consider it an isolated communication environment (e.g., each portion always uses different network resources, and SHMEM synchronization in one portion does not interfere with the other), then the above program might contain two "portions" (i.e., the CPU initializes two sets of network resources, one for the CPU and the other for the GPU, and `shmem_quiet` on the GPU only synchronizes RMA/AMO issued on the GPU).
As @naveen-rn said, the current spec does not yet define the above model. I am not sure how to accurately describe this kind of communication environment. Let me think about how we can work around it in the proposal.
@minsii Thanks for expanding on this. This is very useful. My comments below.
> multiple devices (e.g., GPUs) may share a single SHMEM portion initialized by the host process

It does not have to be multiple GPUs. It could be a single GPU per host that calls into a SHMEM library, where that library is initialized by the CPU. Alongside, the CPU thread could also use another CPU-side SHMEM library.
So I will modify the CPU-side code example from

```c
shmem_init();
/* wait until GPUs finish work */
shmem_finalize();
```

to

```c
/* only GPU-side SHMEM */
shmemx_init();  // GPU-side SHMEM
/* wait until GPUs finish work, or do GPU-side SHMEM based communication/synchronization */
shmemx_barrier_all();
shmemx_finalize();
```
and

```c
/* both CPU-side and GPU-side SHMEM */
shmem_init();   // CPU-side SHMEM
shmemx_init();  // GPU-side SHMEM
/* wait until GPUs finish work, or do CPU- and GPU-side SHMEM based communication/synchronization */
shmem_int_p(ptr, my_pe, (my_pe + 1) % n_pes);
shmem_barrier_all();
shmemx_barrier_all();
shmemx_finalize();
shmem_finalize();
```
> The confusing part is the definition of "single OpenSHMEM portion" (i.e., you think the above program initializes a portion on the CPU and a different portion on the GPU). Am I correct?
Correct
@anshumang What I'm confused by is this notion of GPU-side OpenSHMEM calls being something different from the regular SHMEM calls. Aren't kernel-initiated operations just another feature of a SHMEM library? Let's assume OpenSHMEM implementation A, which supports both CPU-side and GPU-side operations, while implementation B supports only CPU-side operations. For users to interoperate them, both implementations need to provide some form of support. At least in this case, implementation A has to block its CPU-side operations. I'm not sure whether any implementation is capable of doing this. Exposing features through implementation-specific APIs is beyond the control of the specification.
Please correct me if my understanding is wrong.
@naveen-rn It makes sense for kernel-initiated SHMEM calls to operate out of a symmetric heap physically located in GPU memory. Support for CPU-side and GPU-side operations means that both kinds of calls use the GPU SHEAP. As such, an app may need to use another SHMEM library to use a SHEAP backed by system memory. The interoperability support could come for "free" if the GPU-side SHMEM library only provides GPU-side operations (quite possible). Compiler annotations would distinguish the SHMEM APIs in the GPU-side library from those in the CPU-side library. In such an interop scenario, do you think there is something for the spec to clarify?
AFAIU, you are referring to interoperating two different OpenSHMEM implementations. In general, I feel that supporting this usage model requires a broader look at the specification; a small change like this PR wouldn't be sufficient. Also, I don't think this is in the scope of the current proposal (@minsii, correct me if I'm wrong).
> Compiler annotations are going to distinguish the SHMEM APIs in the GPU-side library from those in the CPU-side library.
If I understand correctly, this will work only when one of the implementations supports GPU-side kernel-initiated operations and the other supports only CPU-side operations. To change the previous example: if both implementations A and B support CPU- and GPU-initiated operations, I don't think compilers could differentiate operations from the two implementations.
If we need to support this usage model, then we would require different levels of OpenSHMEM compliance, where an implementation could be designed in such a modular way that users could pick and choose features from different implementations and interoperate. For example, we might say that there are three levels of OpenSHMEM compliance:
PS: I would prefer not to go in that direction; I am just stating an example based on my understanding.
I think there are only a few conflict-causing APIs, i.e., APIs that would reasonably always be invoked from the CPU thread for the CPU or other backends (GPU, FPGA). These APIs are `init`, `finalize`, `malloc`, and `free`. All other APIs could be backend-specific and thus may co-exist in the same app without conflicts. I am inclined to say that the conformance level is an orthogonal issue that makes sense only for communication APIs (referencing #231). Even if all communication APIs were assigned a single level of conformance, defining the interop of `init`, `finalize`, `malloc`, and `free` remains open.
@anshumang According to your code examples, all the GPU-side APIs (e.g., `shmemx_init|finalize`) are implementation-specific extensions. I do not think they have to be restricted by the semantics defined in the SHMEM standard spec. The implementation is free to define that `shmemx_init` initializes the GPU-side SHMEM communication resources, while `shmem_init` initializes the CPU-side resources.
I agree with @naveen-rn that the interoperability of two SHMEM implementations is out of the scope of this proposal, and we will need a much broader discussion covering all SHMEM APIs (not only init/finalize/malloc/free) if we want to support this case in the specification.
> I think there are only a few conflict-causing APIs, i.e., APIs that would reasonably always be invoked from the CPU thread for the CPU or other backends (GPU, FPGA). These APIs are `init`, `finalize`, `malloc`, and `free`. All other APIs could be backend-specific and thus may co-exist in the same app without conflicts.
I would think that this is only one implementation approach; others may exist (e.g., what if one wants to support collectives among CPUs and devices?). This cannot be addressed by simply considering a subset of SHMEM APIs.
@minsii

> According to your code examples, all the GPU-side APIs (e.g., `shmemx_init|finalize`) are all implementation-specific extensions.
I wrote them as `shmemx_` because there can only be one `shmem_init` symbol in the host binary. A non-CPU SHMEM backend can have `shmem_init` and `shmemx_init`, and the user can choose to use `shmem_init` if interop is not required.
> I agree with @naveen-rn that the interoperability of two SHMEM implementations is out of the scope of this proposal, and we will need a much broader discussion covering all SHMEM APIs (not only init/finalize/malloc/free) if we want to support this case in the specification.
Agree that this needs to be covered in a separate proposal. I think `init/finalize/malloc/free` is a good starting point for a valid use case.
Trying to draft some text; here's what I have so far. (Note: these are not complete descriptions of the listed routines; I've tried to limit them to the most relevant sections.)

For `shmem_init_thread`:
> An OpenSHMEM program is initialized either by `shmem_init` or `shmem_init_thread`. A call to `shmem_init[_thread]` atomically increments an internal counter for the number of invocations of operations that successfully initialize the OpenSHMEM library. At program startup, this reference count is zero. At the end of the OpenSHMEM program, each successful initialization operation shall have a corresponding call to `shmem_finalize` (i.e., the initialization count will return to zero); otherwise, the behavior is undefined.
>
> Only the first call to `shmem_init[_thread]` initializes the OpenSHMEM library; i.e., when the initialization count increments from 0 → 1. Subsequent calls increment the initialization count and invoke a routine semantically equivalent to `shmem_barrier_all` before returning.
>
> The `shmem_init_thread` routine is thread-safe. When invoked concurrently by multiple threads, `shmem_init_thread` shall not return in any thread until the OpenSHMEM library is in an initialized state.
For `shmem_finalize`:
> A call to `shmem_finalize` atomically decrements an internal counter for the number of invocations of operations that successfully initialize the OpenSHMEM library. At program startup, this reference count is zero. At the end of the OpenSHMEM program, each successful initialization operation shall have a corresponding call to `shmem_finalize` (i.e., the initialization count will return to zero); otherwise, the behavior is undefined.
>
> On each call to `shmem_finalize`, the OpenSHMEM library invokes a routine semantically equivalent to `shmem_barrier_all` before decrementing the initialization counter. Once the initialization counter returns to zero, the OpenSHMEM library releases all internal resources. [...quiet, teams, context...]
Some thoughts on what's missing:

- Calls after `shmem_finalize` are still valid provided it wasn't the call that decrements the init counter to zero; i.e., init-init-finalize-put-finalize is valid (but poor form unless implicit via libraries).
- Need to preclude reinitialization (e.g., init-finalize-init-finalize) but not "nested reinitialization" (e.g., init-init-finalize-init-finalize-finalize).
- Do we still need `shmem_initialized` and `shmem_finalized`?
  - `shmem_initialized`: has the library ever been initialized?
  - `shmem_finalized`: has every initialization operation been matched with a `shmem_finalize`?

@nspark Thanks for driving the draft. The text looks great to me.
A few comments on the missing items.
> Need to preclude reinitialization (e.g., init-finalize-init-finalize) but not "nested reinitialization" (e.g., init-init-finalize-init-finalize-finalize).
Not sure if I understand it correctly: is your intention to support reinitialization? I am afraid that it is hard for many implementations.
> Do we still need `shmem_initialized` and `shmem_finalized`?
I feel they are still useful. E.g., the user program may want to check whether any library has initialized SHMEM so that it can issue a PUT.
> > Need to preclude reinitialization (e.g., init-finalize-init-finalize) but not "nested reinitialization" (e.g., init-init-finalize-init-finalize-finalize).
>
> Not sure if I understand it correctly, is your intention to support reinitialization? I am afraid that it is hard for many implementations.
No. But if an init-finalize pair happens inside another init-finalize pair, the inner finalize shouldn't cause the library to be "completely finalized." (I need better or more precise terminology here.)
For example, this should be allowed (time flows top to bottom; fini == finalize):

```
app     libA    libB    init-count
init                    1
        init            2
        fini            1
                init    2
                fini    1
fini                    0  → library is "finally finalized"
```
The following should also be allowed:

```
app     libA    libB    init-count
init                    1
        init            2
                init    3
                fini    2
        fini            1
fini                    0  → library is "finally finalized"
```
However, the following should be disallowed:

```
app     libA    libB    init-count
init                    1
fini                    0  → library is "finally finalized"
        init            *  → erroneous reinitialization / UB
        fini            *  → UB
```
Drafting the latest updates for these changes has me thinking: is multithreaded initialization/finalization sanely permissible? In a sense, it seemed like we were trending toward allowing:
```c
#pragma omp parallel
{
    shmem_init_thread(SHMEM_THREAD_MULTIPLE, ...);
    // ...do shmem stuff...
    #pragma omp barrier
    shmem_finalize();
}
```
However, `shmem_init[_thread]` and `shmem_finalize` each imply an operation equivalent to `shmem_barrier_all`, and we currently preclude multithreaded collective operations. Assuming such a restriction persists, is the best we can do w.r.t. safe initialization of the OpenSHMEM library by other client libraries to include a note to developers?
Multithreaded init/finalize: we can make this work by putting an init/finalize mutex into the library and only allowing one thread to enter the routine. But unless there is a use case driving it, I'd rather not distract implementors with this.
We should clearly specify that the threading level returned by a call to `shmem_init_thread` must be greater than or equal to any threading level previously returned.
> We should clearly specify that the threading level returned by a call to `shmem_init_thread` must be greater than or equal to any threading level previously returned.
Is there a use case that may need an increased thread level with multiple init calls?
Slides from today: Multiple Init_Finalize.pdf
Need to clarify that init/finalize must be called by all PEs. That is, in a nested usage case, you can't call init on a subset of the PEs.
### Goal

Allow the user program to initialize and finalize SHMEM multiple times, in order to support the scenario where SHMEM is used as the communication runtime of other libraries.
### Problem Description

The current SHMEM spec defines that (1) multiple calls to `shmem_init|shmem_init_thread` within a program result in undefined behavior, and (2) `shmem_finalize` must be the last OpenSHMEM library call encountered in the OpenSHMEM portion of a program.

The above semantics allow the program to have at most one SHMEM portion, and interleaving calls to init or finalize are illegal. Thus, SHMEM cannot be used as the communication runtime for multiple libraries. The following two examples show typical usages of init/finalize in this scenario; both are prohibited by the current spec (`FOO` and `BAR` can be either a library or the main program).

Example-1

Example-2
### Proposed Solution

#### Change in specification

1. An OpenSHMEM portion starts with a `shmem_init|shmem_init_thread` call, and ends with a call to `shmem_finalize`.
2. The first `shmem_init|shmem_init_thread` call of an OpenSHMEM portion allocates and initializes resources for OpenSHMEM communication; the last `shmem_finalize` call of the same portion releases all resources initialized in this portion.
3. Calling `shmem_init|shmem_init_thread` more than once within a program is permitted. However, calling `init, init, finalize, finalize` in a program will only initialize a single OpenSHMEM portion, even if the calls are made by different threads.
4. A subsequent call to `shmem_init_thread|shmem_init` has no subsequent effect. The thread level cannot be changed after initialization.
5. The `provided` parameter of `shmem_init_thread|shmem_query_thread` returns the thread level initialized in the current OpenSHMEM portion.

#### Possible implementation
- Maintain a `refcount` global variable, which is increased at every `shmem_init|shmem_init_thread` call and decreased at every `shmem_finalize` call.
- The resource is allocated at a `shmem_init|shmem_init_thread` call only if `refcount==0`; the resource is released at a `shmem_finalize` call only if `refcount==0`.
#### Requirement to user

Calling `init, init, finalize` may cause unreleased resources in an OpenSHMEM portion, and subsequent calls to OpenSHMEM (except the call to `shmem_finalize`) result in undefined behavior.

### Current Progress

This issue is separated from ticket #243. See past discussion at #243.