openshmem-org / specification

OpenSHMEM Application Programming Interface
http://www.openshmem.org

Supporting multiple SHMEM initialize/finalize calls #263

Closed minsii closed 5 months ago

minsii commented 5 years ago

Goal

Allowing the user program to initialize and finalize SHMEM multiple times in order to support the scenario where SHMEM is used as the communication runtime of other libraries.

Problem Description

The current SHMEM spec defines that (1) multiple calls to shmem_init|shmem_init_thread within a program result in undefined behavior, and (2) shmem_finalize must be the last OpenSHMEM library call encountered in the OpenSHMEM portion of a program.

The above semantics allow a program to have at most one SHMEM portion, and interleaved calls to init or finalize become illegal. Thus, SHMEM cannot be used as the communication runtime for multiple libraries. The following two examples show typical usages of init/finalize in this scenario; both are prohibited by the current spec (FOO and BAR can each be either a library or the main program).

Example-1

FOO_init();  // internally calls shmem_init
BAR_init(); // internally calls shmem_init
/* computation */
FOO_finalize(); // internally calls shmem_finalize
BAR_finalize(); // internally calls shmem_finalize

Example-2

FOO_init();  // internally calls shmem_init
/* computation 1 */
FOO_finalize(); // internally calls shmem_finalize
BAR_init(); // internally calls shmem_init
/* computation 2 */
BAR_finalize(); // internally calls shmem_finalize

Proposed Solution

Change in specification

  1. Multiple OpenSHMEM portions can exist within a program. An OpenSHMEM portion begins with a call to shmem_init|shmem_init_thread and ends with a call to shmem_finalize.
  2. The first shmem_init|shmem_init_thread call of an OpenSHMEM portion allocates and initializes resources for OpenSHMEM communication; the last shmem_finalize call of the same portion releases all resources initialized in this portion.
  3. Calling shmem_init|shmem_init_thread more than once within a program is permitted. However, calling init, init, finalize, finalize in a program will only initialize a single OpenSHMEM portion, even if the calls are made by different threads.
  4. In an OpenSHMEM portion, a subsequent call to shmem_init_thread|shmem_init has no further effect. The thread level cannot be changed after initialization.
  5. The provided parameter of shmem_init_thread|shmem_query_thread returns the thread level initialized in the current OpenSHMEM portion.

Possible implementation

  1. The OpenSHMEM runtime may maintain an internal refcount global variable, which is increased at every shmem_init|shmem_init_thread call, and decreased at every shmem_finalize call.
  2. The resource is allocated and initialized at a shmem_init|shmem_init_thread call only if refcount==0 before the increment; the resource is released at a shmem_finalize call only if refcount==0 after the decrement.
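
The refcount scheme above can be sketched roughly as follows. This is only an illustration, not code from any OpenSHMEM implementation: sketch_init, sketch_finalize, runtime_allocate, runtime_release and the event counters are hypothetical names, and the real resource setup is replaced by a flag so the logic can be followed in isolation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the runtime's real setup and teardown. */
static int  refcount = 0;           /* number of unmatched init calls */
static bool resources_live = false; /* true while resources are allocated */
static int  alloc_events = 0;       /* times resources were set up */
static int  release_events = 0;     /* times resources were torn down */

static void runtime_allocate(void) { resources_live = true;  alloc_events++; }
static void runtime_release(void)  { resources_live = false; release_events++; }

/* Models shmem_init|shmem_init_thread: allocate only when the count is 0
 * before the increment. */
void sketch_init(void)
{
    if (refcount == 0)
        runtime_allocate();
    refcount++;
}

/* Models shmem_finalize: release only when the count is 0 after the
 * decrement. */
void sketch_finalize(void)
{
    assert(refcount > 0);   /* a finalize without a matching init is a user error */
    refcount--;
    if (refcount == 0)
        runtime_release();
}
```

With this sketch, the sequence init, init, finalize, finalize sets up resources once and tears them down once, and the resources stay live between the two finalize calls.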

Requirement to user

  1. The user of OpenSHMEM is responsible for ensuring that the same number of init and finalize calls are made within a program. For instance, calling init, init, finalize may leave resources unreleased in an OpenSHMEM portion, and subsequent calls to OpenSHMEM (except the call to shmem_finalize) result in undefined behavior.

Current Progress

This issue is separated from ticket #243 . See past discussion at #243.

nspark commented 5 years ago

(quotes: emphasis mine)

  1. In the OpenSHMEM portion initialized by a shmem_init_thread call, a subsequent call to shmem_init_thread may be unsuccessful and has no subsequent effect if the OpenSHMEM runtime cannot change the internal thread support; a subsequent call to shmem_init has no subsequent effect.

I think we were careful in specifying that the success or failure of shmem_init_thread is not strictly tied to the thread level that a library provides. That is to say that a subsequent call to shmem_init_thread may be considered "successful" even if it does not change the thread-support level. Is it intended to change this?

Is it (expected to be) permitted that shmem_init_thread can raise but not lower the thread level?

minsii commented 5 years ago

@nspark Thanks for the comments. I somehow misunderstood the condition for a successful shmem_init_thread. I deleted the word "unsuccessful" in Change in specification-4.

For the change of thread level at subsequent calls to shmem_init_thread in the same OpenSHMEM portion, I originally considered that the implementation can support whatever they can (i.e., can raise or lower). However, on second thought I think it should be unrelated to the intent of this ticket, as we only want to enable subsequent init/finalize calls but not change the thread level within a portion. Thus, I changed Change in specification-4 to make the thread level unchangeable for simplicity.

anshumang commented 5 years ago

@minsii Clarifying my question from Threads WG today. Per item 3 in the outlined spec changes, only one instance of the library will be active. This issue does not capture the use case where two different OpenSHMEM libraries are used concurrently. What we agreed was that this is already a use supported by the spec. Also, as for how to support shmem_ symbols from two libraries, I mentioned prefixing of API names. That could be done by the compiler as well. For example, __device__ shmem_int_p is only callable from an NVIDIA GPU. There could be similar annotations for other GPUs or FPGAs. Even though communication APIs can exist for different target processors, these libraries may still rely on shmem_init and shmem_finalize being called by the host process.

naveen-rn commented 5 years ago

This issue does not capture the use case where two different OpenSHMEM libraries are used concurrently. What we agreed was that this is already a use supported by the spec.

AFAIU, the specification doesn't speak about any interoperability between different SHMEM implementations. To simplify the use case above, we could ask whether it is possible to use RMA/AMO from one implementation and just collectives from another implementation. The spec doesn't allow this or, better said, the spec doesn't address this usage model.

minsii commented 5 years ago

@anshumang Thanks for the clarification. If I understand your use case correctly, the SHMEM init/finalize calls are still performed only on the host process, and multiple devices (e.g., GPUs) may share a single SHMEM portion initialized by the host process. Is that correct? So the program might look like:

CPU side

shmem_init();
/* wait until GPUs finish work */
shmem_finalize();

GPU side:

/* computation */
shmem_int_p();
shmem_quiet();
...

I think both of us are OK that the second init call in init init finalize finalize has no effect. The confusing part is the definition of "single OpenSHMEM portion" (i.e., you think the above program initializes a portion on CPU and a different portion on the GPU). Am I correct?

I think the word "portion" is not well defined in this proposal. If we consider it as an isolated communication environment (e.g., each portion always uses different network resources, and shmem synchronization in one portion does not interfere with the other), then the above program might contain two "portions" (i.e., the CPU initializes two sets of network resources, one for the CPU and the other for the GPU, and shmem_quiet on the GPU only synchronizes RMA/AMO issued on the GPU).

As @naveen-rn said, the current spec does not yet define the above model. I am not sure how to accurately describe this kind of communication environment. Let me think about how we can work around it in the proposal.

anshumang commented 5 years ago

@minsii Thanks for expanding on this. This is very useful. My comments below.

multiple devices (e.g., GPUs) may share a single SHMEM portion initialized by the host process

Does not have to be multiple GPUs. It could be a single GPU per host that calls into a SHMEM library but that library is initialized by the CPU. Alongside, the CPU thread could also use another CPU-side SHMEM library.

So I will modify the CPU side code example from

shmem_init();
/* wait until GPUs finish work */
shmem_finalize();

to

/*only GPU-side SHMEM*/
shmemx_init(); //GPU side SHMEM
/* wait until GPUs finish work or do GPU-side SHMEM based communication/synchronization*/
shmemx_barrier_all();
shmemx_finalize();

and

/*both CPU-side and GPU-side SHMEM*/
shmem_init(); //CPU side SHMEM
shmemx_init(); //GPU side SHMEM
/* wait until GPUs finish work or do CPU- and GPU-side SHMEM based communication/synchronization*/
shmem_int_p(ptr, my_pe, (my_pe+1)%n_pes);
shmem_barrier_all();
shmemx_barrier_all();
shmemx_finalize();
shmem_finalize();

The confusing part is the definition of "single OpenSHMEM portion" (i.e., you think the above program initializes a portion on CPU and a different portion on the GPU). Am I correct?

Correct

naveen-rn commented 5 years ago

@anshumang What confuses me is the notion that GPU-side OpenSHMEM calls are something different from the regular SHMEM calls. Aren't kernel-initiated operations just another feature of a SHMEM library? Let's assume OpenSHMEM implementation A has support for both CPU-side and GPU-side operations, while implementation B has support only for CPU-side operations. For users to interoperate them, both implementations need to provide some form of support. At least in this case, implementation A has to block its CPU-side operations. I'm not sure whether any implementation is capable of doing this. Exposing features through implementation-specific APIs is beyond the control of the specification.

Please correct me, if my understanding is wrong.

anshumang commented 5 years ago

@naveen-rn It makes sense for kernel-initiated SHMEM calls to operate out of a symmetric heap physically located in GPU memory. Support for CPU-side and GPU-side operations means that both kinds of calls use the GPU SHEAP. As such, an app may need to use another SHMEM library to use a SHEAP backed by system memory. The interoperability support could be available for "free" if the GPU-side SHMEM library only provides GPU-side operations (quite possible). Compiler annotations are going to distinguish the SHMEM APIs in the GPU-side library from those in the CPU-side library. In such an interop scenario, do you think there is something for the spec to clarify?

naveen-rn commented 5 years ago

AFAIU - you are referring to interoperating two different OpenSHMEM implementations. In general, I feel that a change to support this usage model requires a broader look at the specification. A small change like this PR wouldn't be sufficient. Also, I don't think this is in the scope of the current proposal (@minsii correct me if I'm wrong).

Compiler annotations are going to distinguish the SHMEM APIs in the GPU-side library from those in the CPU-side library.

If I understand correctly, I think this will work only when one of the implementations supports GPU-side kernel-initiated operations and the other supports only CPU-side operations. For example, to change the previous example, suppose both implementations A and B support CPU- and GPU-initiated operations. In this case, I don't think compilers could differentiate operations from these two implementations.

If we need to support this usage model, then we would require different levels of OpenSHMEM compliance, where an implementation could be designed to be modular in such a way that users could pick and choose features from different implementations and interoperate them. For example, if we say that there are three levels of OpenSHMEM compliance:

  1. Level:1 - Only RMA, AMO from CPU (generate libsma_host.so)
  2. Level:2 - Only RMA, AMO from device (generate libsma_device.so)
  3. Level:3 - Only collectives (generate libsma_coll.so)

With this requirement from the specification, if implementations are really modular (hard to achieve), then users could select different implementations for different functionality. At that point, we would require semantic changes to support different initialization and finalization operations.

PS: I would prefer not to go in this direction. Just stating an example based on my understanding.

anshumang commented 5 years ago

I think there are only a few conflict causing APIs or APIs that would reasonably be always invoked from the CPU thread for CPU or other backends (GPU, FPGA). These APIs are init, finalize, malloc and free. All other APIs could be backend specific and thus, may co-exist in the same app without conflicts. I am inclined to say that conformance level is an orthogonal issue making sense only for communication APIs (referencing #231 ). Even if all communication APIs were assigned a single level of conformance, defining interop of init, finalize, malloc and free remains open.

minsii commented 5 years ago

@anshumang According to your code examples, all the GPU-side APIs (e.g., shmemx_init|finalize) are implementation-specific extensions. I do not think they have to be restricted by the semantics defined in the SHMEM standard spec. The implementation is free to define that shmemx_init initializes the GPU-side SHMEM communication resource, while shmem_init initializes the CPU-side resource.

I agree with @naveen-rn that the interoperability of two SHMEM implementations is out of the scope of this proposal, and we will need a much broader discussion covering all SHMEM APIs (not only init/finalize/malloc/free) if we want to support this case in the specification.

I think there are only a few conflict causing APIs or APIs that would reasonably be always invoked from the CPU thread for CPU or other backends (GPU, FPGA). These APIs are init, finalize, malloc and free. All other APIs could be backend specific and thus, may co-exist in the same app without conflicts.

I would think that this is only one approach to implementation, and other approaches may exist (e.g., what if one wants to support collectives among CPUs and devices?). This cannot be addressed by simply considering a subset of SHMEM APIs.

anshumang commented 5 years ago

@minsii

According to your code examples, all the GPU-side APIs (e.g., shmemx_init|finalize) are all implementation-specific extension.

I wrote them as shmemx_ because there can only be one shmem_init symbol in the host binary. A non-CPU SHMEM backend can have shmem_init and shmemx_init and the user can choose to use shmem_init if interop is not required.

I agree with @naveen-rn that the interoperability of two SHMEM implementations is out of the scope of this proposal, and we will need a much broader discussion covering all SHMEM APIs (not only init/finalize/malloc/free) if we want to support this case in the specification.

Agree that this needs to be covered in a separate proposal. I think init/finalize/malloc/free is a good starting point for a valid usecase.

nspark commented 3 years ago

Trying to draft some text, here's what I have so far. (Note, these are not the complete description of the listed routines; I've tried to limit it to the most relevant sections.)

For shmem_init_thread:

An OpenSHMEM program is initialized either by shmem_init or shmem_init_thread. A call to shmem_init[_thread] atomically increments an internal counter for the number of invocations of operations that successfully initialize the OpenSHMEM library. At program startup, this reference count is zero. At the end of the OpenSHMEM program, each successful initialization operation shall have a corresponding call to shmem_finalize (i.e., the initialization count will return to zero), otherwise the behavior is undefined.

Only the first call to shmem_init[_thread] initializes the OpenSHMEM library; i.e., when the initialization count increments from 0 → 1. Subsequent calls increment the initialization count and invoke a routine semantically equivalent to shmem_barrier_all before returning.

The shmem_init_thread routine is thread-safe. When invoked concurrently by multiple threads, shmem_init_thread shall not return in any thread until the OpenSHMEM library is in an initialized state.

For shmem_finalize:

A call to shmem_finalize atomically decrements an internal counter for the number of invocations of operations that successfully initialize the OpenSHMEM library. At program startup, this reference count is zero. At the end of the OpenSHMEM program, each successful initialization operation shall have a corresponding call to shmem_finalize (i.e., the initialization count will return to zero), otherwise the behavior is undefined.

On each call to shmem_finalize, the OpenSHMEM library invokes a routine semantically equivalent to shmem_barrier_all before decrementing the initialization counter. Once the initialization counter returns to zero, the OpenSHMEM library releases all internal resources. [...quiet, teams, context...]
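
As a rough single-process sketch of the draft text above, the ordering of the counter updates and the implied barriers might look like the following. The names draft_init, draft_finalize and mock_barrier_all are hypothetical, the barrier is replaced by a call counter, and the requirement that no thread returns before the library is initialized is not modeled here.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int init_count = 0;   /* the draft's initialization counter */
static int barriers = 0;            /* calls to the mocked barrier */
static int library_inits = 0;       /* times the library was really initialized */
static int library_releases = 0;    /* times internal resources were released */

/* Hypothetical stand-in for a routine equivalent to shmem_barrier_all. */
static void mock_barrier_all(void) { barriers++; }

/* Models shmem_init[_thread] as described in the draft. */
void draft_init(void)
{
    /* atomic_fetch_add returns the value held before the increment. */
    if (atomic_fetch_add(&init_count, 1) == 0)
        library_inits++;        /* 0 -> 1: the only real initialization */
    else
        mock_barrier_all();     /* later calls only count and synchronize */
}

/* Models shmem_finalize as described in the draft. */
void draft_finalize(void)
{
    mock_barrier_all();         /* the barrier precedes the decrement */
    /* atomic_fetch_sub returns the value held before the decrement. */
    if (atomic_fetch_sub(&init_count, 1) == 1)
        library_releases++;     /* 1 -> 0: release internal resources */
}
```

In this sketch, two init calls followed by two finalize calls produce one initialization, one release, and three mocked barriers (one from the second init, one from each finalize).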

nspark commented 3 years ago

Some thoughts on what's missing:

minsii commented 3 years ago

@nspark Thanks for driving the draft. The text looks great to me.

A few comments on the missing items.

Need to preclude reinitialization (e.g., init-finalize-init-finalize) but not "nested reinitialization" (e.g., init-init-finalize-init-finalize-finalize).

Not sure if I understand it correctly, is your intention to support reinitialization? I am afraid that it is hard for many implementations.

Do we still need shmem_initialized and shmem_finalized?

I feel they are still useful. E.g., the user program may want to check whether any library has initialized SHMEM so that it can issue a PUT.

nspark commented 3 years ago

Need to preclude reinitialization (e.g., init-finalize-init-finalize) but not "nested reinitialization" (e.g., init-init-finalize-init-finalize-finalize).

Not sure if I understand it correctly, is your intention to support reinitialization? I am afraid that it is hard for many implementations.

No. But, if an init-finalize pair happens inside another init-finalize pair, the inner finalize shouldn't cause the library to be "completely finalized." (I need better or more precise terminology here.)

For example, this should be allowed (time flows top to bottom; fini == finalize):

app     libA    libB    init-count
init                    1
        init            2
        fini            1
                init    2
                fini    1
fini                    0 → library is "finally finalized"

The following should also be allowed:

app     libA    libB    init-count
init                    1
        init            2
                init    3
                fini    2
        fini            1
fini                    0 → library is "finally finalized"

However, the following should be disallowed:

app     libA    libB    init-count
        init            1
        fini            0 → library is "finally finalized"
                init    * → erroneous reinitialization / UB
                fini    * → UB
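
The rule illustrated by the three tables above (nesting is fine, but nothing may follow the finalize that brings the count back to zero) can be expressed as a small checker. This is an illustrative sketch only: sequence_allowed is a hypothetical helper, and encoding a call sequence as a string of 'i' (init) and 'f' (fini) characters is an assumption made for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true if a sequence of 'i' (init) and 'f' (fini) calls is allowed
 * under the rule above; any other character is rejected. */
bool sequence_allowed(const char *ops)
{
    int  count = 0;           /* the init-count column of the tables */
    bool finalized = false;   /* set once the count has returned to zero */

    for (; *ops != '\0'; ops++) {
        if (finalized)
            return false;             /* any call after the final finalize */
        if (*ops == 'i') {
            count++;
        } else if (*ops == 'f') {
            if (count == 0)
                return false;         /* finalize without a matching init */
            if (--count == 0)
                finalized = true;     /* library is "finally finalized" */
        } else {
            return false;
        }
    }
    return count == 0;                /* every init needs a matching fini */
}
```

Under this encoding the first table is "iififf", the second is "iiifff", and the disallowed third table is "ifif".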

nspark commented 3 years ago

Drafting the latest updates for these changes has me thinking: Is multithread initialization/finalization sanely permissible? In a sense, it seemed like we were trending toward allowing:

#pragma omp parallel
{
  shmem_init_thread(SHMEM_THREAD_MULTIPLE, ...);
#pragma omp parallel
  // ...do shmem stuff...
#pragma omp barrier
  shmem_finalize();
}

However, shmem_init[_thread] and shmem_finalize each imply an operation equivalent to shmem_barrier_all, and we currently preclude multithreaded collective operations. Assuming such a restriction persists, is the best we can do w.r.t. safe initialization of the OpenSHMEM library by other client libraries to include a note to developers?

jdinan commented 3 years ago

Multithreaded init/finalize -- we can make this work by putting an init/finalize mutex into the library and only allowing one thread to enter the routine. But, unless there is a use case driving it, I'd rather not distract implementors with this.

We should clearly specify that the threading level returned by a call to shmem_init_thread must be greater than or equal to any threading level previously returned.
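
One simple way to satisfy the "never lower than previously returned" rule is to fix the level at the first call, which also matches item 4 of the original proposal. The sketch below assumes exactly that and nothing more: enum level, the LVL_* constants and sketch_init_thread are hypothetical names that only mirror the ordering of the SHMEM_THREAD_* levels, the sketch assumes the runtime can grant whatever is requested first, and the init/finalize mutex mentioned above is not modeled.

```c
#include <assert.h>

/* Hypothetical level constants, ordered like the SHMEM_THREAD_* levels. */
enum level { LVL_SINGLE, LVL_FUNNELED, LVL_SERIALIZED, LVL_MULTIPLE };

static int        initialized = 0;
static enum level current_level = LVL_SINGLE;

/* Models shmem_init_thread(requested, &provided); returns 0 on success. */
int sketch_init_thread(enum level requested, enum level *provided)
{
    if (!initialized) {
        current_level = requested;   /* assumes the runtime can grant it */
        initialized = 1;
    }
    /* Later calls report the level fixed by the first call, so the value
     * returned is never lower than any value returned earlier. */
    *provided = current_level;
    return 0;
}
```

A later call that requests a lower level still succeeds in this sketch; it simply reports the level established by the first call.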

minsii commented 3 years ago

We should clearly specify that the threading level returned by a call to shmem_init_thread must be greater than or equal to any threading level previously returned.

Is there a use case that may need increased thread level with multiple init calls?

jdinan commented 8 months ago

Slides from today: Multiple Init_Finalize.pdf

jdinan commented 8 months ago

Need to clarify that init/finalize must be called by all PEs. That is, in a nested usage case, you can't call init on a subset of the PEs.