Requirement: fetching atomic operations in DO CONCURRENT

jeffhammond commented 2 years ago

This is my attempt to propose that we support fetching atomic operations in DO CONCURRENT.

Motivation

Atomic operations are an integral part of concurrent algorithms. Currently, Fortran has atomic operations for coarray data, but not DO CONCURRENT regions. The only legal way to access data concurrently within a DO CONCURRENT region is to use the REDUCE locality specifier, but the effect of this is only visible at the end, which means it is impossible to implement classes of algorithms where the results of atomic operations must be known during a DO CONCURRENT region.

One common pattern where fetching atomic operations are used is in the parallel construction of a list. This pattern appears in particle codes, where a parallel agent needs to acquire a unique offset into an array in order to write its data.

INTEGER, INTENT(IN) :: maxsize, num, maxtempsize
INTEGER :: offset, i, mysize
REAL :: array(:), temp(:)
ALLOCATE( array(maxsize) , temp(maxtempsize) )
DO CONCURRENT i=1:num
    CALL determine(temp, mysize)
    offset = ATOMIC_FETCH_ADD( offset, mysize )
    array(offset:offset+mysize) = temp(1:mysize)
END DO

There is currently no mechanism to write such code in Fortran today, except as a distributed memory algorithm using coarrays.

Proposed Solutions

There are at least three possible options for solving this problem:

Relax the restrictions on the existing intrinsic procedure, ATOMIC_FETCH_ADD to allow it to operate on any memory, not just coarray memory.
Add a locality specifier that ensures accesses are atomic within a given DO CONCURRENT region.
Add a type specifier that prescribes that all accesses to a given variable be atomic.

There are positive and negative aspects to each of these options.

The current use cases and implementations of ATOMIC_FETCH_ADD assume distributed memory is supported and thus may have higher overhead than is appropriate for shared-memory. Furthermore, ensuring that ATOMIC_FETCH_ADD works for both distributed memory and shared memory may slow down the distributed memory implementation.

Using a locality specifier is good in that it means the implementation only needs to generate atomic operations in the scope of DO CONCURRENT, which may be more efficient than if done across the lifetime of a variable. On the other hand, it may limit our ability to add other forms of shared-memory concurrency in the future.

A type specifier is consistent with the existing art in C11 and C++11 atomics, which means it the implementation is well understood. If limited to certain native types, efficient implementations are straightforward. However, if allowed on arbitrary types, the implementation may be required to do more complicated things, such as lock tables, which may not be desirable.

Prior Art

This feature is supported by a wide range of hardware and software, including, but not limited to:

Fortran coarrays
MPI-3 RMA
OpenSHMEM
OpenMP
OpenACC
C++11
C11
Widely used hardware:
- x86_64 (via the lock prefix)
- ARMv7 (via load-link-store-conditional)
- ARMv8.1 (natively)
- PowerPC (via load-link-store-conditional)
- IBM Power 9 / PowerPC 3.0 (natively)
- RISC-V (natively)
- CUDA
- PCIe 3.0
- InfiniBand

We note that proposed solution 1 is equivalent to Fortran coarrays and similar to MPI-3 and OpenSHMEM. Proposed solution 2 is somewhat similar to OpenMP and OpenACC, but more restrictive than either. Proposed solution 3 is very similar to C11 and C++11.

rouson commented 2 years ago

@jeffhammond would it make sense to also allow event_type to be a non-coarray? I haven't used Fortran's atomic subroutines, but I have frequently used event_type, which is an intrinsic derived type that contains a private atomic integer. Currently, an event_type must be a coarray.

klausler commented 2 years ago

As is well known, DO CONCURRENT is not a concurrent construct; it is instead a sequential construct whose iterations are unordered. This unfortunate fact has implications for atomics.

Assume that some forms of fetching atomic subroutines become allowed in a DO CONCURRENT. The semantics of DO CONCURRENT execution require that each iteration execute completely in some unspecified order. Consequently, if atomics are inserted into the standard in the obvious way, a program can (or must) assume that two or more atomic operations in the same iteration will execute without being interleaved with any atomic operations to the same variable(s) from other iterations.

We obviously want to allow for true parallel execution of the iterations of a DO CONCURRENT construct. So take care when defining the semantics of atomics in this context so that either the language describing the sequential execution of iterations is fixed, or that some disclaimer is added so that a program has to allow for interleaving of atomic operations from distinct iterations.

rouson commented 2 years ago

Is it also time to consider a new do parallel or some such that makes a third pass (counting forall as the first pass) at getting loop-level parallelism right? Might it be the case that implementors of such a new feature would be able to borrow enough from their work on supporting do concurrent that it wouldn't be costly to support the new feature? In fact, it might be simpler to support because it wouldn't have the hairy pitfalls of do concurrent.

jeffhammond commented 2 years ago

Yes. Either we fix CONCURRENT per @klausler or we add PARALLEL (for example) that's actually parallel in the sense we need it. We could also try to capture what C++ does with par and par_unseq if we're feeling bold.

rouson commented 2 years ago

I imagine that this proposal will motivate us to remove the requirement that the atom argument of atomic subroutines "shall be a scalar coarray or coindexed object." This requirement appears in the definition of every atomic subroutine. The proposed use of atomic operations in do concurrent would make this requirement superfluous because the proposal would make atomics useful in even in single-image execution.

certik commented 2 years ago

@klausler has submitted a paper to fix do concurrent but it didn't go anywhere. We'll have to renew the effort.

wyphan commented 2 years ago

Agreed, especially since now I'm starting development for additional DO CONCURRENT support in GFortran, this issue has risen in importance for me to understand fully.

On Fri, Jul 8, 2022, 13:53 Ondřej Čertík @.***> wrote:

@klausler https://github.com/klausler has submitted a paper to fix do concurrent but it didn't go anywhere. We'll have to renew the effort.

— Reply to this email directly, view it on GitHub https://github.com/j3-fortran/fortran_proposals/issues/270#issuecomment-1179234162, or unsubscribe https://github.com/notifications/unsubscribe-auth/AMERY5H4TXCL4FLCEASL3BLVTBTJDANCNFSM52SNBLZA . You are receiving this because you are subscribed to this thread.Message ID: @.***>

klausler commented 2 years ago

J3 is never going to acknowledge their bungling of DO CONCURRENT now, much less fix it. Don't make atomics depend on DO CONCURRENT being fixed first.

My point above was to raise the problem up front of the difficulty you will have defining the semantics of atomics in the context of the current non-concurrent DO CONCURRENT construct for standardization purposes in a way that is internally consistent without accidentally imposing unintentional weird restrictions or semantics on the obvious straightforward implementations of atomics. In other words, the definition of atomics has to be right when read in two distinct contexts: the real world of parallel execution, and standard's odd world in which DO CONCURRENT is an unordered sequential loop construct.

j3-fortran / fortran_proposals