Open jeffhammond opened 2 years ago
@jeffhammond Searching the issues on this repository on words like "task" and "asynchronous" doesn't seem to turn up anything similar to this proposal, which is pretty surprising. I would rank some form of asynchronous task parallelism as a very close second to generic programming in terms of its importance to the future of Fortran so I hope something like this makes it into a future standard -- hopefully 202Y. I recommend reviewing the "subimage" idea of Rasmussen et al. (2016), which also aimed to support task parallelism. Craig Rasmussen presented a tutorial on subimages to WG5 in Boulder in 2016. If you haven't already discussed this with Craig or others involved in that discussion, I recommend doing so in order to understand the issues discussed and the reasons that proposal didn't move forward.
In line with other Fortran block constructs, it would be nice if tasks were named:
```fortran
foo: task
  ra = yksi(a)
end task foo

bar: task depend(foo)   ! or, depend(in: ra)
  rb = nalja(ra)
end task bar
```
I'm not familiar enough with tasks to argue whether the dependencies should be specified among task names or, as in OpenMP, based upon the shared variables. I suppose with named tasks you could use the names, but with un-named tasks you could use variables (memory locations)? In any case, I believe that `in` (read), `out` (write), and `inout` (read-write) are keywords that are familiar to Fortran programmers.
Would (named) tasks allow an early `exit` statement, or exiting from a block of nested tasks?
@ivan-pi Thanks for the good suggestion. I agree that task blocks should be nameable.
As for dependencies, I am reluctant to go for that in the first round. OpenMP tasking started without dependencies, and when dependencies based on memory locations were added, the feature became significantly harder to implement. It's also inconsistent with asynchronous offload, which follows a queueing model rather than a full-service dependency graph. The OpenACC `async` model is a direct map onto CUDA streams and OpenCL queues, for example.
We need a way to run things asynchronously within an address space. There are lots of use cases for this, one of which is running `DO CONCURRENT` regions on GPUs.

Summary
Starting in Fortran 2008, Fortran supports two forms of parallelism:

1. coarrays (images), which support SPMD parallelism across address spaces, and
2. `DO CONCURRENT`, which supports loop-level data parallelism.

This document will describe a third form of parallelism and argue that it should be supported by the Fortran language. The third form of parallelism is shared-memory task parallelism, which supports a range of use cases not easily covered by 1 and 2.
Background Reading
The reader may wish to consult the following for additional context on this topic:
Motivating Example
Consider the following Fortran program:
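The program itself did not survive formatting; a minimal reconstruction, consistent with the function and variable names used throughout this proposal (argument lists are assumed), might look like:

```fortran
program main
  implicit none
  real :: a, b, c, ra, rb, rc
  real, external :: yksi, kaksi, kolme  ! external functions, defined elsewhere
  a = 1 ; b = 2 ; c = 3
  ! the three calls below are independent and could run concurrently
  ra = yksi(a)
  rb = kaksi(b)
  rc = kolme(c)
  print *, ra, rb, rc
end program main
```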
Assuming that `yksi`, `kaksi`, and `kolme` share no state, all three functions can execute concurrently.

How would we implement this in Fortran 2018?
One way is to use coarrays and assign each function to a different image.
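A sketch of that approach (illustrative only; it assumes at least three images and collects the results on image 1 through a coarray):

```fortran
program main
  implicit none
  real :: a, b, c
  real :: r[*]   ! coarray used to collect results across images
  real, external :: yksi, kaksi, kolme
  a = 1 ; b = 2 ; c = 3
  if (num_images() < 3) error stop "need at least 3 images"
  ! each function is pinned to one image by hand
  select case (this_image())
  case (1); r = yksi(a)
  case (2); r = kaksi(b)
  case (3); r = kolme(c)
  end select
  sync all
  if (this_image() == 1) print *, r[1], r[2], r[3]
end program main
```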
While this works, this approach has many shortcomings. First, there is no way to share data directly between images - data must be explicitly copied using coarray operations. Second, images exist throughout the lifetime of the program (unless they fail) and thus the amount of parallelism is restricted to what is specified at runtime. Third, if there are many functions that can execute concurrently, many more than the number of images (which are likely to be processor cores or similar), then either the system will be oversubscribed or the user needs to implement scheduling by hand. Dynamic load-balancing is nontrivial and should not be delegated to application programmers in most cases.
Another way to implement this program is to use `DO CONCURRENT`:
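The elided example presumably dispatched on the loop index, along these lines (a sketch; names as before):

```fortran
real :: a, b, c, r(3)
integer :: i
! abuse a data-parallel loop to express three unrelated tasks
do concurrent (i = 1:3)
  select case (i)
  case (1); r(1) = yksi(a)
  case (2); r(2) = kaksi(b)
  case (3); r(3) = kolme(c)
  end select
end do
```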
This could work if the external functions are declared `PURE`, but `DO CONCURRENT` provides no means for dynamic load-balancing. The bigger problem is that Fortran implementations cannot agree on what form of parallelism `DO CONCURRENT` uses. Some implementations will use threads while others will use vector lanes; the latter is going to be useless for most purposes. Finally, the above is ugly and tedious - no one wants to write code like that to execute independent tasks.

The OpenMP/OpenACC Solution
There is a proven solution for Fortran task parallelism in OpenMP (4.0 or later) or OpenACC:
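The example was lost in formatting; an OpenMP spelling of the motivating program would look roughly like this (the `single` region ensures each task is created exactly once, and the implicit barrier at `end single` waits for task completion):

```fortran
!$omp parallel
!$omp single
!$omp task
ra = yksi(a)
!$omp end task
!$omp task
rb = kaksi(b)
!$omp end task
!$omp task
rc = kolme(c)
!$omp end task
!$omp end single
!$omp end parallel
```

An OpenACC version would enqueue the three calls on different `async` queues and then `wait`.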
These programs will execute regardless of the available hardware parallelism, including sequentially. OpenMP tasking is more powerful than OpenACC in some use cases because it allows the user to create dependencies between tasks, which forces the runtime to do more work when scheduling. This feature - tasks with dependencies - is not part of the core proposal here (see Dependencies below).
The Proposal for Fortran
Because OpenMP independent tasks are implemented in essentially all of the Fortran 2008 compilers, it is reasonable to assume that the design is portable. The goal here is to design a language feature for Fortran that is consistent with the language's existing semantics and syntax.
We consider the `BLOCK` construct to be an appropriate starting point, because it defines a scope, and scoping data is an essential part of defining task parallelism. Because we need more than just data scoping, we use the keyword `task_block`
to tell the implementation that execution concurrency is both permitted and desirable.

Non-trivial data issues
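As a baseline, the construct with no data-sharing annotations might be spelled as follows (the keywords `task_block` and `task` come from this proposal; everything else is illustrative). Note that even here the result variables `ra`, `rb`, `rc` live in the enclosing scope, which is exactly the kind of data question this section is about:

```fortran
task_block
  task
    ra = yksi(a)
  end task
  task
    rb = kaksi(b)
  end task
  task
    rc = kolme(c)
  end task
end task_block   ! implicit wait: all tasks complete here
```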
Obviously, very few programs can exploit concurrency where all data is strictly private. In `DO CONCURRENT`, locality specifiers are used to inform the implementation about whether data is shared, private, and so on.

Below we modify our program as if each function used a private scratch buffer. This is not the best way to allocate X, since X could be defined inside of the `task_block` scope or inside of the external function, but this is just an illustration of the syntax. We also add T, which could be a read-only lookup table, for example.
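A sketch of that modified program, borrowing the `DO CONCURRENT` locality-specifier spelling (the clause syntax on `task_block` is assumed, not settled):

```fortran
integer, parameter :: n = 100, m = 10
real :: x(n)   ! per-task scratch buffer, private to each task
real :: t(m)   ! read-only lookup table, shared by all tasks
task_block local(x) shared(t, ra, rb, rc)
  task
    ra = yksi(a, x, t)
  end task
  task
    rb = kaksi(b, x, t)
  end task
  task
    rc = kolme(c, x, t)
  end task
end task_block
```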
Much like `DO CONCURRENT`, we should be able to write a fully explicit version using `default(none)`. It might make sense to have a new locality specifier, `local_final`, but since there might have been a reason why that was not added for `DO CONCURRENT`, we use the `shared` specifier for the result of this function.

Dependencies
Many applications where task parallelism will be used have dependencies between tasks. For example, in our program, we can add a fourth function, `nalja`, that depends on `yksi` and `kaksi`.

This syntax may not be ideal but it expresses the concept. In OpenMP, dependencies are expressed in the form of memory locations. Because this might be harder to implement in some scenarios, we propose an explicit opaque type that the implementation can use.
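The elided example might have been spelled along these lines, using named tasks as the dependency handles (the `depend` clause and names are illustrative only):

```fortran
task_block
  t1: task
    ra = yksi(a)
  end task t1
  t2: task
    rb = kaksi(b)
  end task t2
  t3: task depend(t1, t2)   ! nalja must wait for yksi and kaksi
    rc = nalja(ra, rb)
  end task t3
end task_block
```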
Known Shortcomings
Fortran lacks a memory model in the way that Java, C11 and C++11 do. We do not take a position on whether that is a good or bad thing, but instead attempt to make the fewest changes required to address hazards of concurrent data access by tasks.
One obvious solution for tasks is to reuse the coarray atomic operations, although this may not be acceptable to the committee. However, requiring that tasks use atomic operations to access data that may be modified by another task is a straightforward solution to these hazards. Unfortunately, the overhead of coarray atomics may be higher than acceptable for shared-memory uses, in which case a new syntax is required.
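For illustration, reusing the existing atomic subroutines inside tasks might look like the following sketch. Note that today `ATOMIC_ADD` requires its first argument to be a coarray (or coindexed object) of kind `ATOMIC_INT_KIND`, which is precisely the shared-memory mismatch described above:

```fortran
use iso_fortran_env, only: atomic_int_kind
integer(atomic_int_kind) :: counter[*]   ! coarray only because current atomics require one
counter = 0
task_block
  task
    call atomic_add(counter, 1)   ! safe concurrent update from either task
  end task
  task
    call atomic_add(counter, 1)
  end task
end task_block
```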
Acknowledgements
Thanks to the following people, who read this proposal or related material and may have provided feedback: