j3-fortran / fortran_proposals

Proposals for the Fortran Standard Committee

asynchronous task parallelism #271

Open jeffhammond opened 2 years ago

jeffhammond commented 2 years ago

We need a way to run things asynchronously within an address space. There are lots of use cases for this, one of which is running DO CONCURRENT regions on GPUs.

Summary

Starting in Fortran 2008, Fortran supports two forms of parallelism:

  1. DO CONCURRENT, which supports loop-level data parallelism.
  2. coarrays, which are a form of PGAS (Partitioned Global Address Space) parallelism.

This document will describe a third form of parallelism and argue that it should be supported by the Fortran language. The third form of parallelism is shared-memory task parallelism, which supports a range of use cases not easily covered by 1 and 2.

Background Reading

The reader may wish to consult the following for additional context on this topic:

Motivating Example

Consider the following Fortran program:

module numerot
  contains
    pure real function yksi(X)
      implicit none
      real, intent(in) :: X(100)
      yksi = norm2(X)
    end function yksi
    pure real function kaksi(X)
      implicit none
      real, intent(in) :: X(100)
      kaksi = 2*norm2(X)
    end function kaksi
    pure real function kolme(X)
      implicit none
      real, intent(in) :: X(100)
      kolme = 3*norm2(X)
    end function kolme
end module numerot
program main
  use numerot
  implicit none
  real :: A(100), B(100), C(100)
  real :: RA, RB, RC

  A = 1
  B = 1
  C = 1

  RA = yksi(A)
  RB = kaksi(B)
  RC = kolme(C)

  print*,RA+RB+RC
end program main

Assuming that yksi, kaksi, and kolme share no state, all three functions can execute concurrently.

How would we implement this in Fortran 2018?

One way is to use coarrays and assign each function to a different image.

program main
  use numerot
  implicit none
  real :: A(100), B(100), C(100)
  real :: R

  A = 1
  B = 1
  C = 1

  if (num_images().ne.3) STOP

  if (this_image().eq.1) R = yksi(A)
  if (this_image().eq.2) R = kaksi(B)
  if (this_image().eq.3) R = kolme(C)

  SYNC ALL

  call co_sum(R)
  if (this_image().eq.1) print*,R
end program main

While this works, the approach has several shortcomings. First, there is no way to share data directly between images - data must be explicitly copied using coarray operations. Second, images exist for the lifetime of the program (unless they fail), so the amount of parallelism is fixed at what was specified at program launch. Third, if there are many more concurrently executable functions than images (which likely correspond to processor cores or similar), then either the system will be oversubscribed or the user must implement scheduling by hand. Dynamic load balancing is nontrivial and should not be delegated to application programmers in most cases.

Another way to implement this program is to use DO CONCURRENT:

program main
  use numerot
  implicit none
  real :: A(100), B(100), C(100)
  real :: RA, RB, RC
  integer :: k

  A = 1
  B = 1
  C = 1

  do concurrent (k=1:3)

    if (k.eq.1) RA = yksi(A)
    if (k.eq.2) RB = kaksi(B)
    if (k.eq.3) RC = kolme(C)

  end do

  print*,RA+RB+RC
end program main

This works only because the functions are declared PURE, and DO CONCURRENT provides no means for dynamic load balancing. The bigger problem is that Fortran implementations do not agree on what form of parallelism DO CONCURRENT maps to: some implementations use threads, while others use vector lanes, and the latter is useless for task parallelism. Finally, the above is ugly and tedious - no one wants to write code like that to execute independent tasks.

The OpenMP/OpenACC Solution

There is a proven solution for Fortran task parallelism in OpenMP (4.0 or later) or OpenACC:

program main
  use numerot
  implicit none
  real :: A(100), B(100), C(100)
  real :: RA, RB, RC

  A = 1
  B = 1
  C = 1

  !$omp parallel
  !$omp master

  !$omp task
  RA = yksi(A)
  !$omp end task

  !$omp task
  RB = kaksi(B)
  !$omp end task

  !$omp task
  RC = kolme(C)
  !$omp end task

  !$omp end master
  !$omp end parallel

  print*,RA+RB+RC
end program main

The OpenACC analogue uses asynchronous regions:
program main
  use numerot
  implicit none
  real :: A(100), B(100), C(100)
  real :: RA, RB, RC

  A = 1
  B = 1
  C = 1

  !$acc serial async(1)
  RA = yksi(A)
  !$acc end serial

  !$acc serial async(2)
  RB = kaksi(B)
  !$acc end serial

  !$acc serial async(3)
  RC = kolme(C)
  !$acc end serial

  !$acc wait

  print*,RA+RB+RC
end program main

These programs will execute regardless of the available hardware parallelism, including sequentially. OpenMP tasking is more powerful than OpenACC in some use cases because it allows the user to create dependencies between tasks, at the cost of more work for the runtime scheduler. That feature - dependencies expressed via memory locations - is not proposed for Fortran here; a simpler alternative is discussed in the Dependencies section below.

The Proposal for Fortran

Because OpenMP independent tasks are implemented in essentially all Fortran 2008 compilers, it is reasonable to assume that the design is portable. The goal here is to design a language feature for Fortran that is consistent with its existing semantics and syntax.

We consider the BLOCK construct to be an appropriate starting point, because it defines a scope, and scoping data is an essential part of defining task parallelism. Because we need more than just data scoping, we use the keyword task_block to tell the implementation that execution concurrency is both permitted and desirable.

program main
  use numerot
  implicit none
  real :: A(100), B(100), C(100)
  real :: RA, RB, RC

  A = 1
  B = 1
  C = 1

  task_block
  RA = yksi(A)
  end task_block

  task_block
  RB = kaksi(B)
  end task_block

  task_block
  RC = kolme(C)
  end task_block

  task_sync all

  print*,RA+RB+RC
end program main

Non-trivial data issues

Obviously, very few programs can exploit concurrency where all data is strictly private. In DO CONCURRENT, locality specifiers inform the implementation whether data is shared across iterations, iteration-local, and so on.
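
As a reminder of the existing Fortran 2018 syntax, here is a minimal sketch (illustrative names) in which each iteration gets a private scratch buffer X while the lookup table T and the result array S are shared:

program locality_demo
  implicit none
  real :: T(1000)   ! read-only lookup table, shared across iterations
  real :: X(10)     ! per-iteration scratch buffer
  real :: S(3)
  integer :: k

  call random_number(T)

  do concurrent (k=1:3) local(X) shared(T,S)
    X = T(1:10) * k
    S(k) = sum(X)
  end do

  print*,S
end program locality_demo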

Below we modify our program as if each function used a private scratch buffer. This is not the best way to allocate X, since X could instead be declared inside the task_block scope or inside the function itself, but it illustrates the syntax. We also add T, which could be a read-only lookup table, for example.

program main
  use numerot
  implicit none
  real :: A(100), B(100), C(100)
  real :: RA, RB, RC
  real :: X(10)
  real :: T(1000)

  A = 1
  B = 1
  C = 1

  task_block local(X) shared(T)
  RA = yksi(A,X)
  end task_block

  task_block local(X) shared(T)
  RB = kaksi(B,X)
  end task_block

  task_block local(X) shared(T)
  RC = kolme(C,X)
  end task_block

  task_sync all

  print*,RA+RB+RC
end program main

Much like DO CONCURRENT, we should be able to write a fully explicit version using default(none).

program main
  use numerot
  implicit none
  real :: A(100), B(100), C(100)
  real :: RA, RB, RC
  real :: X(10)
  real :: T(1000)

  A = 1
  B = 1
  C = 1

  task_block default(none) local_init(A) shared(RA) local(X) shared(T)
  RA = yksi(A,X)
  end task_block

  task_block default(none) local_init(B) shared(RB) local(X) shared(T)
  RB = kaksi(B,X)
  end task_block

  task_block default(none) local_init(C) shared(RC) local(X) shared(T)
  RC = kolme(C,X)
  end task_block

  task_sync all

  print*,RA+RB+RC
end program main

It might make sense to add a new locality specifier, local_final, but since there may have been a reason it was not added for DO CONCURRENT, we instead use the shared specifier for each function's result.

Dependencies

Many applications of task parallelism have dependencies between tasks. For example, we can add a fourth function, nalja, whose input depends on the results of yksi and kaksi.

program main
  use iso_fortran_env, only : task_depend_kind
  use numerot
  implicit none
  real :: A(100), B(100), C(100)
  real :: RA, RB, RC, RD
  type(task_depend_kind) :: DEP

  A = 1
  B = 1
  C = 1

  task_block depend_to(DEP)
  RA = yksi(A)
  end task_block

  task_block depend_to(DEP)
  RB = kaksi(B)
  end task_block

  task_block
  RC = kolme(C)
  end task_block

  task_block depend_from(DEP)
  RD = nalja(RA,RB)
  end task_block

  task_sync all

  print*,RC+RD
end program main

This syntax may not be ideal, but it expresses the concept. In OpenMP, dependencies are expressed in the form of memory locations. Because that might be harder to implement in some scenarios, we instead propose an explicit opaque type that the implementation can use.
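
For comparison, the memory-location form in OpenMP looks roughly like the following fragment, where the runtime matches depend(out: ...) producers against depend(in: ...) consumers on the same variables (nalja is the hypothetical fourth function from above, and the fragment assumes an enclosing parallel/master region):

  !$omp task depend(out: RA)
  RA = yksi(A)
  !$omp end task

  !$omp task depend(out: RB)
  RB = kaksi(B)
  !$omp end task

  !$omp task depend(in: RA, RB)
  RD = nalja(RA,RB)
  !$omp end task

  !$omp taskwait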

Known Shortcomings

Fortran lacks a memory model of the kind that Java, C11, and C++11 have. We do not take a position on whether that is a good or bad thing, but instead attempt to make the fewest changes required to address the hazards of concurrent data access by tasks.

One obvious approach is to reuse the coarray atomic operations, although this may not be acceptable to the committee. Requiring tasks to use atomic operations when accessing data that may be modified by another task is a straightforward way to address these hazards. Unfortunately, the overhead of coarray atomics may be too high for shared-memory use, in which case new syntax is required.
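
As an illustration, the existing atomic subroutines could in principle be applied to task-shared data; a sketch, assuming their semantics were extended to cover tasks within a single image:

program atomic_sketch
  use iso_fortran_env, only : atomic_int_kind
  implicit none
  integer(atomic_int_kind) :: counter[*]   ! current coarray atomics require a coarray
  integer(atomic_int_kind) :: snapshot

  call atomic_define(counter, 0)
  ! a task incrementing shared state would use atomic_add
  ! (Fortran 2018) rather than ordinary assignment
  call atomic_add(counter, 1)
  call atomic_ref(snapshot, counter)
  print*,snapshot
end program atomic_sketch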

Acknowledgements

Thanks to the following people, who read this proposal or related material and may have provided feedback:

rouson commented 1 year ago

@jeffhammond Searching the issues on this repository on words like "task" and "asynchronous" doesn't seem to turn up anything similar to this proposal, which is pretty surprising. I would rank some form of asynchronous task parallelism as a very close second to generic programming in terms of its importance to the future of Fortran so I hope something like this makes it into a future standard -- hopefully 202Y. I recommend reviewing the "subimage" idea of Rasmussen et al. (2016), which also aimed to support task parallelism. Craig Rasmussen presented a tutorial on subimages to WG5 in Boulder in 2016. If you haven't already discussed this with Craig or others involved in that discussion, I recommend doing so in order to understand the issues discussed and the reasons that proposal didn't move forward.

ivan-pi commented 1 year ago

In line with other Fortran block constructs, it would be nice if tasks were named:

foo: task
  ra = yksi(a)
end task foo

bar: task depend(foo)  ! or, depend(in: ra)
  rb = nalja(ra)
end task bar

I'm not familiar enough with tasks to argue if the dependencies should be specified among task names, or like in OpenMP based upon the shared variables. I suppose with named tasks you could use the names, but with un-named tasks you could use variables (memory locations)? In any case I believe that in (read), out (write), and inout (read-write) are keywords that are familiar to Fortran programmers.

Would (named) tasks allow having an early exit statement or exiting from a block of nested tasks?

jeffhammond commented 1 year ago

@ivan-pi Thanks for the good suggestion. I agree that task blocks should be nameable.

As for dependencies, I am reluctant to go for that in the first round. OpenMP tasking started without dependencies, and when they added dependencies based on memory locations, it became significantly harder to implement. It's also inconsistent with asynchronous offload, which follows a queueing model rather than a full-service dependency graph. The OpenACC async model is a direct map onto CUDA streams and OpenCL queues, for example.