Form an asynchrony subgroup

rouson commented 1 year ago

Several recent ideas address the need for programmer-managed, single-image asynchrony -- namely #270, #271, and #272. I find these very compelling -- maybe even essential for the long-term survival of Fortran -- so I wonder if interested parties should join forces. Might asynchrony warrant attention by a dedicated subgroup working through the proposals over an extended period of time? As one example of a common theme, I think that issues #270 and #272 will require similar changes to the standard: #270 will require that the atom argument in atomic subroutines be allowed to be a noncoarray and #272 will require that objects of event_type be allowed to be noncoarrays.
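For readers less familiar with the restriction, here is a minimal sketch of what Fortran 2018 requires today. It is only an illustration, not code from any of the cited issues; the program and variable names (`coarray_required_today`, `counter`, `ready`, `seen`) are made up for this example. Both the atomic variable and the event variable must be declared as coarrays, even though the program only ever touches them on the local image.

```fortran
program coarray_required_today
  use, intrinsic :: iso_fortran_env, only: atomic_int_kind, event_type
  implicit none

  ! Fortran 2018 requires both objects to be coarrays (note the [*]),
  ! even when only the local image ever references them.
  integer(atomic_int_kind) :: counter[*]
  type(event_type)         :: ready[*]

  integer(atomic_int_kind) :: seen
  integer :: i

  call atomic_define(counter, 0)

  ! An ordinary DO loop is used here; the proposals discuss relaxing
  ! the rules so that similar updates could be made from DO CONCURRENT.
  do i = 1, 10
    call atomic_add(counter, 1)
  end do
  call atomic_ref(seen, counter)

  event post (ready)   ! post to this image's own event variable
  event wait (ready)   ! returns at once because the count is already 1

  if (seen /= 10) error stop "unexpected counter value"
  print *, "counter =", seen
end program coarray_required_today
```

With gfortran this should build as a single-image program using `-fcoarray=single`. The point of the sketch is that the `[*]` codimensions carry no meaning in this use, which is the kind of requirement the cited issues would relax.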

It seems to me that these ideas are a fundamental shift to the semantics and runtime behavior of Fortran as big as adding coarrays was and I think that feature set was designed by a subgroup (HPC). I also wonder if something so significant might need to get started one or two standards ahead of adoption in order to address all the related issues. I'm thinking about how the generics subgroup started working on a Fortran 202Y feature while the rest of the committee was focused on 202X. If the analogy is appropriate, then single-image asynchrony might be a 202Z feature, in which case part of the purpose of my proposal is that I'd like to ensure that the interested parties don't lose steam if it doesn't win support right away. This seems like a dream worth seeing through when the inevitable hurdles arise.

wyphan commented 1 year ago

I'm interested. I think OpenMP task parallelism, introduced in the OpenMP 3.0 specification, has been implemented pretty nicely for this concept. It would be awesome if someone could adapt it for the Fortran standard! Of course, it's not the only asynchronous model out there, so I think we have more options.

mjklemm commented 1 year ago

I'd be interested, too. I can bring a bit of OpenMP expertise into the group.

jefflarkin commented 1 year ago

I think this is an important topic and having a dedicated subgroup is a good way to have focused discussions and potentially standardize on solutions.

MichaelSiehl commented 1 year ago

Open your eyes, what you are asking for is already there in the most sophisticated ways. Coarray Fortran as a parallel programming language can be adapted (through customization) to any parallel programming paradigm that we need. This is extremely important to understand as we are already entering the era of spatial (data-flow) accelerators, with an emerging Intel ifx Fortran spatial compiler, and with a Fortran programming language that we can already use to deliver solutions (partly or completely) to any issue with spatial programming and compilation: https://domiyanyue.medium.com/challenge-of-spatial-accelerator-281fc327e665 (My preferred tool for this is still OpenCoarrays/gfortran, as ifort still has more bugs. But my codes work exactly the same with both compilers.)

A common trend in ordinary sequential programming languages for tackling this is to extend them with (asynchronous) coroutines (and channels) to give some sort of simple parallel programming functionality to the sequential programmer. A similar proposal was also made for Fortran some time ago: https://j3-fortran.org/doc/year/19/19-169.pdf . The focus of this was obviously on sequential Fortran (2003), without the coarray run-time in mind.

Coarray Fortran, on the other hand, is a parallel programming language with a dedicated runtime. To implement asynchronous code execution on the images with Coarray Fortran we can start with a more advanced construct to feed all images of a coarray team at the SPMD level. I am using composed coroutines (consisting of two or more coroutines) implemented in a single module procedure (or more if desired) to feed all images (at the SPMD level) not only with simple tasks but with multiple asynchronously executing task pools instead (with fault-tolerant code execution). I can already execute multiple such task pools on all coarray images of a coarray team simultaneously and asynchronously. Fortran’s rich syntax naturally provides everything we need, with two exceptions. First, we must implement a required non-blocking synchronization method ourselves through customization. I am still working on a new type of channel system, based on coarrays. A prototype channel already works with very high performance. Second, we don’t have low-level access to notify the run-time about failed images that we may detect through a customized process: https://github.com/j3-fortran/fortran_proposals/issues/259

Spatial accelerators promise (much) higher performance with (much!) lower energy consumption (even for general purpose workloads) through extended data reuse (that is temporal data reuse extended by spatial data reuse as well as spatio-temporal data reuse). Coarray Fortran is already perfectly tailored to support that new era of green computing at the programming level: If upcoming Fortran spatial compilers will adopt the APGAS model https://www.cs.rochester.edu/u/cding/amp/papers/full/The%20Asynchronous%20Partitioned%20Global%20Address%20Space%20Model.pdf (at the level of the already existing coarray team syntax), we will be able to easily feed even a setup of multiple distinctly configured (heterogeneous) spatial accelerators with standard Fortran syntax.

All this may sound very complicated, but the syntax for this can be kept very simple so as to attract even novice (parallel) programmers to new levels of green programming. A first example of a composed coroutine in a single module procedure from my codes is shown below (multiple such module procedures execute on all coarray images of a team simultaneously and asynchronously, because the channels synchronize without blocking):

submodule (OOOPfrob03_01_cls_FM3_CE1_SPMD) OOOPfrob03_01_sub_FM3_CE1_SM2
! using Channel2 for the control and execute coroutines herein,
! and Channel4 for the send to OOOPfrob03_01_sub_FM3_CE1_SM3
implicit none
!
contains
!___________________________________________________________
!
module procedure frob03_01_FM3_CE1_SM2
!----------------------------------------------------------------------------------------
! (1) subtask1 block:
subtask1_if: if (i_Channel2Status == enum_Channel2Status_CE1 % subtask1) then
subtask1_block: block
  integer(glob_kint) :: i_TestValue
  real(glob_krea) :: r_TestValue
  integer(glob_kint), dimension(1:3) :: ia1_TestArray

  subtask1_select: select case (i_ImageType)
  !========================================================
  ! control coroutine:
  case (enum_ImageType % ControlImage) ! on the control image

    control_coroutine_subtask1: block
      i_TestValue = 22
      r_TestValue = 2.222
      ia1_TestArray = (/22,222,22222/)
      call chnl_Channel2 % fill (i_val = i_TestValue, r_val = r_TestValue, &
                                 ia1_val = ia1_TestArray)
      call chnl_Channel2 % send (i_chstat = i_Channel2Status)
      ! this image is ready for the next task:
      i_Channel2Status = enum_Channel2Status_CE1 % subtask2

      !*** sending to frob03_01_FM3_CE1_SM3 using Channel4:
      ! (we can use multiple channels in the same block for sending)
      if (i_Channel4Status == enum_Channel4Status_CE1 % subtask1) then
        r_TestValue = 4.444
        call chnl_Channel4 % fill (r_val = r_TestValue)
        call chnl_Channel4 % send (i_chstat = i_Channel4Status)
        ! this image is ready for the next task for Channel4:
        i_Channel4Status = enum_Channel4Status_CE1 % e_xit
      end if

    end block control_coroutine_subtask1
  !========================================================
  ! execute coroutine:
  case (enum_ImageType % ExecuteImage) ! on the execute images

    execute_coroutine_subtask1: block
      integer(glob_kint), dimension (1:1) :: ia1_ScalarInteger
      real(glob_krea), dimension (1:1) :: ra1_ScalarReal
      integer(glob_kint), dimension(1:3, 1:1) :: ia2_IntegerArray1D
      ! always use only a single channel within a block with IsReceive !
      ! (otherwise the data transfer through a channel won't synchronize successfully) !
      if (chnl_Channel2 % IsReceive (i_chstat = i_Channel2Status)) then
        call chnl_Channel2 % get (ia1_ScalarInteger = ia1_ScalarInteger, &
                                  ra1_ScalarReal = ra1_ScalarReal, &
                                  ia2_Integer1D = ia2_IntegerArray1D)
        i_TestValue = ia1_ScalarInteger (1)
        r_TestValue = ra1_ScalarReal (1)
        ia1_TestArray = ia2_IntegerArray1D (:,1)
        write(*,*) 'from channel2:', i_TestValue
        write(*,*) 'from channel2: ', r_TestValue
        write(*,*) 'from channel2: ', ia1_TestArray
        ! IsReceive was successful, this image is ready for the next task:
        i_Channel2Status = enum_Channel2Status_CE1 % subtask2
        call system_clock(count = i_Time1) ! reset the timer
      end if
    end block execute_coroutine_subtask1
  !========================================================
  ! error: unclassified image
  case default
    return
  end select subtask1_select

end block subtask1_block
end if subtask1_if
! (1) end subtask1 block
!----------------------------------------------------------------------------------------
!
!----------------------------------------------------------------------------------------
! (2) subtask2 block:
subtask2_if: if (i_Channel2Status == &
                 enum_Channel2Status_CE1 % subtask2) then
subtask2_block: block
  !
  subtask2_select: select case (i_ImageType)
  !========================================================
  ! execute coroutine:
  case (enum_ImageType % ExecuteImage) ! on the execute images

    execute_coroutine_subtask2: block
      integer(glob_kint) :: i_TestValue
      real(glob_krea) :: r_TestValue
      integer(glob_kint), dimension(1:3) :: ia1_TestArray
      i_TestValue = 222222
      r_TestValue = 22222.222
      ia1_TestArray = (/22222,2222222,22222222/)
      call chnl_Channel2 % fill (i_val = i_TestValue, r_val = r_TestValue, &
                                                    ia1_val = ia1_TestArray)
      call chnl_Channel2 % send (i_chstat = i_Channel2Status)
      ! this image is ready for the next task:
      i_Channel2Status = enum_Channel2Status_CE1 % e_xit
    end block execute_coroutine_subtask2
  !========================================================
  ! control coroutine:
  case (enum_ImageType % ControlImage) ! on the control image

    control_coroutine_subtask2: block
      integer(glob_kint), dimension (1:i_NumberOfExecutingImages) :: ia1_ScalarInteger
      real(glob_krea), dimension (1:i_NumberOfExecutingImages) :: ra1_ScalarReal
      integer(glob_kint), dimension(1:3, 1:i_NumberOfExecutingImages) :: ia2_IntegerArray1D
      ! always use only a single channel within a block with IsReceive !
      ! (otherwise the data transfer through a channel won't synchronize successfully) !
      if (chnl_Channel2 % IsReceive (i_chstat = i_Channel2Status)) then
        call chnl_Channel2 % get (ia1_ScalarInteger = ia1_ScalarInteger, &
                                  ra1_ScalarReal = ra1_ScalarReal, &
                                  ia2_Integer1D = ia2_IntegerArray1D)
        write(*,*) 'from channel2: ', ia1_ScalarInteger(:)
        write(*,*) 'from channel2: ', ra1_ScalarReal(:)
        write(*,*) 'from channel2: ', ia2_IntegerArray1D(:,:)
        ! IsReceive was successful, this image is ready for the next task:
        i_Channel2Status = enum_Channel2Status_CE1 % e_xit
        call system_clock(count = i_Time1) ! reset the timer
      end if
    end block control_coroutine_subtask2
  !========================================================
  ! error: unclassified image
  case default
    return
  end select subtask2_select
end block subtask2_block
end if subtask2_if
! (2) end subtask2 block
!----------------------------------------------------------------------------------------
end procedure frob03_01_FM3_CE1_SM2
!___________________________________________________________
!
end submodule OOOPfrob03_01_sub_FM3_CE1_SM2

jeffhammond commented 1 year ago

@MichaelSiehl You have responded with "open your eyes" regarding coarrays to the lead of the OpenCoarrays project. You may want to reflect on your style of online engagement and consider whether those of us involved in the Fortran asynchrony effort have already considered coarrays and excluded it for sound technical reasons.

certik commented 1 year ago

@MichaelSiehl thank you for your comment.

@jeffhammond, this forum is open to everyone, and the reason I helped create it is to connect the Fortran community. When you say "those of us involved in the Fortran asynchrony effort", whether you intended it or not, it sounds to me like there might be some kind of a "hidden club of people" and if you are not part of this club, you are not allowed to express your opinion online and participate. So I just want to make clear to everyone here that everyone is welcome to participate and there is no such club here. @jeffhammond I recommend that you involve @MichaelSiehl and others in your efforts to improve parallel Fortran.

jeffhammond commented 1 year ago

There is no hidden club but I'm not interested in including people who spew condescension at others on GitHub without making any effort to understand the context for things. I wrote a blog post on asynchrony, which also appears as an issue here, that demonstrates that coarrays were considered and mentions why I do not believe they're the answer. In short, coarrays are not a shared memory model. Furthermore, they are a static resource. They fail to address the most basic needs of asynchronous tasks in prior art.

I am not blind to coarrays and neither is Damian. We do not need to open our eyes. We are just interested in pursuing a different, complementary path.

FortranFan commented 1 year ago

@jeffhammond wrote at Aug. 28, 2022, 10:19 AM:

.. I'm not interested in including people who spew condescension at others on GitHub without making any effort to understand the context for things ..

Who are these other "people who spew condescension" on this thread? Is there a miscommunication or a misunderstanding of remarks that is being read as "people who spew condescension"?

rouson commented 1 year ago

Hopefully @MichaelSiehl, a longtime OpenCoarrays user who has put coarrays to great use, simply didn't scrutinize the list of issues cited in my original comment. These issues can't be addressed by coarrays in their current form. Hopefully the "open your eyes" comment was meant in jest, but tone is difficult to interpret in text.

MichaelSiehl commented 1 year ago

In fact, “Open your eyes” was meant as a general comment towards anyone.

As I said, I am still working on the coarray-based channel implementation. Uploads will not be until the end of 2022 or some time in 2023. I will make the codes as simple and few as possible to open the door to others as much as I can. But programming then will be very different, not to mention the impact on algorithm development.

Fortran is a key technology to massively reduce energy consumption in computing with spatial accelerators, and to develop the programming models and techniques for a new era of green computing. To receive attention we may communicate this aggressively; why not, if it helps?

As a preview: I can use coarrays (atomic and non-atomic) to implement channels. But I can’t then use coarrays alongside such channels, because data transfer through a coarray-based channel and data transfer through a coarray itself must be synchronized differently. Simply put, with a channel I removed the blocking spin-wait loop from the synchronization and placed it in the execution control of the coroutines, so that these coroutines can now execute permanently and asynchronously on each image within the same spin-wait loop. Execution control of the coroutines requires a programming model that must be implemented in correspondence with both the coroutines and the channel. As far as I can tell, such channel systems could be outside the possibilities of an implementer (in any programming language) because the programmer must be able to customize these to specific requirements. The only other PGAS language team that I am aware of that intended to implement (Go-style) coroutines was the Chapel team, but I can’t find anything about those plans anymore. I don’t think it’s that easy, or even possible, for an implementer. In fact, the synchronization and data transfer inside my channel is a single process, and the “blocking” spin-wait loop is not even part of the channel. The synchronization is also used to control the execution flow among the coroutines.
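The general idea of polling instead of blocking can be sketched with standard atomic subroutines. This is not the channel implementation described above (which is not shown in this thread) but a deliberately small illustration under stated assumptions: every image signals one neighbour with `atomic_define` and checks its own flag with `atomic_ref` once per pass of a loop that also does other work. The names `flag`, `right`, `observed`, and `work_done` are hypothetical.

```fortran
program nonblocking_poll
  use, intrinsic :: iso_fortran_env, only: atomic_int_kind
  implicit none

  integer(atomic_int_kind) :: flag[*]     ! one signal slot per image
  integer(atomic_int_kind) :: observed
  integer :: me, right, work_done

  me = this_image()
  ! Neighbour to signal; wraps around, so one image signals itself.
  right = merge(1, me + 1, me == num_images())

  call atomic_define(flag, 0)
  sync all          ! every flag is initialized before anyone signals

  call atomic_define(flag[right], 1)      ! signal the neighbour

  work_done = 0
  do
    call atomic_ref(observed, flag)       ! a single, non-blocking look
    if (observed == 1) exit
    work_done = work_done + 1             ! stand-in for useful work
  end do

  print *, "image", me, "saw the signal after", work_done, "work steps"
end program nonblocking_poll
```

The only blocking statement is the initial `sync all`; after that, each image decides for itself what to do between polls, which is roughly where the execution control of coroutines would sit in the approach described above.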

cheers

rouson commented 1 year ago

@MichaelSiehl my original comment in this issue calls for "programmer-managed, single-image asynchrony." Please see the cited issues:

#270 requires allowing different do concurrent iterations to reference and define an atomic variable even within a single image, whereas coarrays allow data referencing and definition in an image that executes asynchronously relative to other images,
#271 explains why using coarrays for the intended use case is awkward and limiting, and
#272 involves a non-coindexed event_type argument, which makes the standard's requirement that an event_type object be a coarray superfluous.

I suspect that leading with "Open your eyes..." is likely to achieve the opposite effect of what the phrase states.

rouson commented 1 year ago

@MichaelSiehl besides the inherent limitations of SPMD for task parallelism (namely that the number of images is fixed at program launch), I would expect only a minority of programmers to have the patience and sophistication to roll their own task pool. If you disagree, please post an example that has a small fraction of the number of statements that you posted and preferably without all the unexplained magic numbers. To the extent that SPMD-based task parallelism can be useful, I worked with @everythingfunctional and others to write the FEATS task-scheduler, wherein the user need only specify the vertices in a directed acyclic graph (e.g., as here) and the framework handles the execution of the tasks in a manner that respects the DAG, but even this approach still does nothing to address programmer-managed, single-image asynchrony as contemplated in my original comment.
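To illustrate what "respecting the DAG" means, here is a small serial sketch. It does not use the FEATS API (which is not shown in this thread); the dependency matrix `depends_on`, the task count, and the program name are all assumptions made for this example. A task runs only once all of its prerequisites are marked done.

```fortran
program dag_order_sketch
  implicit none
  integer, parameter :: n = 4

  ! depends_on(i, j) is .true. when task j must finish before task i starts.
  logical :: depends_on(n, n)
  logical :: done(n)
  integer :: i, n_done

  depends_on = .false.
  depends_on(2, 1) = .true.   ! task 2 needs task 1
  depends_on(3, 1) = .true.   ! task 3 needs task 1
  depends_on(4, 2) = .true.   ! task 4 needs task 2 ...
  depends_on(4, 3) = .true.   ! ... and task 3

  done = .false.
  n_done = 0

  ! Sweep until every task has run; this terminates for any acyclic graph.
  do while (n_done < n)
    do i = 1, n
      if (done(i)) cycle
      ! Ready when each prerequisite is done or is not a prerequisite at all.
      if (all(done .or. .not. depends_on(i, :))) then
        print *, "running task", i
        done(i) = .true.
        n_done = n_done + 1
      end if
    end do
  end do
end program dag_order_sketch
```

A real scheduler would hand ready tasks to different images rather than run them in a single loop, but the readiness test is the same idea: the user states the dependencies and the framework picks the order.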

rouson commented 1 year ago

@jeffhammond if I understand the outcome of last month's J3/WG5 meeting, I believe the HPC subgroup will take up the charge of single-image asynchrony so there's no need for a new subgroup. I'm therefore tempted to close this issue but for two useful things: (1) my original comment links three issues and that list could be expanded as necessary so that we have a place to track related proposals and (2) the original comment received six up-votes, which seems useful for demonstrating the popularity of the related issues.

jeffhammond commented 1 year ago

Let's leave it open for the reasons you mention, and if we don't make progress in the HPC subgroup, we can revisit this issue.

jefflarkin commented 1 year ago

@MichaelSiehl Co-Arrays are a fantastic abstraction for distributed-memory programming, but I don't think they're the right abstraction for tasking and fear that if the committee were to adapt them to become a solution to both problems they'll end up not solving either problem as well as a discrete solution for each. The solution for tasking certainly needs to be mindful of co-arrays and ensure that they compose well (unlike MPI, whose asynchrony currently doesn't compose well with OpenMP, OpenACC, CUDA, etc.), but I think the solution is going to be different from co-arrays because the problem it's solving is sufficiently different from the problem co-arrays solve.

MichaelSiehl commented 1 year ago

@rouson From my viewpoint the SPMD model can become the key feature for efficient configuration of spatial accelerators using Fortran. My code example above merely fills the images of a coarray team with a single ‘thread’ each, so to speak. I am already able to run multiple such composed coroutines asynchronously to run multiple ‘threads’ simultaneously on each coarray image. (This alone reduces the PGAS cost function dramatically: because each image always executes some portion of a ‘thread’, the cost of the data transfers may decline toward zero.) I am also able to switch tasks (composed coroutines), to change workloads on the ‘threads’ without leaving the current coarray team and without entering another coarray team. The user-defined channel also allows transferring (and collecting) data across different ‘threads’ even if these execute on distinct coarray images. Allocations that are required among the ‘threads’ are centralized at the module level. If I am correct, this could already allow configuring spatial accelerators homogeneously for heterogeneous workloads and avoiding reconfiguration (of the accelerator) for executing the (heterogeneous) dataflow. My hope is also that this will lead to a high level of data reuse with spatial accelerators, to achieve the highest performance with energy efficiency: https://arxiv.org/pdf/2106.10499.pdf

Please note also that we implement tasks/task pools as coroutines. My syntax example above is not so different from what we can see elsewhere: one tutorial for coroutines in C# starts with a very similar syntax (even if it’s not the recommended syntax there): https://www.codeproject.com/Tips/5262735/What-is-a-Coroutine

@jefflarkin I use coarrays only to implement user-defined channels for use with coroutines. I can’t otherwise use coarrays with asynchronous coroutine execution because of the required (blocking) synchronization process. But I would not be able to implement such channels without coarrays either. Thus, I must say that despite the minimal usage of coarrays in my codes, they are still essential, as their properties, namely the symmetric memory, are still there if I use them as channels. This makes these user-defined channels highly efficient. The importance of coarrays for asynchronous task execution should not be underestimated: super-minimal usage of atomic coarrays for the synchronization, and minimal usage of non-atomic coarrays for the data transfer.

MichaelSiehl commented 1 year ago

@rouson: I’ve just finished my first GitHub repository on dataflow/spatial programming, where I also explain how single-image asynchronous CAF programming works. It already works reliably on a CPU using ifort and gfortran/OpenCoarrays, and I am somewhat confident that my conclusions therein are valid:

https://github.com/MichaelSiehl/Spatial_Fortran_1

The relevant sections for single-image asynchronous CAF programming are:

Ordered Execution Segments with Non-Blocking Synchronization in Coarray Fortran
Non-Blocking Synchronization (required for implementing asynchronous coroutines)
8.1 Sequentially Consistent Memory Ordering
8.3 Asynchronous Code Execution on the Single Coarray Images

(Sequentially Consistent) Memory Ordering is explained for C++/DPC++ and is similar to Execution Segment Ordering in Coarray Fortran, as both lead to (globally) ordered atomic operations, as I currently understand it.

cheers

rouson commented 1 year ago

@MichaelSiehl thanks for all your hard work on this. I suspect your write-up will be very helpful as the Berkeley Lab LLVM Flang team continues work on a Coarray Fortran Runtime Design Document that we will present for the first time in a call with other LLVM Flang developers next week as a proposal for adoption once complete. The document is currently incomplete and we welcome contributions from anyone interested in helping us complete it. In particular, you're the only person who has ever mentioned sync memory in an OpenCoarrays issue so your experience with that statement is invaluable. Also, we recently held a Flang focus group discussion on sequentially consistent memory ordering at the February meeting of the US part of the Fortran standard committee so your insights in that area are likely to be very helpful.

certik commented 1 year ago

@rouson if at all possible, if you could ensure the Coarray API is compiler independent, then we can use it in LFortran as well, which would be very helpful, and GFortran could also use it.

jeffhammond commented 1 year ago

(Sequentially Consistent) Memory Ordering is explained for C++/DPC++ and is similar to Execution Segment Ordering in Coarray Fortran, as both lead into (globally) ordered atomic operations, as I understand it yet.

Can you elaborate on how sequential consistency produces globally ordered atomics?

For example, x86 has sequentially consistent atomics but this does not create a global ordering.

MichaelSiehl commented 1 year ago

Can you elaborate on how sequential consistency produces globally ordered atomics?

My mistake, it's the opposite of course:

Sequential consistency does not produce globally ordered atomics; rather, it’s the other way round: the ‘single global order of all atomic operations’ is a key property of a ‘sequentially consistent memory ordering’, and the programmer must guarantee a ‘single global order of all atomic operations’ through a supporting programming model. As far as I understand it, this is required in parallel programming to prevent compilers and hardware from re-ordering operations (on each coarray image).

Sequentially Consistent Memory Ordering is explained for DPC++ (page 506 in the book), the crucial property is a “single global order of all atomic operations”...among all program instances. The memory order must be “supported by a combination of programming model and device”.

I translated this to Coarray Fortran as follows:
single = the same atomic operation for each instance of an atomic coarray variable (in my programming, a combination of atomic_define and atomic_ref on the coarray images)
global = on all coarray images (of a coarray team)

Using standard Coarray Fortran with ordered execution segments, the ordering of the atomic operations is automatically established through the (blocking) synchronizations, so this is already a sequentially consistent memory ordering. As I am using a non-blocking synchronization method, I am no longer able to establish ordered execution segments through the non-blocking synchronizations alone and thus, no sequentially consistent memory ordering any more.

I am using an integer-based enumeration technique to assign values to atomic operations through the user-defined 'channels' in the coroutines. These assigned integer-based enum values are the key to ensuring that atomic operations are still ordered (one could also use plain integer values), as I use these values with the control flow in the coroutines on all coarray images (of a team).

j3-fortran / fortran_proposals

Form an asynchrony subgroup #274
