stfc / PSyclone

Domain-specific compiler and code transformation system for Finite Difference/Volume/Element Earth-system models in Fortran
BSD 3-Clause "New" or "Revised" License
106 stars 28 forks source link

hoist_expressions in GPU script can prevent collapsing loops (I think?) #2573

Open LonelyCat124 opened 5 months ago

LonelyCat124 commented 5 months ago

I'm looking at a section of the newer socrates code, where there is a loop:

    DO k_inner=1, n_k_term_inner
      k = (k_outer-1)*nd_k_term_inner + k_inner
      ! Set the combination of terms for this iteration of the loop
      iex_major(k_inner) = iex_major_all(i_term_reduced(k))
      DO j=1, n_abs
        iex_minor(j, k_inner) = iex_minor_all(j, i_term_reduced(k))
        i_abs = index_abs(j)
        ! Set the ESFT term for this absorber
        iex = i_esft_pointer(j, i_term_reduced(k))
        k_esft(1:n_profile, 1:n_layer, i_abs, k_inner) = &
          k_abs_layer(1:n_profile, 1:n_layer, iex, i_abs)
      END DO
      ! The product of the ESFT weights
      product_weight(k_inner) = product_weight_all(i_term_reduced(k))
    END DO ! k_inner

I wanted to (hopefully) collapse the k_inner and j loops, so I refactored the code as follows (and removed the iex_major writes to a separate loop):

    DO k_inner=1, n_k_term_inner
      DO j=1, n_abs
             k = (k_outer-1)*nd_k_term_inner + k_inner
 ...

This should result in the same behaviour, not contain any race conditions etc.

However, the default script (that should have good general behaviour) prevents this as it hoists this statement (since its loop independent of the outer loop) up to the outer loop, preventing collapse:

Do k_inner=1, n_k_term_inner
  k = (k_outer-1)*nd_k_term_inner + k_inner
  Do j=1, n_abs
...

I think for GPUs we should be more careful with this hoisting behaviour, and use the collapse logic as well, e.g.

@sergisiso @arporter what do you think? Would this be a reasonable general rule to use instead of the current one?

sergisiso commented 5 months ago

Makes sense, note that the hoisting happens in the script call to normalise, so you can turn it off:

normalise_loops(
          invoke.schedule,
          hoist_local_arrays=True,
          convert_array_notation=True,
          loopify_array_intrinsics=True,
          convert_range_loops=True,
          hoist_expressions=False
  )

or make it subroutine specific with hoist_expressions=False if invoke.name = "mysubroutine" else True

The current issue for making the hoisting aware of collapse, it that the collapse logic still lives outside src (in examples/nemo/utils.py). This was temporal, at some point I wanted to bring it somewhere inside source. At least as a transformation option for the omp transformations, but maybe even in a more general place (collapse if basically and extension of proving the iteration independence but for more than one contiguous loops). Then we can have an option in the hoisting transformation to enable or disable hoisting this statements.

sergisiso commented 5 months ago

Oh and btw the OpenMP standard does not specify that the collapsed loops have to be perfectly nested, this was usually the case but some latest versions of compilers support simple statements in the middle of the loops like the one you shown, but I unsure if we should let this happen because still some compilers will fail.

LonelyCat124 commented 5 months ago

re: collapse + compilers maybe we should test with: Recent-ish gcc (9 or 10?) Recent-ish intel (2021? 2022?) Recent-ish Nvidia Recent-ish cray (if we have access somewhere?) MO current preferred compiler(s)

If those all accept the new collapse then I think we're probably ok to move towards the 5.1 standard with how we handle collapse.

I'll try to check that file in detail and check if there's any other things that should be hoisted, but if not I'll try disabling hoist for that file.