`OMPTaskTrans` / `InliningTrans` / ? resulting in incorrect data sharing clauses

LonelyCat124 commented 7 months ago

When running OMPTaskTrans (with manually predone inlining trans) there's an issue with renamed variables not appearing in data sharing clauses.

When inlining momentum_u_code we get the expected behaviour: firstprivate(j_out_var,j_el_inner,u_e,depe,u_w,depw,v_sc,v_s,deps,v_nc,**v_n**,depn, (there's an argument over whether these should be firstprivate or not but thats not really important for now).

When then inlining momentum_v_code, the inlining transformation renames v_n to v_n_1 to prevent collision (makes sense), however the private/firstprivate clauses (and dependency clauses) on this task still use v_n instead of v_n_1.

I think this is happening as the parent parallel clause doesn't seem to know anything about v_n_1, and thus it is considered to be shared by the tasks, so shouldn't be declared private or firstprivate, however this doesn't explain why the task's clauses contain v_n which is never used in the task, so shouldn't be inside the clauses unless renaming is happening later than I expect?

LonelyCat124 commented 7 months ago

@arporter I'm debugging this and I'm a bit confused how the renaming/inlining works slash I think it produces incorrect code with OpenMP.

I'd assumed that when we inline code we would rename any symbols that collide at the time we inline (line 162 in inline_trans does a table.merge). However, I'm not sure this works how I expected, as it looks for the containing scoping region (in this case a loop) and only merge's into the loop's symbol table (I think).

I've then generated clauses (Parallel, task) which use the inlined variable names (which are still the original ones because the containing loop has an empty symbol table), so we get e.g. firstprivate(v_n).

Then, when lowering the loop, its symbol table is merged, resulting in v_n being renamed to v_n_1 everywhere inside the loop, but not touching anywhere outside the loop (e.g. the data sharing clauses in the task AND parallel clauses). This means that the generated code is wrong, as the renamed v_n_1 is not in any data sharing clauses in any of the OpenMP clauses, so it falls back to our default(shared) behaviour, resulting in a race condition.

I think maybe we need to be smarter when inlining, but I'm still a bit unclear on how scoping works - I assume we have 3? symbol tables for a case like this (some shorthand):

import mymod, only: var1
subroutine mysub()
   integer :: var1
   integer :: i
   do i = 1, 10
      call mysub2(i)
   end do
end subroutine

subroutine mysub2(i)
   var1 = i
end subroutine

So we have a var1 in some Container symbol table, a var1 in the Routine symbol table and a var1 in the Loop symbol table (after inlining mysub2)

At the very least I think when inlining I think we should check the .scope symbol table upwards to the routine level, maybe even the container level? I'm not sure if its sensible to do all the upwards merging to routine level when inlining but it might be the only safe way to avoid issues with OpenMP clauses at the moment?

arporter commented 7 months ago

That's weird. By default, any symbols in outer symbol tables are in scope in any inner table. Therefore, inlining should account for all name clashes. So, in theory, it should work...if you could narrow down precisely where it is that would help.

LonelyCat124 commented 7 months ago

Hi Andy, I think the issue might be that the structure is two inlined functions with overlapping names:

do i = 1, 10
   call momentum_u(...) ! This contains v_n as a variable
end do

do i = 1, 10
   call momentum_v(...) ! This also contains v_n as a variable
end do

When these functions are inlined, they both find their scope to be the loop, so the symbols are added to that Loop's symbol table only in the apply function in inline_trans (since they look for node.scope.symbol_table). This then means the names are allowed to both exist at that time and only collide during lowering, when the Loop's individual symbol tables are merged into the routine.

The OpenMP clauses exist outside of those individual scopes, and since another symbol with the name they reference exists, PSyclone assumes the thing they're referencing to still be correct (I assume) and doesn't adjust them since they lie outside the scope being merged.

Does that make sense?

arporter commented 7 months ago

Ah! Yes, that makes sense. The OMP clauses need to get updated when the inlining happens. Perhaps using the new node-updating signaling mechanism.

LonelyCat124 commented 7 months ago

My concern is I don't know that the OpenMP clauses can know how to tell what they need to use since both the original and modified named symbols can appear inside the directive's schedule. Would it be bad to have the inline transformation search for a parent Routine and use that scope for inlining instead? Or is that somewhat bad practice in PSyclone?

arporter commented 7 months ago

Well, we try to avoid having to do that but sometimes we do resort to it yes. I can't remember how we generate the Clauses but could we just repeat it once the inlining is done (if it uses dep. analysis)?

LonelyCat124 commented 7 months ago

The task one doesn't use dep analysis - at the moment the task one essentially says "please don't modify the code underneath me once i've been lowered". I guess one thing that might work would be to lower all the children of the task directive before computing the clauses - though I suspect this may not fix the issue with the OMPParallelDirective clauses. I'll try that quickly and see if that resolves the issue for now.

LonelyCat124 commented 7 months ago

That doesn't resolve it because the merging of the symbol tables isn't happening in lowering but during code generation. I think it would be quite difficult to regenerate clauses at this point (partly as the task directive was designed in such a way that post-lowering we didn't think code changes would happen). I could force the task region to merge all child ScopingNode symbol tables with its parent parallel regions scope at lowering? Would this be a reasonable option?

arporter commented 7 months ago

Can we add the task directives after the inlining has been performed? I think maybe I need to chat through what's going on to fully understand the situation.

LonelyCat124 commented 7 months ago

Yeah that always will happen (though i'm doing it explicitly for now as there were some other problems with doing it after the ChunkedLoopTrans for NemoLite2D I think).

The steps I'm doing to hit this error (NemoLite2D) are:

Lower all Kerns to Calls
Inline all of the Calls
Apply ChunkedLoopTrans + OMPTaskTrans to the loops
Apply OMPParallelTrans

What seems to happen then in PSyclone is:

Lower all of the tree to language level - at this point the clauses on OpenMP directives are created / refreshed if needed
Codegen starts visiting nodes for codegen - when it reaches routine_node in the Fortran backend, there is code to do whole_routine_scope.merge(schedule.symbol_table) for every Schedule in the routine - at this point it renames a bunch of variables inside those scopes, but anything outside of those scopes (which OpenMP directives need to be by definition) doesn't understand the renames (and can't safely since the original name can be used inside the OpenMP directive still, so renaming isn't sufficient).

Happy to have a chat about it - let me know when works for you :)

LonelyCat124 commented 7 months ago

That doesn't resolve it because the merging of the symbol tables isn't happening in lowering but during code generation. I think it would be quite difficult to regenerate clauses at this point (partly as the task directive was designed in such a way that post-lowering we didn't think code changes would happen). I could force the task region to merge all child ScopingNode symbol tables with its parent parallel regions scope at lowering? Would this be a reasonable option?

This sort of works (I need to empty the SymbolTable after) but it is very "Fortran" to have to do this (local fields could remain local to that scope in other languages), but I don't have a good generic solution otherwise.

LonelyCat124 commented 7 months ago

Also my "fix" here only works for code with tasking, if we inlined and then applied other OpenMP directives we'd get the same issue, so this should probably be done during lowering of any OpenMP parallel directive.

arporter commented 7 months ago

Let me see if I can write a test that exercises this...

code = (
    "module psy_single_invoke_test\n"
    "  use field_mod, only: r2d_field\n"
    "  use kind_params_mod\n"
    "  implicit none\n"
    "  contains\n"
    "  subroutine invoke_0_compute_cu(cu_fld, pf, u_fld)\n"
    "    type(r2d_field), intent(inout) :: cu_fld, pf, u_fld\n"
    "    integer j, i, a_clash\n"
    "    do j = cu_fld%internal%ystart, cu_fld%internal%ystop, 1\n"
    "      do i = cu_fld%internal%xstart, cu_fld%internal%xstop, 1\n"
    "        a_clash = i\n"
    "        call compute_cu_code(a_clash, j, cu_fld%data, pf%data, "
    "u_fld%data)\n"
    "      end do\n"
    "    end do\n"
    "  end subroutine invoke_0_compute_cu\n"
    "  subroutine compute_cu_code(i, j, cu, p, u)\n"
    "    implicit none\n"
    "    integer,  intent(in) :: i, j\n"
    "    real(go_wp), intent(out), dimension(:,:) :: cu\n"
    "    real(go_wp), intent(in),  dimension(:,:) :: p, u\n"
    "    real(go_wp) :: a_clash\n"
    "    a_clash = 3.0d0\n"
    "    cu(i,j) = a_clash*0.5d0*(p(i,j)+p(i-1,j))*u(i,j)\n"
    "  end subroutine compute_cu_code\n"
    "end module psy_single_invoke_test\n"
)

Doing InlineTrans (after hacking the validate) and then OMPLoopTrans, the generated code is:

integer :: a_clash
real(kind=go_wp) :: a_clash_1

!$omp parallel do default(shared), private(a_clash_1,i,j), schedule(auto)
do j = cu_fld%internal%ystart, cu_fld%internal%ystop, 1
  do i = cu_fld%internal%xstart, cu_fld%internal%xstop, 1
    a_clash = i
    a_clash_1 = 3.0d0
    cu_fld%data(a_clash,j) = a_clash_1 * 0.5d0 * (pf%data(a_clash,j) + pf%data(a_clash - 1,j)) * u_fld%data(a_clash,j)
  enddo
enddo
!$omp end parallel do

so a_clash is now shared which is wrong.

LonelyCat124 commented 7 months ago

I also found a weird case when attempting to create a test:

def test_failure(fortran_reader, fortran_writer):
    '''Test with OMPParallelTrans and OMPLoopTrans'''
    code = '''subroutine to_inline(i, j, k)
        integer :: i , j, k
        integer :: a
        a = i * j
        k = a * a
    end subroutine

    subroutine main_sub()
       integer :: i, j, k
       integer, dimension(100) :: l

       do k = 1, 50
         do i = 1, 5
           do j = 1, 7
             call to_inline(i,j, l(k))
           end do
         end do
      end do

      do k = 51, 100
        do i = 2, 24
          do j = 3,21
            call to_inline(i,j, l(k))
          end do
        end do
      end do
    end subroutine
    '''
    psyir = fortran_reader.psyir_from_source(code)
    call_nodes = psyir.walk(Call)
    inline_trans = InlineTrans()
    main_sub = psyir.walk(Routine)[1]
    from psyclone.psyir.transformations import OMPLoopTrans
    from psyclone.transformations import OMPParallelTrans
    for call in call_nodes:
        inline_trans.apply(call)
    loopy = OMPLoopTrans()
    loopy.apply(main_sub.children[1])
    loopy.apply(main_sub.children[0])
    parallelt = OMPParallelTrans()
    parallelt.apply(main_sub.children[0].children[:])
    print(fortran_writer(psyir))
    assert False

I feel like this should be ok - we inline the loops, then add parallelism to the outermost loop followed by a parallel region around the whole code. This doesn't even reach codegen - when attempting to parallelise the first loop I hit an error: KeyError: "Could not find 'a' in the Symbol Table."

The call stack is

../../../psyir/transformations/omp_loop_trans.py:265: in apply
    super().apply(node, options)
../../../psyir/transformations/parallel_loop_trans.py:195: in apply
    self.validate(node, options=options)
../../../psyir/transformations/parallel_loop_trans.py:152: in validate
    if not node.independent_iterations(dep_tools=dep_tools,
../../../psyir/nodes/loop.py:424: in independent_iterations
    return dtools.can_loop_be_parallelised(
../../../psyir/tools/dependency_tools.py:751: in can_loop_be_parallelised
    symbol = symbol_table.lookup(var_name)

when calling loopy.apply(main_sub.children[1])

Edit: I think that this failure is a failure of dependency tools not considering child scopes of nested loops as opposed to the InlineTrans though, as fixing the issues we see for tasking doesn't prevent this occuring later.

arporter commented 7 months ago

Edit: I think that this failure is a failure of dependency tools not considering child scopes of nested loops as opposed to the InlineTrans though, as fixing the issues we see for tasking doesn't prevent this occuring later.

Great. That means we can farm it out to @hiker ;-) (possibly!)

arporter commented 7 months ago

I've pushed my simplified (now just generic PSyIR rather than GOcean) test to 2421_omp_clause_bug.

LonelyCat124 commented 7 months ago

I guess then we need to decide which fix we prefer.

Either we say this is InlineTrans job and it has to search upwards for a Routine to merge the symbol table with, or its an issue unique to OpenMP CPU Parallelism, in which case we make it "Fortran-specific" for now and have the OMP Parallel directive merge all child ScopingNode symbol tables during lowering (before clauses are computed/updated).

I guess one question I have is can the lowering function know what language the output is going to be and only do it if its going to be Fortran? That feels like something that lowering probably doesn't/shouldn't know/care about? But the clauses and languages do have slightly different rules here in the standard sometimes anyway.

arporter commented 7 months ago

I'm hoping that we can fix the OpenMP directive but I haven't yet got to the bit of code that's responsible for generating the clauses. It feels as though it is simply not looking at all of the SymbolTables that it contains?

LonelyCat124 commented 7 months ago

Well its complicated because if we're being language-independent it shouldn't have to, as by the OpenMP standard variables declared inside an OpenMP region are considered to be private automatically (since they have no existence outside the region anyway its fine).

In Fortran that concept doesn't exist (local scopes) so I'm not sure what is more sensible to do.

arporter commented 7 months ago

Oooh, that is complicated. I think part of the problem is that the list of variables to put in the private clause is obtained using the Node.reference_accesses method and that flattens References to Signatures and assumes that if two Signatures have the same name then they are the same variable. I'm beginning to think that your first suggestion of being picky about which symbol table InlineTrans adds things to is the simplest way forward.

However, this is also a bug in reference_accesses that we somehow need to fix. I've created #2424 for this.

arporter commented 7 months ago

Lower all of the tree to language level - at this point the clauses on OpenMP directives are created / refreshed if needed

Codegen starts visiting nodes for codegen - when it reaches routine_node in the Fortran backend, there is code to do whole_routine_scope.merge(schedule.symbol_table) for every Schedule in the routine - at this point it renames a bunch of variables inside those scopes, but anything outside of those scopes (which OpenMP directives need to be by definition) doesn't understand the renames (and can't safely since the original name can be used inside the OpenMP directive still, so renaming isn't sufficient).

You were quite right about this (it just took me a long while to understand). In theory, the renaming should be sufficient because the clauses on the OMP directive should refer to Symbols (that retain their identity while being renamed). However, in OMPParallelDirective.infer_sharing_attributes we go from Symbols -> Signatures -> Symbols and thus lose any distinct Symbols that are in different tables but happen to have the same name. This needs to be solved by #2424.

LonelyCat124 commented 7 months ago

I've been trying a bit to do inlining + OpenMP loop transformation (weird performance results) and I cannot get this to work at all. I attempted to have InlineTrans merge the scopes at a higher level (to avoid the dependence analysis failures), but this fails due to it believing one of the symbols is an argument (I think?) to a Call and refuses/believes its impossible to rename either of the colliding symbols.

I'm not quite sure why this then doesn't cause issues the way it works at the moment (where its inlined to a local scope and then merged to the routine scope later) for the tasking transformation, maybe something happens at some point during lowering/? that converts them into standard symbols.

arporter commented 7 months ago

I attempted to have InlineTrans merge the scopes at a higher level (to avoid the dependence analysis failures), but this fails due to it believing one of the symbols is an argument (I think?) to a Call and refuses/believes its impossible to rename either of the colliding symbols.

By this do you mean you've changed InlineTrans so that it adds new Symbols to the the table of the Routine it is inlining into?

LonelyCat124 commented 7 months ago

I attempted to have InlineTrans merge the scopes at a higher level (to avoid the dependence analysis failures), but this fails due to it believing one of the symbols is an argument (I think?) to a Call and refuses/believes its impossible to rename either of the colliding symbols.

By this do you mean you've changed InlineTrans so that it adds new Symbols to the the table of the Routine it is inlining into?

Yes I attempted to - it was very naive in that all I did was change table = node.scope.symbol_table (line 136, 2nd line in apply) to table = node.ancestor(Routine).scope.symbol_table but then if I remember correctly the table.merge at line 161 was then failing.

LonelyCat124 commented 7 months ago

@arporter I realised my error when talking to Rupert yesterday. I was trying to merge the symbol tables early (before inlining was complete), which was failing as it was attempting to merge symbol tables of a symbol used in a Call, which is not allowed. This is basically failing because the call is still present in the code, so I'd suggest instead a 2-step process:

Do the current table.marge as it currently takes place.
Perform the rest of the inlining operation.
Once the inlining is complete and the Call has been replaced, merge the node.scope.symbol_table with the node.ancestor(Routine).scope.symbol_table and empty the scope's symbol table.

Edit: I added changes like this into PSyclone to test this with NemoLite2D and it allows me to do inlining + OMPLoopTrans (though it requires {force: True} for one of the loops due to it being a "race condition", which Sergi says probably is not real). I can make a PR for this if you think this is a reasonable solution?

LonelyCat124 commented 2 months ago

@arporter I think this is the issue I mentioned yesterday.

stfc / PSyclone

`OMPTaskTrans` / `InliningTrans` / ? resulting in incorrect data sharing clauses #2421