Open LonelyCat124 opened 7 months ago
@arporter I'm debugging this and I'm a bit confused about how the renaming/inlining works, and I think it produces incorrect code with OpenMP.
I'd assumed that when we inline code we would rename any symbols that collide at the time we inline (line 162 in inline_trans does a table.merge). However, I'm not sure this works how I expected, as it looks for the containing scoping region (in this case a loop) and only merges into the loop's symbol table (I think).
I've then generated clauses (Parallel, task) which use the inlined variable names (which are still the original ones because the containing loop has an empty symbol table), so we get e.g. firstprivate(v_n).
Then, when lowering the loop, its symbol table is merged, resulting in v_n being renamed to v_n_1 everywhere inside the loop, but not touching anywhere outside the loop (e.g. the data sharing clauses in the task AND parallel clauses).
This means that the generated code is wrong, as the renamed v_n_1 is not in any data sharing clause on any of the OpenMP directives, so it falls back to our default(shared) behaviour, resulting in a race condition.
I think maybe we need to be smarter when inlining, but I'm still a bit unclear on how scoping works - I assume we have 3? symbol tables for a case like this (some shorthand):
import mymod, only: var1
subroutine mysub()
  integer :: var1
  integer :: i
  do i = 1, 10
    call mysub2(i)
  end do
end subroutine
subroutine mysub2(i)
  var1 = i
end subroutine
So we have a var1 in some Container symbol table, a var1 in the Routine symbol table and a var1 in the Loop symbol table (after inlining mysub2).
At the very least I think when inlining we should check the .scope symbol table upwards to the routine level, maybe even the container level? I'm not sure if it's sensible to do all the upwards merging to routine level when inlining but it might be the only safe way to avoid issues with OpenMP clauses at the moment?
That's weird. By default, any symbols in outer symbol tables are in scope in any inner table. Therefore, inlining should account for all name clashes. So, in theory, it should work...if you could narrow down precisely where it is that would help.
Hi Andy, I think the issue might be that the structure is two inlined functions with overlapping names:
do i = 1, 10
call momentum_u(...) ! This contains v_n as a variable
end do
do i = 1, 10
call momentum_v(...) ! This also contains v_n as a variable
end do
When these functions are inlined, they both find their scope to be the loop, so the symbols are added to that Loop's symbol table only in the apply function in inline_trans (since they look for node.scope.symbol_table). This then means the names are allowed to both exist at that time and only collide during lowering, when the Loops' individual symbol tables are merged into the routine.
The OpenMP clauses exist outside of those individual scopes, and since another symbol with the name they reference exists, PSyclone assumes the thing they're referencing to still be correct (I assume) and doesn't adjust them since they lie outside the scope being merged.
Does that make sense?
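To make the sequence above concrete, here is a minimal sketch in plain Python. It does not use PSyclone at all: the Symbol and ToyTable classes are hypothetical stand-ins, and the rename-on-clash rule in merge is only my assumption of roughly what the real merge does. It shows two sibling loop scopes that each hold a v_n, clauses that record plain names, and a late merge that renames one symbol after the clauses were built.

```python
class Symbol:
    """A variable whose identity survives renaming."""
    def __init__(self, name):
        self.name = name


class ToyTable:
    """Hypothetical stand-in for a symbol table (not PSyclone's SymbolTable)."""
    def __init__(self):
        self.symbols = {}

    def add(self, name):
        sym = Symbol(name)
        self.symbols[name] = sym
        return sym

    def merge(self, other):
        """Move other's symbols into this table, renaming on a clash."""
        for sym in list(other.symbols.values()):
            new_name, count = sym.name, 0
            while new_name in self.symbols:
                count += 1
                new_name = f"{sym.name}_{count}"
            sym.name = new_name
            self.symbols[new_name] = sym
        other.symbols.clear()


routine = ToyTable()
loop_u, loop_v = ToyTable(), ToyTable()

# Inlining puts v_n into each loop's own table, so no clash is seen yet.
v_n_u = loop_u.add("v_n")
v_n_v = loop_v.add("v_n")

# Clauses built at this point record plain names.
clause_u = ["v_n"]
clause_v = ["v_n"]

# Later, the loop tables are merged into the routine table.
routine.merge(loop_u)
routine.merge(loop_v)

print(v_n_u.name, v_n_v.name)       # v_n v_n_1
stale = v_n_v.name not in clause_v  # the clause still says "v_n"
```

In this toy, clause_v still names v_n, which now refers to the other loop's variable, so the renamed v_n_1 ends up in no clause at all.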
Ah! Yes, that makes sense. The OMP clauses need to get updated when the inlining happens. Perhaps using the new node-updating signaling mechanism.
My concern is I don't know that the OpenMP clauses can know how to tell what they need to use since both the original and modified named symbols can appear inside the directive's schedule. Would it be bad to have the inline transformation search for a parent Routine and use that scope for inlining instead? Or is that somewhat bad practice in PSyclone?
Well, we try to avoid having to do that but sometimes we do resort to it yes. I can't remember how we generate the Clauses but could we just repeat it once the inlining is done (if it uses dep. analysis)?
The task one doesn't use dep analysis - at the moment the task one essentially says "please don't modify the code underneath me once I've been lowered". I guess one thing that might work would be to lower all the children of the task directive before computing the clauses - though I suspect this may not fix the issue with the OMPParallelDirective clauses. I'll try that quickly and see if that resolves the issue for now.
That doesn't resolve it because the merging of the symbol tables isn't happening in lowering but during code generation. I think it would be quite difficult to regenerate clauses at this point (partly as the task directive was designed in such a way that post-lowering we didn't think code changes would happen).
I could force the task region to merge all child ScopingNode symbol tables with its parent parallel region's scope at lowering? Would this be a reasonable option?
Can we add the task directives after the inlining has been performed? I think maybe I need to chat through what's going on to fully understand the situation.
Yeah, that will always happen (though I'm doing it explicitly for now, as I think there were some other problems with doing it after the ChunkedLoopTrans for NemoLite2D).
The steps I'm doing to hit this error (NemoLite2D) are:
What seems to happen then in PSyclone is:
- Lower all of the tree to language level - at this point the clauses on OpenMP directives are created / refreshed if needed
- Codegen starts visiting nodes for codegen - when it reaches routine_node in the Fortran backend, there is code to do whole_routine_scope.merge(schedule.symbol_table) for every Schedule in the routine - at this point it renames a bunch of variables inside those scopes, but anything outside of those scopes (which OpenMP directives need to be by definition) doesn't understand the renames (and can't safely since the original name can be used inside the OpenMP directive still, so renaming isn't sufficient).
Happy to have a chat about it - let me know when works for you :)
That doesn't resolve it because the merging of the symbol tables isn't happening in lowering but during code generation. I think it would be quite difficult to regenerate clauses at this point (partly as the task directive was designed in such a way that post-lowering we didn't think code changes would happen). I could force the task region to merge all child ScopingNode symbol tables with its parent parallel regions scope at lowering? Would this be a reasonable option?
This sort of works (I need to empty the SymbolTable after) but it is very "Fortran" to have to do this (local fields could remain local to that scope in other languages), but I don't have a good generic solution otherwise.
Also my "fix" here only works for code with tasking, if we inlined and then applied other OpenMP directives we'd get the same issue, so this should probably be done during lowering of any OpenMP parallel directive.
Let me see if I can write a test that exercises this...
code = (
"module psy_single_invoke_test\n"
" use field_mod, only: r2d_field\n"
" use kind_params_mod\n"
" implicit none\n"
" contains\n"
" subroutine invoke_0_compute_cu(cu_fld, pf, u_fld)\n"
" type(r2d_field), intent(inout) :: cu_fld, pf, u_fld\n"
" integer j, i, a_clash\n"
" do j = cu_fld%internal%ystart, cu_fld%internal%ystop, 1\n"
" do i = cu_fld%internal%xstart, cu_fld%internal%xstop, 1\n"
" a_clash = i\n"
" call compute_cu_code(a_clash, j, cu_fld%data, pf%data, "
"u_fld%data)\n"
" end do\n"
" end do\n"
" end subroutine invoke_0_compute_cu\n"
" subroutine compute_cu_code(i, j, cu, p, u)\n"
" implicit none\n"
" integer, intent(in) :: i, j\n"
" real(go_wp), intent(out), dimension(:,:) :: cu\n"
" real(go_wp), intent(in), dimension(:,:) :: p, u\n"
" real(go_wp) :: a_clash\n"
" a_clash = 3.0d0\n"
" cu(i,j) = a_clash*0.5d0*(p(i,j)+p(i-1,j))*u(i,j)\n"
" end subroutine compute_cu_code\n"
"end module psy_single_invoke_test\n"
)
Doing InlineTrans (after hacking the validate) and then OMPLoopTrans, the generated code is:
integer :: a_clash
real(kind=go_wp) :: a_clash_1
!$omp parallel do default(shared), private(a_clash_1,i,j), schedule(auto)
do j = cu_fld%internal%ystart, cu_fld%internal%ystop, 1
do i = cu_fld%internal%xstart, cu_fld%internal%xstop, 1
a_clash = i
a_clash_1 = 3.0d0
cu_fld%data(a_clash,j) = a_clash_1 * 0.5d0 * (pf%data(a_clash,j) + pf%data(a_clash - 1,j)) * u_fld%data(a_clash,j)
enddo
enddo
!$omp end parallel do
so a_clash is now shared, which is wrong.
I also found a weird case when attempting to create a test:
from psyclone.psyir.nodes import Call, Routine
from psyclone.psyir.transformations import InlineTrans, OMPLoopTrans
from psyclone.transformations import OMPParallelTrans


def test_failure(fortran_reader, fortran_writer):
    '''Test with OMPParallelTrans and OMPLoopTrans'''
    code = '''subroutine to_inline(i, j, k)
        integer :: i, j, k
        integer :: a
        a = i * j
        k = a * a
    end subroutine
    subroutine main_sub()
        integer :: i, j, k
        integer, dimension(100) :: l
        do k = 1, 50
            do i = 1, 5
                do j = 1, 7
                    call to_inline(i, j, l(k))
                end do
            end do
        end do
        do k = 51, 100
            do i = 2, 24
                do j = 3, 21
                    call to_inline(i, j, l(k))
                end do
            end do
        end do
    end subroutine
    '''
    psyir = fortran_reader.psyir_from_source(code)
    call_nodes = psyir.walk(Call)
    inline_trans = InlineTrans()
    # The second Routine in the file is main_sub.
    main_sub = psyir.walk(Routine)[1]
    # Inline both calls to to_inline.
    for call in call_nodes:
        inline_trans.apply(call)
    # Parallelise each outermost loop (this is where the KeyError is hit).
    loopy = OMPLoopTrans()
    loopy.apply(main_sub.children[1])
    loopy.apply(main_sub.children[0])
    # Enclose the whole body in a parallel region.
    parallelt = OMPParallelTrans()
    parallelt.apply(main_sub.children[0].children[:])
    print(fortran_writer(psyir))
    # Deliberate failure so that pytest shows the printed output.
    assert False
I feel like this should be ok - we inline the loops, then add parallelism to the outermost loop followed by a parallel region around the whole code. This doesn't even reach codegen - when attempting to parallelise the first loop I hit an error: KeyError: "Could not find 'a' in the Symbol Table."
The call stack is
../../../psyir/transformations/omp_loop_trans.py:265: in apply
super().apply(node, options)
../../../psyir/transformations/parallel_loop_trans.py:195: in apply
self.validate(node, options=options)
../../../psyir/transformations/parallel_loop_trans.py:152: in validate
if not node.independent_iterations(dep_tools=dep_tools,
../../../psyir/nodes/loop.py:424: in independent_iterations
return dtools.can_loop_be_parallelised(
../../../psyir/tools/dependency_tools.py:751: in can_loop_be_parallelised
symbol = symbol_table.lookup(var_name)
when calling loopy.apply(main_sub.children[1])
Edit: I think that this failure is a failure of dependency tools not considering child scopes of nested loops, as opposed to the InlineTrans though, as fixing the issues we see for tasking doesn't prevent this occurring later.
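As a rough illustration of that hypothesis (and only that, I haven't confirmed this is what dependency_tools actually does), here is a toy in plain Python. The Scope class and both lookup methods are hypothetical and are not PSyclone APIs. A lookup that starts at the outer k loop and only walks upwards cannot see a symbol that inlining placed in the innermost loop's table.

```python
class Scope:
    """Hypothetical scope: a set of names, a parent and child scopes."""
    def __init__(self, names=(), parent=None):
        self.names = set(names)
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def lookup(self, name):
        """Search this scope and then its ancestors only."""
        scope = self
        while scope is not None:
            if name in scope.names:
                return scope
            scope = scope.parent
        raise KeyError(f"Could not find '{name}' in the Symbol Table.")

    def lookup_with_descendants(self, name):
        """Search this scope and nested child scopes, then fall back upwards."""
        stack = [self]
        while stack:
            scope = stack.pop()
            if name in scope.names:
                return scope
            stack.extend(scope.children)
        return self.lookup(name)


routine = Scope({"i", "j", "k", "l"})
k_loop = Scope(parent=routine)
i_loop = Scope(parent=k_loop)
j_loop = Scope({"a"}, parent=i_loop)  # 'a' arrives here via inlining

# Upwards-only lookup from the loop being parallelised misses 'a'.
try:
    k_loop.lookup("a")
    found_upwards = True
except KeyError:
    found_upwards = False

# Also walking the nested scopes finds it in the innermost loop.
found_scope = k_loop.lookup_with_descendants("a")
```

If the real failure works like this, either the lookup has to consider nested scopes or the symbols have to live in a table that is an ancestor of the loop being analysed.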
Great. That means we can farm it out to @hiker ;-) (possibly!)
I've pushed my simplified (now just generic PSyIR rather than GOcean) test to 2421_omp_clause_bug.
I guess then we need to decide which fix we prefer.
Either we say this is InlineTrans's job and it has to search upwards for a Routine to merge the symbol table with, or it's an issue unique to OpenMP CPU parallelism, in which case we make it "Fortran-specific" for now and have the OMP Parallel directive merge all child ScopingNode symbol tables during lowering (before clauses are computed/updated).
I guess one question I have is can the lowering function know what language the output is going to be and only do it if it's going to be Fortran? That feels like something that lowering probably doesn't/shouldn't know/care about? But the clauses and languages do have slightly different rules here in the standard sometimes anyway.
I'm hoping that we can fix the OpenMP directive but I haven't yet got to the bit of code that's responsible for generating the clauses. It feels as though it is simply not looking at all of the SymbolTables that it contains?
Well, it's complicated because if we're being language-independent it shouldn't have to, as by the OpenMP standard variables declared inside an OpenMP region are considered to be private automatically (since they have no existence outside the region anyway, it's fine).
In Fortran that concept (local scopes) doesn't exist, so I'm not sure what is more sensible to do.
Oooh, that is complicated. I think part of the problem is that the list of variables to put in the private clause is obtained using the Node.reference_accesses method and that flattens References to Signatures and assumes that if two Signatures have the same name then they are the same variable. I'm beginning to think that your first suggestion of being picky about which symbol table InlineTrans adds things to is the simplest way forward.
However, this is also a bug in reference_accesses that we somehow need to fix. I've created #2424 for this.
- Lower all of the tree to language level - at this point the clauses on OpenMP directives are created / refreshed if needed
- Codegen starts visiting nodes for codegen - when it reaches routine_node in the Fortran backend, there is code to do whole_routine_scope.merge(schedule.symbol_table) for every Schedule in the routine - at this point it renames a bunch of variables inside those scopes, but anything outside of those scopes (which OpenMP directives need to be by definition) doesn't understand the renames (and can't safely since the original name can be used inside the OpenMP directive still, so renaming isn't sufficient).
You were quite right about this (it just took me a long while to understand). In theory, the renaming should be sufficient because the clauses on the OMP directive should refer to Symbols (that retain their identity while being renamed). However, in OMPParallelDirective.infer_sharing_attributes we go from Symbols -> Signatures -> Symbols and thus lose any distinct Symbols that are in different tables but happen to have the same name. This needs to be solved by #2424.
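The loss of identity in that round trip can be shown with a tiny sketch in plain Python. The Symbol class here is a hypothetical stand-in, and using bare name strings in place of Signatures is a simplification of mine, not how PSyclone represents them.

```python
class Symbol:
    """Hypothetical stand-in for a symbol: a name plus a datatype."""
    def __init__(self, name, datatype):
        self.name = name
        self.datatype = datatype


# Two distinct symbols that share a name but live in different tables.
outer = Symbol("a_clash", "integer")
inner = Symbol("a_clash", "real")
accessed = [outer, inner]

# Symbols -> names -> Symbols: the name-keyed lookup keeps only one of them.
names = {sym.name for sym in accessed}
by_name = {sym.name: sym for sym in accessed}
round_trip = [by_name[name] for name in names]

# Keying on object identity instead keeps both.
by_identity = {id(sym): sym for sym in accessed}

print(len(round_trip), len(by_identity))  # 1 2
```

Whichever symbol wins the name lookup, the other one gets no sharing attribute of its own, which matches a_clash ending up shared in the generated code above.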
I've been trying a bit to do inlining + OpenMP loop transformation (weird performance results) and I cannot get this to work at all. I attempted to have InlineTrans merge the scopes at a higher level (to avoid the dependence analysis failures), but this fails due to it believing one of the symbols is an argument (I think?) to a Call, so it refuses to (or believes it's impossible to) rename either of the colliding symbols.
I'm not quite sure why this then doesn't cause issues the way it works at the moment (where it's inlined to a local scope and then merged to the routine scope later) for the tasking transformation; maybe something happens at some point during lowering/? that converts them into standard symbols.
I attempted to have InlineTrans merge the scopes at a higher level (to avoid the dependence analysis failures), but this fails due to it believing one of the symbols is an argument (I think?) to a Call and refuses/believes its impossible to rename either of the colliding symbols.
By this do you mean you've changed InlineTrans so that it adds new Symbols to the table of the Routine it is inlining into?
Yes I attempted to - it was very naive in that all I did was change table = node.scope.symbol_table (line 136, 2nd line in apply) to table = node.ancestor(Routine).scope.symbol_table but then if I remember correctly the table.merge at line 161 was then failing.
@arporter I realised my error when talking to Rupert yesterday.
I was trying to merge the symbol tables early (before inlining was complete), which was failing as it was attempting to merge symbol tables of a symbol used in a Call, which is not allowed. This is basically failing because the call is still present in the code, so I'd suggest instead a 2-step process:
- Do the table.merge as it currently takes place.
- Once the Call has been replaced, merge the node.scope.symbol_table with the node.ancestor(Routine).scope.symbol_table and empty the scope's symbol table.
Edit: I added changes like this into PSyclone to test this with NemoLite2D and it allows me to do inlining + OMPLoopTrans (though it requires {force: True} for one of the loops due to it being a "race condition", which Sergi says probably is not real).
I can make a PR for this if you think this is a reasonable solution?
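A rough sketch of the 2-step idea in plain Python, for discussion only. The tables are plain dicts of name -> datatype and the merge helper is hypothetical, it is not the change I made in PSyclone, and it ignores updating the references in the inlined body.

```python
def merge(target, source):
    """Move the entries of source into target, renaming on a clash.

    Returns a mapping of old name -> new name.
    """
    renames = {}
    for name in list(source):
        new_name, count = name, 0
        while new_name in target:
            count += 1
            new_name = f"{name}_{count}"
        target[new_name] = source.pop(name)
        renames[name] = new_name
    return renames


routine_table = {"i": "integer", "v_n": "real"}  # v_n from an earlier inline
loop_table = {}
callee_table = {"v_n": "real", "depn": "real"}

# Step 1: merge the callee's symbols into the local scope, as happens today.
merge(loop_table, callee_table)

# Step 2: once the call has been replaced, hoist everything into the
# routine's table, which leaves the local table empty.
renames = merge(routine_table, loop_table)

# Names are final from here on, so clauses computed later see them.
private_clause = sorted(renames.values())
print(private_clause)  # ['depn', 'v_n_1']
```

The point of the second step is that any clash is resolved while the transformation still has control, rather than during code generation when the clauses already exist.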
@arporter I think this is the issue I mentioned yesterday.
When running OMPTaskTrans (with manually predone inlining trans) there's an issue with renamed variables not appearing in data sharing clauses.
When inlining momentum_u_code we get the expected behaviour:
firstprivate(j_out_var,j_el_inner,u_e,depe,u_w,depw,v_sc,v_s,deps,v_nc,**v_n**,depn,
(there's an argument over whether these should be firstprivate or not but that's not really important for now).
When then inlining momentum_v_code, the inlining transformation renames v_n to v_n_1 to prevent collision (makes sense), however the private/firstprivate clauses (and dependency clauses) on this task still use v_n instead of v_n_1.
I think this is happening as the parent parallel clause doesn't seem to know anything about v_n_1, and thus it is considered to be shared by the tasks, so shouldn't be declared private or firstprivate, however this doesn't explain why the task's clauses contain v_n which is never used in the task, so shouldn't be inside the clauses unless renaming is happening later than I expect?