stfc / PSyclone

Domain-specific compiler and code transformation system for Finite Difference/Volume/Element Earth-system models in Fortran
BSD 3-Clause "New" or "Revised" License
107 stars 28 forks source link

Better dependency analysis #1417

Open LonelyCat124 opened 3 years ago

LonelyCat124 commented 3 years ago

Summary:

  1. The self.sameParent(node) check in forward_dependency() limits the available dependency structure (and can easily cause invalid dependencies for models such as OpenMP).
  2. Removing the limitation on depth etc. for forward_dependency I think is important - it should only disallow children or ancestors of the node.
  3. Allowing users to specify the types of node to look for as dependencies would be a good solution, i.e. a user could specify node.forward_dependence( (OMPTaskloopDirective, OMPLoopDirective, Loop) ) and it would find the first dependency for which instanceof(forward_dependent, (OMPTaskloopDirective, OMPLoopDirective, Loop) ) returns True.

2&3 would satisfy the requirements I can think of for OpenMP at least.

Applying loop transformations seems to lose dependency information, e.g.

 assert (schedule1.children[0].forward_dependence() is schedule1.children[1])
 before = schedule1.children[0]   
 tloop.apply(schedule1.children[0])
assert (before.forward_dependence() is schedule1.children[1])

This is lost due to before being deeper in the tree than schedule1.children[1], if I apply the tloop to both nodes:

 assert (schedule1.children[0].forward_dependence() is schedule1.children[1])
 before = schedule1.children[0]
 before1 = schedule1.children[1]
 tloop.apply(schedule1.children[0])
 tloop.apply(schedule1.children[1])
 assert before.forward_dependence() is before1

This still doesn't work (before.forward_dependence() is still None).

However, if I do

    before1 = schedule1.children[1]
    tloop.apply(schedule1.children[0])
    assert schedule1.children[0].forward_dependence() is before1

This works fine again (because they're at the same level? I assume).

If I apply a the same transformation to schedule1.children[1] the forward_dependence switches to point to the new node correctly.

    a = schedule1.children[0]                                                                                                                                                  
   tloop.apply(schedule1.children[1])
    b = schedule1.children[1]
    assert a.forward_dependence() is b

The issue I'm then having, is that when applying further transformations to these nodes, the dependency information disappears:

    assert a.forward_dependence() is b
    sing.apply(schedule1.children[0])
    sing.apply(schedule1.children[1])                                                                                                                                          
   trans.apply(schedule1.children[0])
    trans.apply(schedule1.children[1])
    assert a.forward_dependence() is b

The last assert fails here, and also here:

    assert a.forward_dependence() is b
    sing.apply(schedule1.children[0:1])
    trans.apply(schedule1.children[0:1])                                                                                                                                       
    assert a.forward_dependence() is b

I think this is just due to me using apply wrong - it works for apply(schedule.children)

In the former case, this is safe for these specific transformations, as there are now barriers between a and b. However, it is easy to create an example where this is not safe:

sing = OMPSingleTrans(nowait=True)
...
sing.apply(schedule1.children[0])
sing.apply(schedule1.children[1])
trans.apply(schedule1.children)

I tried looking through the code to see why this dependency disappears (for the case where I apply to schedule.children[0:1]), and after the later transformations are applied, inside forward_dependence() this line (node.py:692) finds dependent_arg = cu_fld which is as expected (I'm not sure why the other args are None but not an issue for now)

The two nodes (self and node) both have depth 7, are not the same node, and are both OMPTaskloopDirective nodes, however this breaks because the schedule containing those nodes are not the same, and thus the self.sameParent(node) check fails, though as far as OpenMP cares they exist within the same barrier so are dependent (if you ran this code with those transformations without the taskloop transformations you would already have a race condition).

My proposal on how to fix this would be (as a temporary fix) probably to replace the self.sameParent(node) with a check to see if they are contained within the same child of the root (or child of child of the root? Essentially if they share a highest-level non-Schedule ancestor, be that an Invoke, OMPParallelDirective, ACCParallelDirective, etc.). I don't know what else this might break though, why does it check at the moment if they have the same parent?

What I need ideally for dependency tracking is the next-non-child and non-ancestor node which has a shared dependency, i.e. for some tree:

A -> B -> C
       -> D
   -> E -> F
   -> G -> H
J -> K -> L

B's next dependents could be any of E, G, J or K C's next dependents could be any of D, E, F, G, H, J, K, L

I was trying to work out if any requirement on >= depth would be helpful, but I don't think it would since we could have:

!$omp parallel
!$omp do
  LOOP_A
!$omp end do
!$omp single
!$omp taskloop
  LOOP_B
 !$omp end.....

And in this case LOOP_A has depth 3 whilst LOOP_B has higher depth. It would probably be useful to be able to specify the type of node we care about, e.g. node.forward_dependence( (OMPLoopTrans, OMPTaskloopTrans, ...) ) which finds the next of a specific type of instance to look for (which helps restrict what depth things can appear at, so e.g. an OMPLoopDirective doesn't return a following OMPParallelDirective if you restrict it, but rather whichever of the OMPParallelDirective's first children causes that dependency).

hiker commented 3 years ago

Could you take a step back and give a quick example of what you want to detect in the first place, independent of existing calls? We have a new dependency analysis already implemented (mostly ;) ) - if I could quickly see what you want to detect I could provide some guidance (if the new one can already do what you want, or if it can be extended, or ... if we need something new).

LonelyCat124 commented 3 years ago

The most obvious to me right now is the use case in #1415 In this, there is some number of OMPTaskloopDirective nodes, which may have some forward dependence to another node. Usually these dependencies will be to another OMP[Task]?LoopDirective, though there is no requirement on that. To be able to generate legal OpenMP code (i.e. no race conditions) it needs to be able to do the following:

  1. Ensure no dependencies occur between loops in separate OMPSingleDirective(nowait=True) or OMPMasterDirective subtrees.
  2. Ignore dependencies that occur between loops in separate OMPParallelDirective subtrees. I could just ignore these, but it would be preferable to be able to check this I think.
  3. Insert a (minimal) set of OMPTaskwaitDirective nodes to avoid race conditions. This is done by search for forward_dependences of each OMPTaskloopDirective and then removing superflous ones (since a single OMPTaskwaitDirective satisfies all dependencies that span it).

It seems like removing the self.sameParent(node) has resolved my immediate issues (or at least, my tests are beginning to pass), but this may not continue if I do more complicated systems.

Edit: Also in theory I should do something different if there is no nogroup clause on the OMPTaskloopDirective but I've not yet looked into that.

hiker commented 3 years ago

I don't have enough time today (NGMS Champions meeting), but I think the new DA might be quite useful. It is documented at https://psyclone-dev.readthedocs.io/en/latest/psyir_symbols.html#variable-accesses (and yes, it's hard to find, I have a ticket to move that around). It works somewhat different from the old code that was focused on getting next accesses. Basically, you can get a list of all variable accesses in a region. So if you consider your two region, you could do something like this:

va1 = VariablesAccessesInfo(region1_psyir)
all_sigs1 = va1.all_signatures    # See details in docs, for non-structured types a signature is basically just the variable name
va2 = VariablesAccessesInfo(region2_psyir)
all_sigs2 = va2.all_signatures
for sig1in all_sigs2:
    if va1 not in all_sigs2:
        # this variable is not at all used in the second loop and can be ignored
        continue
    # Otherwise this variable is used in both loops: get all access to this variable in each loop:
    access1 = va1[sig1]
    access2 = va2[sig1]
    if access1.is_read_only() and access2.is_read_only()
        # Variable is only read in both loops, it can be ignored
    if access1.is_written():
        # this variable is written in the first loop, and used (read or write, that shouldn't matter) in the 2nd loop
        # we need a wait between
        do taskwait here or set flag or whatever
    .... # Similar checks for read followed by write, and write-write .... though I have the feeling in all these cases a wait would be required.

You get the idea. There are some convenience functions missing (I am working on that). The accesses are sorted and have a statement number, so you can test if there is a read before a write etc (that's one place where we need something more convenient).

Also a caveat: if statements are not 'properly' handled, the information is collected as if both branches are executed. Typically that's fine, but if you have something like (pseudo-code):

   if something then
        a = b
    else
       b = a

you will get that there is a write access to a followed by a read access. At this stage you would have to manually detect and handle if-statements if required.

Am happy to provide more details. I will soon be getting back to this and start implementing more functionality. There is also a DepdendencyTool class that provides some higher level functionality, that's the best place to look at some examples. I thought I had a bit more on a branch somewhere (related to fusing loops, i.e. exactly the case where we have to handle two loops - all examples in the DepndencyTools handle only a single loop iirc), but I can't find that atm :)

LonelyCat124 commented 3 years ago

I think now that (fingers crossed) that draft PR is fully tested and "complete", I'll take a look at the new dependence analysis and see if that fits better.

LonelyCat124 commented 3 years ago

I had a look, and I think I can implement a forward_dependence style algorithm using a tree-walk for my own use-case.

One interesting type of Node that PSyclone has no concept of right now is a barrier, which in my opinion is a useful object for for parallel code dependence analysis. An easy example of how barriers change dependencies is this use case:

#pragma omp parallel
{
  #pragma omp single nowait
  {
    #pragma omp taskloop nogroup
    for(i=0; i < n; i++){
      a[i] = b[i] + 2;
    }
  }
  #pragma omp for
  for(i = 0; i < n; i++){
    a[i] *= 2;
  }
}

This code has a race condition, because nothing handles the dependency between the taskloop and the parallel for loop. There are two ways to handle this dependency (I think? I'm trying to double check that the first solution works):

  1. Delete the nowait clause from the single pragma. This adds an implicit barrier (which I think satisfies the dependency, i.e. the taskloops dependency now points at either the single or None)
  2. Add another barrier (either taskwait or ?barrier?) at either the end of the single region (taskwait) or after the end of the singleregion (barrier)

A potentially useful Mixin class might be a Barrier which has special rules associated with it (e.g. for any set of dependency arguments a Barrier counts as a forward_dependency or backward_dependency for all of them). This is not useful for low-level loop dependencies (i.e. for operations like fusing loops), but is potentially a valuable restriction for asynchronous parallel constructs (e.g. tasks and maybe some GPU operations).

Edit: For the openMP example above, only removing the nowait, removing the nogroup or adding an explicit omp barrier resolves the dependency, a taskwait does not. I'm not sure whether this is specified in OpenMP or is just an artefact of the compiler I'm using.

LonelyCat124 commented 3 years ago

@hiker in your example you do for sig1 in all_sigs2 - should that be so or should it be for sig1 in all_sigs1?

My guess is it should do:

for sig1 in all_sigs1:
  if sig1 not in all_sigs2:
    continue
...

Or do i misunderstand how this works?

LonelyCat124 commented 3 years ago

Ok, I have an implementation using this, but it is oversensitive right now as it catches things like loop variables (which are private, so are false dependencies). I probably need instead to loop through the children of my nodes and get the lists of all their non-Loop/etc. children's accesses and join those together in some way. I guess something like

vai = VariablesAccessInfo()
for child in taskloop.walk(Node):
  if not isinstance(child, Loop) and child is not taskloop:
    vai.merge(VariablesAccessInfo(child))

I will try doing something like this and see how it works.

Edit: This seems to work...:o

hiker commented 3 years ago

@hiker in your example you do for sig1 in all_sigs2 - should that be so or should it be for sig1 in all_sigs1?

My guess is it should do:

for sig1 in all_sigs1:
  if sig1 not in all_sigs2:
    continue
...

Or do i misunderstand how this works?

You are right, that's Indeed a typo. I didn't test the code :)

hiker commented 3 years ago

Ok, I have an implementation using this, but it is oversensitive right now as it catches things like loop variables (which are private, so are false dependencies). I probably need instead to loop through the children of my nodes and get the lists of all their non-Loop/etc. children's accesses and join those together in some way. I guess something like

vai = VariablesAccessInfo()
for child in taskloop.walk(Node):
  if not isinstance(child, Loop) and child is not taskloop:
    vai.merge(VariablesAccessInfo(child))

I will try doing something like this and see how it works.

Edit: This seems to work...:o

I can't commit about the taskloop, directive are not (yet) supported: the DA was meant to be used to detect (eg.) which omp directives to use. My best idea for this would indeed be to create code sections, and assemble the data access for each section.

Here what I do for loop variables (in the dependency tools), when analysing variables to see if they prevent loop parallelization:

        # Get the list of all loop indices
        loop_vars = [loop.variable.name for loop in loop.walk(Loop)]
        # Now check all variables used in the loop
        for signature in var_accesses.all_signatures:
            # This string contains derived type information, e.g.
            # "a%b"
            var_string = str(signature)
            # Ignore all loop variables - they look like reductions because of
            # the write-read access in the loop:
            if var_string in loop_vars:
                continue
            ...

I also should point out that the new DA is at a somewhat 'lower' level than the 'forward_dependence` approach: this DA gets all implicit arguments that PSyclone adds for LFRic (which can sometimes be surprising). I intend to make this optional later, once all of LFRic is using PSyIR.