stfc / PSyclone

Domain-specific compiler and code transformation system for Finite Difference/Volume/Element Earth-system models in Fortran
BSD 3-Clause "New" or "Revised" License

[Transformations] Add OpenMP tasking transformations. #1338

Open LonelyCat124 opened 3 years ago

LonelyCat124 commented 3 years ago

As part of Excalibur we need to add the ability to generate OpenMP task code.

I think there are a few likely requirements to have this working, some of which I'm happy to at least look at.

I discussed with Rupert, as a first idea, creating a transformation for the OMP TASKLOOP directive. This is probably almost a drop-in replacement for omp parallel do, except that I expect it would need the addition of a parallel node and a taskloop node.

The one to use explicit tasks is more complex. The NemoLite2D code blocks the outer loops:

#define TASK_SIZE some_integer_value

for(jj = internal_start; jj <= internal_stop; jj += TASK_SIZE){
  int end = min(jj + TASK_SIZE, internal_stop);
  // TASK PRAGMA HERE, jj is firstprivate
  for(j = jj; j <= end; j++){
    // Inner loop(s) and kernel(s).
  }
}

I will hopefully have some time to look at the taskloop directive soon.

LonelyCat124 commented 3 years ago

As a question - is the OMPLoopTrans separate from the code that generates the OpenMP parallel code?

If so, I think I should be able to create OMPTaskloopTrans as a drop-in replacement for OMPLoopTrans as a starting point to see how these things fit together.

arporter commented 3 years ago

I think the answer to your question is "yes" - it is separate. However, there's a check in the code-generation that the resulting OMP directive is within a parallel region.

LonelyCat124 commented 3 years ago

Ok - I'm going to work through the documentation to work out how things fit together tomorrow and go from there.

LonelyCat124 commented 3 years ago

Ok - I realised I need another thing that doesn't yet exist: OpenMP's single/master directives, as well as taskwait.

From what I can work out it looks like I could create as many transformations as needed to be applied inside a code, and then just apply them in innermost-first order? So for example:

    from psyclone.parse.algorithm import parse
    from psyclone.parse.utils import ParseError
    from psyclone.psyGen import PSyFactory
    from psyclone.errors import GenerationError
    api = "gocean1.0"
    filename = "nemolite2d_alg.f90"
    ast, invokeInfo = parse(filename, api=api, invoke_name="invoke")
    psy = PSyFactory(api).create(invokeInfo)
    print(psy.invokes.names)

    from psyclone.psyGen import TransInfo
    t = TransInfo()
    ltrans = t.get_trans_name('OMPTaskloopTrans')
    single_trans = t.get_trans_name('OMPSingleTrans')
    parallel_trans = t.get_trans_name('OMPParallelTrans')

    schedule = psy.invokes.get('invoke_0').schedule
    schedule.view()
    new_schedule = schedule

    for child in schedule.children:
        newschedule, memento = ltrans.apply(child, reprod=True)
        schedule = newschedule

    #Enclose all of these loops within a single OpenMP SINGLE region
    ##Original code here has rtrans.omp_schedule here - how is this applied to the loops?
    newschedule, memento = single_trans.apply(schedule.children)
    ##Enclose the SINGLE in an OpenMP parallel region
    newschedule, memento = parallel_trans.apply(schedule.children)

The code in the documentation has rtrans.omp_schedule("dynamic, 1") as a line that I don't quite understand - my assumption is that rtrans is a transformation of type OMPParallelTrans, which has no omp_schedule function that I can see. Should that instead be ltrans.omp_schedule(...)?

The other alternative would be to create an OMPParallelSingleTrans, but I think this is less useful, especially given that our discussions already anticipate mixing tasking and loop-based parallelism.

Another thing (similar to how loops can have a schedule associated) is that this single directive may sometimes want nowait attached. I assume I can treat this similarly to the omp_schedule part of an OMPLoopTrans?

The part I'm less sure about is how taskwait barriers would/should be applied. With taskloop, these can be used instead of tasks with dependencies, in a coarse-grained way.

My base assumption is that for a given loop, all of the operations are independent (i.e. doing index i+3 before index i will not be wrong; it will merely break reproducibility).

What I then need to be able to do is use the PSyclone-computed dependencies between loops, so if I had 3 loops

A(a,b)
B(c,d)
C(a,c)

with read/write dependencies on the first/second arguments, then I'd aim to generate something like

#pragma omp parallel
#pragma omp single
{
  #pragma omp taskloop grainsize(...)
  DO i=1, n
    A(a,b)
  END DO
  #pragma omp taskloop grainsize(...)
  DO i=1, n
    B(c,d)
  END DO
  #pragma omp taskwait
  #pragma omp taskloop grainsize(...)
  DO i=1, n
    C(a,c)
  END DO
}

So I'd need to find dependencies between loops, and upon finding them add a taskwait directive - how should I do that/where would that go? I assume this needs to be another separate Transformation to be applied somehow? (Does PSyclone have explicit transformations for barriers or anything similar?)

sergisiso commented 3 years ago

I just realised there are a few outdated things in the documentation example you are using that we don't do anymore. These are the returned schedule and memento (just return None if you want) and the transformation properties like omp_schedule (these are usually implemented as option dicts now; transformations should only have apply and validate public methods, e.g. rtrans.apply(schedule.children, options={"omp_schedule": "dynamic,1"})).

So your code could look like:

# Enclose all of these loops within a single OpenMP SINGLE region
single_trans.apply(schedule.children)
# Enclose the SINGLE in an OpenMP parallel region
parallel_trans.apply(schedule.children)

What you will need is to create the new OMP directive PSyIR nodes (these are currently in src/psyclone/psyGen.py, but PR #1327 moves them to src/psyclone/psyir/nodes/omp_directives.py).

Create transformations that simply add these directives (single, taskloop, taskwait, ...) to an existing tree node, thinking about which "options" and "validations" each needs individually.

Then I would create a meta-transformation (a transformation that in turn applies other transformations) that looks more broadly at calculating the dependencies between loops and when taskwait should be given, and apply the previous transformations to achieve it.
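
As a rough sketch (and only a sketch - OMPTaskloopTrans, OMPSingleTrans and OMPParallelTrans are the transformations proposed in this thread, not an existing API), such a meta-transformation might look like:

    from psyclone.psyGen import Transformation


    class OMPTaskingTrans(Transformation):
        '''Sketch of a meta-transformation: turn every top-level loop
        in a schedule into a taskloop, then wrap everything in single
        and parallel regions. Dependency analysis and taskwait
        insertion would slot in before the enclosing regions are
        added.'''

        def apply(self, schedule, options=None):
            # Hypothetical transformations proposed in this thread.
            ltrans = OMPTaskloopTrans()
            single_trans = OMPSingleTrans()
            parallel_trans = OMPParallelTrans()

            # Copy the child list, as applying a transformation
            # replaces each loop node with a directive node in place.
            for child in schedule.children[:]:
                ltrans.apply(child)

            # TODO: compute inter-loop dependencies and insert
            # taskwaits here.
            single_trans.apply(schedule.children)
            parallel_trans.apply(schedule.children)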

From your questions:

I could create as many transformations as needed to be applied inside a code, and then just apply them in innermost-first order?

Yes, but the order of applying them is not important as long as the resulting tree is as expected (we can check at generation time that things are coherent, but I wouldn't worry about this at this stage).

Another thing (similar to how loops can have a schedule associated) is that this single directive may sometimes want nowait attached. I assume I can treat this similarly to the omp_schedule part of an OMPLoopTrans?

Better to use an "options" dict argument, I think.
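
For example (a sketch only - "nowait" as an option key is a guess at how the new OMPSingleTrans might expose the clause):

    # Mirrors the options-dict style shown above; the "nowait" key is
    # an assumed name, for illustration.
    single_trans.apply(schedule.children, options={"nowait": True})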

My base assumption is that for a given loop, all of the operations are independent (i.e. doing index i+3 before index i will not be wrong; it will merely break reproducibility).

This is a PSyKAl model assumption, yes.

So I'd need to find dependencies between loops, and upon finding them add a taskwait directive - how should I do that/where would that go?

The kernels have node.forward_dependence() and node.backward_dependence() methods. I would start by checking whether these methods provide the dependencies you are expecting.
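
A quick way to try this, reusing the schedule from the example above (assuming Loop can be imported from psyclone.psyir.nodes):

    from psyclone.psyir.nodes import Loop

    # Print each loop's forward dependence to check that the reported
    # targets match the expected inter-loop dependencies (e.g. the
    # A -> C dependence in the three-loop example above).
    for loop in schedule.walk(Loop):
        print(loop, "->", loop.forward_dependence())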

LonelyCat124 commented 3 years ago

Once the single implementation is complete, I'll probably do an omp master implementation next (which I think is best to inherit from the OMP Single code, as it is essentially a specialisation of that in which the nowait option is implicit), followed by taskloop after that.

The design for taskloop will inherit again from ParallelLoopTrans and likely be similar to OMPLoopTrans but support slightly different inputs:

  1. Optional grainsize with some default (32?)
  2. Optional num_tasks with no default - this overrides grainsize if present. It will also disallow collapse, as OMPLoopTrans does at the moment.

It also has slightly different constraints, in that it must have both an ancestor of type OMPParallelDirective (excluding OMPParallelDoDirective) and an ancestor of type OMPSingleDirective (which will also cover OMPMasterDirective, as described above).

Taskloop does allow specification of private, firstprivate and other data clauses. I think for now the data clauses can be captured from the parent parallel region (and so not specified on the taskloop directive). The one exception, which I'm not sure is currently handled in PSyclone, is reduction data clauses: these have to be treated specially for task-based constructs and specified as in_reduction(reduction-identifier : list). It will have to enforce reprod=False for now at least.
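
Under that design, usage might look something like this (a sketch - OMPTaskloopTrans and its option names are the proposal above, not an existing API):

    # Proposed interface: grainsize and num_tasks are mutually
    # exclusive, with num_tasks overriding grainsize if both are given.
    taskloop_trans = OMPTaskloopTrans()
    taskloop_trans.apply(loop, options={"grainsize": 32})
    # ...or instead fix the number of tasks:
    taskloop_trans.apply(other_loop, options={"num_tasks": 4})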

LonelyCat124 commented 3 years ago

Ok - from discussion with Rupert, I will implement taskwait next as 2 PRs:

  1. The OMPTaskwaitDirective - this is a standalone piece of work which can enable manual creation of valid task-based programs (perhaps to test with NemoLite2D soon).
  2. The OMPTaskwaitTransformation - this is the transformation that creates 0 or more OMPTaskwaitDirectives in a section of the tree. If given a section of the tree with no taskloops it will raise an exception. This transformation will look at the section of the tree provided and compute the (minimum) set of taskwaits needed to correctly handle dependencies between loops. I don't believe all loops will need to be taskloops, but it should still correctly handle dependencies where there are non-taskloops.

LonelyCat124 commented 3 years ago

Ok, so now that I've implemented the NemoLite2D example using taskwait directives and the taskloop/single/parallel transformations, I've thought a bit about how the taskwait transformation can work.

I think the taskwait transformation should be applied to an OMPParallelDirective, which is the first thing to check in apply. If this succeeds, then we need to work out the dependencies for all of the OMPTaskloopDirectives in the subtree of this OMPParallelDirective. I'm not sure exactly how to loop through the tree in a depth-first search until we find a non-OMPDirective or an OMPTaskloopDirective, but that shouldn't be too complex.

The more complex thing is that I don't think we have many guarantees on the structure of this subtree, though we know we will have one or more subtrees like:

OMPSingleDirective(..., OMPTaskloopDirective,...)

We know we only have to worry about the OMPTaskloopDirectives inside the same single directive, so for each OMPSingleDirective we can compute the forward dependencies of each of its OMPTaskloopDirectives. We can then loop through these dependencies in order and check:

  1. Is this dependency in the same OMPSingleDirective? If not, we can ignore it, as the barrier already handles it.
  2. If it is, we store the position of the node within the single region (i.e. 0, 1, 2, ...).

Once we have these dependencies, we can remove unnecessary ones like this:

    # Loop through the dependencies and remove unnecessary ones.
    # For each dependence, we loop over all other dependences between
    # the nodes involved in that dependence. If this dependence
    # fulfills a following dependence, the following dependence is
    # removed. If this dependence is instead fulfilled by an earlier
    # following dependence, this dependence is removed.
    for i in range(len(next_dependencies)):
        if next_dependencies[i] is not None:
            next_dependence = next_dependencies[i]
            for j in range(i+1, next_dependence):
                # next_dependence may have been tightened inside this
                # loop, so re-check the bound.
                if j >= next_dependence:
                    break
                if next_dependencies[j] is not None:
                    if next_dependencies[j] <= next_dependence:
                        next_dependence = next_dependencies[j]
                        next_dependencies[i] = None
                    if next_dependencies[j] > next_dependence:
                        next_dependencies[j] = None

Once we have done this, we can loop through this section in reverse order and add OMPTaskwaitDirectives in the locations we computed.
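
In plain Python, that final step might look like this (a sketch: single_schedule stands for the OMPSingleDirective's schedule, OMPTaskwaitDirective is the node from PR 1 above, and next_dependencies holds, per taskloop, the position of its first intra-region dependent, as computed above):

    # Collect the positions that must be preceded by a barrier, then
    # insert in reverse order so that earlier insertion indices are
    # not shifted by later insertions.
    barrier_positions = sorted(
        {dep for dep in next_dependencies if dep is not None})
    for pos in reversed(barrier_positions):
        single_schedule.children.insert(pos, OMPTaskwaitDirective())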

@rupertford does this seem like a reasonable implementation for this transformation?

LonelyCat124 commented 3 years ago

I will probably work on the OMPTaskwaitTrans one day this week (probably tomorrow) unless anyone has specific concerns about my implementation plan.

I do still have a few questions about the tree structure (maybe @sergisiso is best placed to answer?). I know that the head of my tree will be an OMPParallelDirective, which may contain one or more OMPSerialDirectives. How can nodes inside such a node be structured? The OMPSerialDirective will have a schedule containing many nodes - do all of my OMPTaskloopDirective nodes that I care about have to be direct children of the OMPSerialDirective, or can there be some middle layer between the loop directives and the top layer?

sergisiso commented 3 years ago

How can nodes inside such a node be structured? The OMPSerialDirective will have a schedule containing many nodes - do all of my OMPTaskloopDirective nodes that I care about have to be direct children of the OMPSerialDirective, or can there be some middle layer between the loop directives and the top layer?

I think there is no restriction; there can be middle-layer nodes if you want (I see the OMPTaskloopDirective only wants an ancestor, at any level, which is an OMPSerialDirective). But note that NemoLite2D is very flat - it just has the loops with the kernel inside - so I think it will all be at the same level.

LonelyCat124 commented 3 years ago

Hmm, ok - the difficulty then, for general PSyclone trees that don't map so cleanly onto an array, is how to work out the dependencies - I will think.

sergisiso commented 3 years ago

Have you checked the functionality in from psyclone.psyir.tools import DependencyTools?

LonelyCat124 commented 3 years ago

No - I'll take a look; Rupert had only told me about the get_forward_dependence etc. functions available on Nodes.

sergisiso commented 3 years ago

There is also from psyclone.core import VariablesAccessInfo.

I think forward/backward_dependence only work at the level of PSy-layer fields, but the other can deal with any variable dependency at any level. That said, the NemoLite2D kernels just modify the field data, so it should be enough to give you the high-level dependency view (the other parameters are just immutable integers).

You can generate a DAG with the schedule.dag() method to see if the dependencies are what you expect. If they are, then the high-level forward/backward_dependence methods may be enough.
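
For example (assuming Kern can be imported from psyclone.psyGen, as in other PSyclone scripts):

    from psyclone.psyGen import Kern

    # Write out the dependence graph for offline inspection...
    schedule.dag()
    # ...and cross-check it against the high-level dependence methods.
    for kern in schedule.walk(Kern):
        print(type(kern).__name__, "->", kern.forward_dependence())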

LonelyCat124 commented 3 years ago

One thing that doesn't quite seem to fit compared with other transformations is the _directive class member. Since this can make many directives, am I ok to just not create the member? From the looks of it, this is just something used by the apply method, so I shouldn't need it in this case?

I also wonder if there's a preferred way to distinguish between thrown exceptions, so users can decide whether or not to ignore different ones? In this case the validate function will raise a TransformationError if not supplied an OMPParallelDirective, but I want to also raise an exception if there are no OMPTaskloopDirectives in the tree. The latter is not really an error, but from originally discussing with Rupert he thought it might be good to flag it so users are aware (and can ignore it if they want), but since it raises a TransformationError for a more serious error, I'd rather raise a different error type if I can? Are there other options available?

Edit: Also to double check - the .walk function is depth-first?

sergisiso commented 3 years ago

One thing that doesn't quite seem to fit compared with other transformations is the _directive class member. Since this can make many directives, am I ok to just not create the member?

You don't need this attribute. If I understand correctly what you are doing, you probably want to subclass directly from Transformation and just implement the apply and validate that you need.

if not supplied an OMPParallelDirective, but I want to also raise an exception if there are no OMPTaskloopDirectives in the tree.

Just to mention that another option would be a higher-level transformation where you provide a regular tree and the transformation adds the OMPParallelDirective, OMPSingleDirective, OMPTaskloopDirectives and OMPTaskwaitDirectives as appropriate, similar to what you have in the PSycloneBench PR. You don't need a transformation for each one if it's easier to construct.

but since it raises a TransformationError for a more serious error, I'd rather raise a different error type if I can?

You can maybe provide an option in the transformation option map, e.g. {'fail_on_no_taskloops': True/False}, and still use a TransformationError.
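
A sketch of a validate along those lines (OMPParallelDirective and OMPTaskloopDirective are the nodes discussed in this thread; the option key is just the suggestion above):

    def validate(self, node, options=None):
        '''Sketch: reject non-parallel-region targets outright, but
        make the "no taskloops found" failure opt-out via options.'''
        options = options or {}
        if not isinstance(node, OMPParallelDirective):
            raise TransformationError(
                "OMPTaskwaitTrans must be applied to an "
                "OMPParallelDirective.")
        if (options.get("fail_on_no_taskloops", True) and
                not node.walk(OMPTaskloopDirective)):
            raise TransformationError(
                "No OMPTaskloopDirective found in the supplied region.")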

the .walk function is depth-first?

Yes, it recurses down before going to the siblings.
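
So for a schedule whose children are two loops, each containing a kernel, walk visits the whole subtree of the first loop before the second loop. A quick way to see the order (Node being the base class of all PSyIR nodes):

    from psyclone.psyir.nodes import Node

    # Pre-order traversal: each node is reported before its children,
    # and a whole subtree is reported before the next sibling.
    print([type(n).__name__ for n in schedule.walk(Node)])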

LonelyCat124 commented 2 years ago

Next on the list is explicit tasks. The basic transformation here is pretty straightforward: #pragma omp task, plus additional clauses.

The clauses I'm going to implement initially are just the firstprivate and depend clauses.

For firstprivate, this is applied to any parent loop variables, so we'd have to find all the loop ancestors of the node being transformed, pull out the symbol name for each loop variable and apply that. If the loop end or loop step are non-constant, these should also be firstprivate.
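
A sketch of that collection step, using the PSyIR ancestor/Loop API (simplified - any non-constant stop/step expressions would need the References inside them collected in the same way):

    from psyclone.psyir.nodes import Loop

    # Walk up through all the enclosing loops of the node being made a
    # task, recording each loop variable as a firstprivate candidate.
    firstprivate = []
    enclosing = node.ancestor(Loop)
    while enclosing is not None:
        firstprivate.append(enclosing.variable.name)
        enclosing = enclosing.ancestor(Loop)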

For depend, it's a bit more complex and probably depends on the symbol table. For each variable inside the region the task is being applied to, we need to determine:

  1. Whether it is written to or read from
  2. Is it shared or private in the parent parallel region (or other enclosing region if appropriate)? If it's private then we can ignore it; if it's shared then we have to worry about it.
  3. Shared variables that are read or written appear in depend clauses, and it must be determined whether they are arrays or not (which I assume we can access from the symbol table). If they're arrays, we need to check the access pattern and compare it to the length of the loop (which is tricky if the loop size is a variable). For arrays which are just indexed by array(i), we only need a single depend clause for that array, on array(start_loop:end_loop). For arrays which use +1 or -1, we also need to include array((start_loop+(end_loop-start_loop)) : (start_loop + 2*(end_loop-start_loop))) (for plus). I need to check whether the indexing here is inclusive.

An alternative to the full array-index check is instead to just put the dependency on start_loop and then +/- the loop length. This is just as safe and could be quicker (since there is no need for the runtime to do a full analysis of the loop bounds; it can instead compare a single memory address).
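
As a concrete illustration of the two schemes for an access a(i+1) inside do i = start_loop, end_loop (pure string-building, just to show the shape of the clauses):

    # Scheme 1: depend on the whole shifted range of the array.
    length = "(end_loop - start_loop)"
    full_range = (f"depend(in: a(start_loop + {length} : "
                  f"start_loop + 2*{length}))")
    # Scheme 2: depend on a single representative element shifted by
    # the loop length - cheaper for the runtime to compare.
    single_element = f"depend(in: a(start_loop + {length}))"
    print(full_range)
    print(single_element)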

One additional check that is probably worthwhile is a validate function inside something (maybe the OMP Serial region, though that's a bit weird) checking that all explicit tasks that use the same loops have the same loop behaviour, since dependencies of:

array(1:32)
array(1:64)

are not dependencies according to the OpenMP spec.

LonelyCat124 commented 2 years ago

@sergisiso Do you think this should be a ParallelLoopTrans extension? In theory there's nothing preventing it being applied to any statement (though unless that statement is computationally intensive - some sort of function call - it might not be worth doing). I guess since I'm saying it has to be a function call (and PSyclone doesn't know much about CodeBlock nodes, especially w.r.t. reads/writes), it's probably not worth allowing that, given the additional headaches the implementation might cause?

sergisiso commented 2 years ago

Yes, I would go for the region that we fully understand (no Codeblocks).

If we then find a use case with codeblocks we have two options: generalise the transformation if possible, or decide to support the feature that causes the codeblock (which would make the current transformation still appropriate). But we don't have to worry about it unless we really find a use case where this is important.

LonelyCat124 commented 2 years ago

Ok, I think I need to write out what I need to handle before I return to coding, because it seems complicated and I keep losing my thought process.

We then need to distinguish between variables that are read and variables that are written. As far as I can see, there are 3 types of statement that can exist within a Loop.walk that I am planning to handle (@sergisiso, have I missed any statement types here?):

Array accesses: Array accesses are a little more complex than just generic Reference nodes. Any array access that is a shared variable (which I imagine is most or all of them) needs to have its indexing changed. There are 3 types of array accesses I think I'm happy to support:

Edit: I think any firstprivate variable is a valid reference to use here. Edit 2: I think the depend clause on just the lower value may be better/more reasonable to implement.

I also realise at this point I've paid no attention to reductions and task reductions, so I'll have to worry about those at some point too.

LonelyCat124 commented 2 years ago

I'm also slightly unclear how to check Literal < stop-init, since it's very reasonable for stop and init not to be Literals (and that is probably the most common use case). @sergisiso @arporter any idea if this is likely to happen? I don't know what access patterns are common in GOcean and NEMO respectively (nor how often arrays are accessed using something other than the loop variable).

LonelyCat124 commented 2 years ago

Writing out an example to try to work this out:

do j=1, N, 256
    i_end = min(j+256, N)
    do i=j, i_end
        a(i) = b(i) + c(i)
        do inner = d(3), d(4)
            var = var + 1
        end do
        do inner = d(var), d(var)+9
             var = var * 2
        end do
        a(i) = a(i) + var
    end do
end do

We apply the task pragma to the loop on line 3; the basic idea would then be:

#pragma omp task firstprivate(j, i_end) private(i, var, inner) shared(a, b, c, d) depend(in: b(j), c(j)) depend(out: a(j))

The complexity here is what to do for d. The options I can see are:

  1. depend(in: d). This one is simple to implement, i.e. if a private variable is used for an array access, then we can't safely do better than creating a dependency on the full array. The problem is that I don't think this guarantees safety if a previous task uses depend(in: d(j)); however, I think there needs to be a check built into OMPSerialRegion that task dependencies don't lead to race conditions anyway.
  2. depend(in: d(3), d(4)) (ignoring the last loop here). If a Literal is used for an array access, we can safely add it to a depend clause, provided we have the check in OMPSerialRegion.
  3. Reject the access pattern, and only allow array accesses where the index is firstprivate (+/- Literal). This restricts the loop structures available for the task directive; however, I think for most parallel loop structures this is a relatively low likelihood anyway.

I think I will go with 3 for now.

LonelyCat124 commented 2 years ago

I think Loop and IfBlock are now mostly implemented. Assignment is mostly implemented, but I am yet to handle the following cases:

  1. A BinaryOp appears inside an ArrayIndex. This is not implemented at all, for reasons I'll discuss below.
  2. StructureReference in the RHS of an Assignment is also not handled yet.

Both of these types of nodes do have implementations elsewhere in the code, but I think that functionality is wrong.

For BinaryOp, the children of a BinaryOp could include another BinaryOp, e.g. a+b+c would be binaryOp(a, binaryOp(b, c, ADD), ADD) (or similar). This gets complicated inside an ArrayIndex when pulling out the appropriate index. Additionally, the only array access allowed right now in other sections is firstprivate (+/- Literal). This is probably also wrong: it should be (Reference +/- Literal) or (Literal +/- Reference) - either order should be fine. Additionally, requiring the access variable to be firstprivate shouldn't strictly be necessary, though it is much easier to handle. Shared (or private) variables should also be ok, but there are complications with shared variables, as they also need to not be written to anywhere (and should be declared depend(in) as well).
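
A recursive validity check along these lines could cope with the nesting (a sketch - Reference, Literal and BinaryOperation are PSyIR node types, and the allowed forms are the ones described above):

    from psyclone.psyir.nodes import BinaryOperation, Literal, Reference

    def index_is_supported(expr):
        '''Sketch: True if an array-index expression is a Reference, a
        Literal, or any ADD/SUB nesting of those, so that both
        (Reference +/- Literal) and (Literal +/- Reference) pass.'''
        if isinstance(expr, (Reference, Literal)):
            return True
        if isinstance(expr, BinaryOperation) and expr.operator in (
                BinaryOperation.Operator.ADD,
                BinaryOperation.Operator.SUB):
            return all(index_is_supported(op) for op in expr.children)
        return False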

For StructureReference, I am currently handling these wrongly, just doing writer(ref), but I imagine here I'd get something like a%b, and according to the OpenMP standard this should only be a, so I need to see how to fix that. Edit: Sergi thinks I can just do x.symbol.name for this.

LonelyCat124 commented 2 years ago

Ok, those issues are "fixed" in the places where they appear now. I want to refactor all of this code into a single implementation I can reuse properly, as the current implementation is a total mess.

LonelyCat124 commented 2 years ago

I refactored the BinaryOperation handling code; some of the other bits I'm less sure how to manage, e.g. this code fragment appears a lot:

    for index in ref.indices:
        if type(index) is Reference:
            # Check this is a firstprivate var
            index_name = writer(index.symbol)
            index_private = (index_name in parallel_private)
            if index_private:
                if index_name in private_list:
                    raise GenerationError(
                        "Private variable access used as an index "
                        "inside an OMPTaskDirective which is not "
                        "supported.")
                elif index_name in firstprivate_list:
                    index_list.append(index_name)
                else:
                    firstprivate_list.append(index_name)
                    index_list.append(index_name)
            else:
                raise GenerationError(
                    "Shared variable access used as an index inside "
                    "an OMPTaskDirective which is not supported.")
        elif type(index) is BinaryOperation:
            # Binary Operation check
            OMPTaskDirective.handle_binary_op(
                index, index_strings, firstprivate_list, private_list,
                parallel_private, start_val, stop_val)
        else:
            # A type that is not allowed appears here.
            raise GenerationError(
                "{0} object is not allowed to appear in an Array "
                "Index expression inside an OMPTaskDirective.".format(
                    type(index).__name__))
However, it's not always quite identical; I'll need to look at it and refactor it where appropriate.

LonelyCat124 commented 2 years ago

I moved this into a function for all read-only references, which makes the code much cleaner and should make fixing errors easier.

I need to finish the Transformation and then begin testing this.

LonelyCat124 commented 2 years ago

Currently testing this with a simple code example. It fails at the

    for index in indices:

loop, as index is a List, which is not currently supported - probably an error in the design of this section of the code.

LonelyCat124 commented 2 years ago

@sergisiso raised a good point to me today. Most of my strategy to this point has essentially come from my experience of using NemoLite2D, which has fully inlined kernels and is relatively easy to deal with, as it doesn't seem to have much indirection in general.

For using tasking in LFRic as we planned, we will need to allow some indirection (since array accesses are not so straightforward) for explicit tasks. I think this will require (maybe?) a different OMPTaskDirective (e.g. DynTaskDirective or something) and perhaps a different DynTaskTrans as well. I think this should use domain knowledge explicitly to simplify the requirements in this case, as the current OMPTaskDirective is very general.

sergisiso commented 2 years ago

To clarify:

LonelyCat124 commented 2 years ago

Parallelising coloured loops with tasking presents a tricky problem. Parallelising the inner loops would be no problem, but there is inherently no parallelism between elements of different colours, or between different loops (unless there are no loop dependencies between them).

Once the generic version is working we should probably sit down and have a proper discussion about how we might do this (perhaps Tobias at Durham might have some good ideas too).

LonelyCat124 commented 10 months ago

Ok, so subject to resolving #2421 and #2415 (for the latter I have a workaround involving skipping validation of InlineTrans) we can now run NemoLite2D with tasking 🥳

Checksum is identical after 1k steps with the fix to #2421. Comparing to NemoLite2D with standard OpenMP parallelism we have (3 results, 8 threads, 1000 timesteps per run, on Glados so some noise is probable):

OpenMP loop: 55.45s, 60.912s, 62.841s - average 59.734s
OpenMP task: 42.676s, 50.055s, 48.840s - average 47.19s

I'm pretty happy to have comparable (or potentially better) results for tasking. I will note that these results are not necessarily a fair comparison right now, since the OpenMP task code inlines the function calls and I don't know what the compiler will do for the loop version. Additionally, we're doing loop chunking for the task code, which could improve performance, but I don't know (I feel this is less likely though). Either way, the task code not being significantly slower feels promising!

@rupertford there's definitely still some hurdles to jump over for LFRic but initial results for NemoLite2D look good at least (and it works!).

rupertford commented 10 months ago

That's great news @LonelyCat124

arporter commented 10 months ago

That's definitely what we would call a 'Pint Point'/'Cocktail Celebration'/'Beverage Brouhaha' :-)