stfc / PSyclone

Domain-specific compiler and code transformation system for Finite Difference/Volume/Element Earth-system models in Fortran
BSD 3-Clause "New" or "Revised" License

Add support for insertion of monitoring hooks #119

Closed: arporter closed this issue 6 years ago

arporter commented 6 years ago

This was previously ticket 748 on the MO SRS (https://code.metoffice.gov.uk/trac/lfric/ticket/748).

In the discussion on that ticket it was agreed that we should support a generic interface that takes the names of both the calling procedure and module as arguments:

call profile_start( module_name, procedure_name )
...
call profile_stop( module_name, procedure_name )

arporter commented 6 years ago

Rather than have to do string matching in the profile_stop() call, I would advocate that the profile_start() call return an integer identifier for the region being created. This could then be passed to profile_stop():

call profile_start(module_name, procedure_name, region_idx)
...
call profile_stop(region_idx)

Having said that, though, the granularity with which we would expect to use this profiling API should mean that there is significant work between the start and stop calls, so the timing overhead should only be a small percentage.
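
For clarity, the proposed signatures could be written out as an explicit interface (a sketch only - the argument types are assumptions, since only the call sites have been shown so far):

interface
  subroutine profile_start(module_name, procedure_name, region_idx)
    character(len=*), intent(in) :: module_name, procedure_name
    integer, intent(out) :: region_idx
  end subroutine profile_start
  subroutine profile_stop(region_idx)
    integer, intent(in) :: region_idx
  end subroutine profile_stop
end interface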

arporter commented 6 years ago

For want of a better way of bringing people in, @rupertford, @MatthewHambley were involved in the discussion on the original ticket.

hiker commented 6 years ago

I'm not sure how you envision the profiling calls being added - you mention a modification to gen(). While I understand the reasoning in https://code.metoffice.gov.uk/trac/lfric/ticket/748 for supporting only one API in PSyclone, I would be interested in supporting TAU; the TAU API, however, requires a custom data type as an additional parameter, which I am not sure can be easily handled:

subroutine F1()
integer profiler(2) / 0, 0 /
save    profiler
call TAU_PROFILE_TIMER(profiler,'f1()')
call TAU_PROFILE_START(profiler)
...
call TAU_PROFILE_STOP(profiler)
end

Your approach would at least force me to re-implement parts that are already in TAU (the mapping of a data type ('profiler' in my example, 'region_idx' in yours) to a region/subroutine name) - and then also to dynamically manage the TAU data structure for each region index.

My approach would have been to use a transform, since this would allow me to also add the declaration, so it would appear to be less work. I admit I can't easily evaluate how best to support TAU (a custom transform would give more flexibility, e.g. if you want to measure individual kernels in one invoke, the transform could add the start/stop calls around all the outer loops).

arporter commented 6 years ago

Thanks for your input @hiker - I have to admit I'm not familiar with the details of using TAU, but is there anything stopping us from implementing the data structure within the interface to TAU? E.g. something like:

module profiling
  integer, parameter :: MAX_TIMERS = 100
  integer :: profiler(2, MAX_TIMERS)
  integer :: next_free_timer

contains

subroutine profile_init()
   next_free_timer = 1
end subroutine profile_init

subroutine profile_start(tag1, tag2, itag)
  ! Assumed-length tags avoid having to fix the string length here
  character(len=*), intent(in) :: tag1, tag2
  integer, intent(out) :: itag
  itag = next_free_timer
  call TAU_PROFILE_TIMER(profiler(:,itag), trim(tag1)//trim(tag2))
  call TAU_PROFILE_START(profiler(:,itag))
  next_free_timer = next_free_timer + 1
end subroutine profile_start

subroutine profile_stop(itag)
  integer, intent(in) :: itag
  call TAU_PROFILE_STOP(profiler(:,itag))
end subroutine profile_stop

end module profiling

I think (IIRC) that a transform approach was what we imagined - as you say, it gives us flexibility in choosing what to time/profile.

TeranIvy commented 6 years ago

From the above discussion it seems to me that a generic interface needs the specific profiler name as well (Dr Hook, TAU, user defined). Some options/issues I thought of are below.

hiker commented 6 years ago

@arporter Yes, of course it can be done - but not as simply as you state: first of all, you need locks (which is reasonably trivial). But (a bit more work) you also need to check whether a certain code region has already been created (profile_start might be called more than once), and certain advanced TAU features (grouping events, distinguishing phases in a computation) will not be accessible.

Admittedly these are not very strong reasons, and now that I've noticed that region_idx can of course be an output variable of the profile_start call, we don't need to use a string (module+procedure name) to check whether a loop is being executed for the first time - as I had somehow imagined when writing my original reply.

So after some more thinking (literally the whole day ... this is kind of the third answer I type ;) ):

Using a simple API and implementing an interface to existing profilers is easy to do, and implementing the API for an existing profiler is reasonably easy. We will lose some features (e.g. TAU's grouping of events and distinct phases), but I admit that's acceptable. If a user wants more sophisticated profiling, they can still write custom transforms to do so. More changes to a user's build environment might be necessary (linking in the psyclone->profiler lib), but again, no big deal. An additional benefit: you can decide at link time which profiler to use (at least for custom regions). TAU, for example, will instrument every subroutine (and OMP region etc.), so de facto it will not be possible to simply relink to use Dr Hook instead of TAU, but it should work for simpler profilers.

Making PSyclone directly create the various calls for the various profilers would be more work to implement and maintain. I tried to come up with a way to generically specify code creation for just TAU and Dr Hook, and already failed (well, it would have become too complicated).

So, in short: I am happy to support a standard profiling API and then have a wrapper that interfaces to the actual profiler used. One suggestion: add an init function:

integer :: region_idx
call profile_init(module_name, procedure_name, region_idx)

do i=1, number_of_iterations
  call profile_start(region_idx)
  ... 
  call profile_stop(region_idx)
enddo

This somewhat reduces the overhead of calling init several times when there is an outer loop.

I guess it becomes a question of which profiler we make easier to support: the above is obviously geared towards TAU, while the Dr Hook API would need to store the mapping of name to region_idx. With the original suggestion above, it would be the TAU interface's task to store similar info (as described by @arporter). I am OK with both :)

MatthewHambley commented 6 years ago

I can't see a way of directly replying to a comment so I'll collect some thoughts here.

Given that the proposed profile_start returns a single value, does that not make it a candidate to be a function?
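
(For illustration, a function form might look like the sketch below - the body is hypothetical and simply hands out the next free slot; a real implementation would also record the two names:)

function profile_start(module_name, procedure_name) result(region_idx)
  character(len=*), intent(in) :: module_name, procedure_name
  integer :: region_idx
  integer, save :: next_free_timer = 1   ! hypothetical counter
  region_idx = next_free_timer
  next_free_timer = next_free_timer + 1
end function profile_start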

More importantly, I think what is being described for handling profiling libraries with more complicated calling conventions is a context variable. It's one way of implementing object orientation if your language doesn't explicitly support it.

It can be done as an opaque handle, which might be a key into a hash table. Alternatively, it can be done as (switching to C-style terminology) a void * pointer to an arbitrary data structure.

The point is it allows calling code to handle implementation specifics without knowing anything about them.

We might implement this using an abstract profile_context_type which each implementation extends. The Dr Hook implementation would simply contain an integer, while the TAU version would hold one of those special TAU data types.
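
(A minimal sketch of that idea, assuming hypothetical module and type names:)

module profile_context_mod
  implicit none

  type, abstract :: profile_context_type
  end type profile_context_type

  ! Dr Hook only needs an integer handle ...
  type, extends(profile_context_type) :: drhook_context_type
    integer :: handle = 0
  end type drhook_context_type

  ! ... while TAU carries its two-integer timer structure.
  type, extends(profile_context_type) :: tau_context_type
    integer :: profiler(2) = 0
  end type tau_context_type

end module profile_context_mod

Calling code would then declare a class(profile_context_type), allocatable variable and pass it to profile_start/profile_stop without ever inspecting the concrete type.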

I appreciate this adds complexity to the profiling callipers and therefore alters the result.

arporter commented 6 years ago

Rupert and I had a chat about this today (we were finally in the same office at the same time). We like Matthew's suggestion of an opaque data type, the content of which depends on which profiler is being targeted. As Iva says, the onus will be on the build system to link in the correct implementation of the (generic) profiling API as well as any profiling libraries themselves. The key thing is for us to agree on the generic profiling API - we can then all go away and implement this API for our profiler of choice :-)

Although some sort of init routine would be useful in principle, it is not clear to us how PSyclone would insert such a call - PSyclone only works with Invokes and is not privy to any other information about the application. The best we could think of was something like:

if (first_time_called) then
  perform init
end if

within the generated PSy layer. It isn't pretty but wouldn't add too much overhead.
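
(Concretely, such a guard might look like the following sketch, where profile_init is taken to be the library-level set-up routine under discussion:)

subroutine ensure_profiling_initialised()
  ! Hypothetical helper: performs the one-off library set-up exactly once.
  logical, save :: initialised = .false.
  if (.not. initialised) then
    call profile_init()
    initialised = .true.
  end if
end subroutine ensure_profiling_initialised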

MatthewHambley commented 6 years ago

I think initialising the profiling tool is the responsibility of the model driver layer. PSyclone may assume it has been done.

Apart from anything else there is a fair chance that things at the driver level will need to be profiled, so it must be set up before that.

rupertford commented 6 years ago

So we seem to be in agreement that a generic API is the best way to go for PSyclone and that a simple start(handle), stop(handle) implementation would work (ignoring the issue of initialisation for the moment), with the handle being either an opaque object or a context variable (e.g. an integer).

We are also in agreement that it would be useful to be able to use transformations to place the generic API calls, as this gives flexibility in where and when they are added. Note that we might still end up providing a simple 'with profiling' option to the top-level psyclone script which does something generic, like instrumenting all invokes, but that depends on users' requirements.

This API is essentially about profiling regions of code - we are not talking about general events - so I think profile_start and profile_end are reasonable names. The other API that would be useful to have at some point is one that provides data (fields etc.), which could be useful for debugging or 'online' visualisation, for example.

I agree with @MatthewHambley that users could add their own profiling calls outside of regions and that this should be fine from PSyclone's point of view. Of course, this practice could be disallowed through some coding convention, but that is not really an issue for PSyclone.

Where I disagree with @MatthewHambley is with the idea that we can assume the user will have set up all profiling handles outside of PSyclone. I think we need to assume that PSyclone will need to set up profiling handles. The main reasons for this are 1) it is a pain for the user, and 2) manually written and generated code could easily get out of sync.

The problem, as @arporter stated, is that PSyclone currently knows nothing about a code other than what is within an invoke call.

So, assuming PSyclone needs to set up profiling handles, I see three options:

1) What @arporter suggested: do it through the profile_start(...) API. The problem with this, as @hiker pointed out, is performance and the need for locking, which would have to work for all parallelisation strategies. This latter point worries me, as we would need to support all future targets.

2) We support separate profile_init calls, which must be added within an invoke but outside any shared-memory parallel region. This is probably the most pragmatic solution.

3) The user adds a single profile_init() call to their code at the appropriate place and PSyclone creates the appropriate initialisation code within that subroutine.

Are there any others?

I would vote for option 2.

hiker commented 6 years ago

@arporter and @rupertford suggested that I work on this ticket, so here I go :)

1) Besides TAU, Dr Hook and the UM-internal timers, are there any other tools I should consider? I had a quick look at Arm (Allinea's) MAP, but it appears to be line-oriented only, not region-oriented (according to a team member who attended MAP training recently).

2) Re using either an opaque data type or a simple index: I will see which is cleaner, easier to understand, and has less overhead. One minor adjustment: I would prefer to add a 'save' to the declaration (I am using an integer index as an example; an opaque type can be implemented similarly):

   integer, save :: region_idx=0

The save attribute would allow profile_start to detect whether the variable has been initialised, e.g. inside profile_start:

   if (region_idx == 0) then
     region_idx = getNextIndexThreadSafe()    ! or whatever
   end if

3) I am not sure whether we had a misunderstanding when I mentioned initialisation: there are two kinds of initialisation - one for the whole profiling library (which I think you were mostly talking about in this thread) and one for each region_idx (which is what I was talking about). Regarding the latter: for now I'll hide the library-specific per-region initialisation in profile_start, so Dr Hook etc. won't have additional overhead - though depending on the amount of work I might revisit this later.

4) Profile-library initialisation: I assume the profiling library must be initialised after MPI_Init (in the case of MPI), i.e. we can't use a constructor for a static object (which would be called before 'main') to initialise the profiling library transparently. We could perhaps override MPI_Init using the MPI profiling interface, but that doesn't work for non-MPI codes. Initially I will assume that the user adds a profiler_init(...) call somewhere manually, and we can discuss other options later.

5) I like @rupertford's suggestion of a command line flag to automatically instrument all invokes. I would actually suggest all invokes plus each kernel (including the PSyclone-created loops).

6) Using a transform to allow more specific instrumentation: I need to study the code in more detail, but since it is necessary to declare a variable for each profiled region, we need a way to get from the schedule(s) that a transform receives back to the enclosing outer program unit in order to add the declarations (or to add the program unit as a parameter to the transform, which is ... atypical compared with other transforms). I need to study PSyclone in more detail to decide how to handle this - hopefully I can just follow the parents up until I reach the subroutine/function statement.

7) Where do you want me to add the profiler-specific Fortran interfaces (e.g. profiler_drhook.f90, profiler_tau.f90 or so)? Is a separate subdirectory next to src OK?

Have I forgotten anything?

rupertford commented 6 years ago
  1. I think we need a dummy profiling library, and in general we would also like to link to Andy Porter's timer library - dl_timer, which is on Bitbucket - but that is a job for us! I would keep it as simple as possible for this issue and either just support a dummy profiling library or additionally choose one real profiling library (the one most useful to you, I guess :-)).

  2. Andy and I had a discussion about this. Yes, a save attribute would allow more efficient code, as you would be able to avoid the init in subsequent calls. This would be generated by PSyclone in any case, so it should not matter how it is implemented.

  3+4. I was probably mixing up the two. Would it be too much of an overhead to include an if test to see whether the profiler has been initialised and, if not, initialise it? We would have to test in any case to catch the case where the user has not, wouldn't we? As for initialising each region, yes, I was suggesting a separate (potentially optional) init.

  5. We could think about more options.

  6. You will be in psy-layer code-generation land. This is quite a nice place to be. For example, you can declare variables and the underlying f2pygen library will bubble them up to the correct place.

  7. That's a good question. Andy and I have talked about this as well. We think a dummy profiling library would be needed and suggest adding it in lib/dummy_profile, or something similar. We were not sure whether it makes sense to include implementations in the PSyclone repo or not. In the end we agreed that any implementations could be put in a new contrib directory at the top level of the project. If a generic profiling interface turns out to be useful beyond PSyclone, we could move the implementations into their own repo at a later date.

hiker commented 6 years ago

Good idea with the dummy library - tick. Two more questions, the first going back to point 6) above:

1) I know that adding variables etc. at creation time of the PSy layer is trivial. I am concerned/confused about implementing a transform that does the same (which would allow the user more fine-tuning of the regions to measure). I had a look at the information that is available to a transformation - it is usually only the schedule or nodes. As far as I can see, these do not contain enough information to allow the declaration of a variable, which would make it more or less impossible to add profiling calls using a transform - am I missing something? I can see a few ways around it.

For now I would suggest focusing on implementing the profiling calls when generating the PSy layer directly, and discussing a potential implementation as a stand-alone transform later.

2) There is one potential issue with the profiling interface: if it is made thread-safe using locks, this implies that the main program must be compiled (and linked) with threading enabled, which can impact performance (I have seen a few percent of performance lost just by compiling with OpenMP, without even using it - likely due to the use of a thread-safe runtime library, which adds overhead). So how can we support both thread-safe and non-threaded codes?

Any suggestions?

rupertford commented 6 years ago

I'll answer each question in turn

1) PSyclone transformations only modify the PSyclone IR (what we call a schedule); they do not get involved in code generation. You would need to add new classes to the IR which represent profile_init, profile_start and profile_stop. You can use existing classes as templates, e.g. HaloExchange or Loop in psyGen.py. They all inherit from Node. These classes generate the Fortran code via a gen_code() method. The job of the transformation would be to place instances of the profile_init, profile_start and profile_stop classes at the appropriate locations in the schedule.

2) (Andy and) I think we leave it to the profiling library.

hiker commented 6 years ago

More questions:

1) What is the best way to ask questions (e.g. about coding-style preferences)? Just post them here? Email?

2) What is your preference regarding command line handling? At the moment the command line options are passed through a few levels, e.g. via main to generate() to PSyFactory. For profiling, (I think) I need to then pass the profiling flags to PSyFactory.create(), to the various PSy classes (e.g. GOPSy), and then to the PSy base class. While I have implemented this so far, I am wondering whether you would prefer a separate class that stores all command line options and can be queried from everywhere? Fewer arguments to pass around (which might make pylint happy), though on the other hand a dependency on the config module in many places.

There is already a config file, but it appears to be read-only (except for config.DISTRIBUTED_MEMORY, apparently because of some testing problems).

3) I'll initially be working on a branch at: https://github.com/hiker/PSyclone/tree/profiling-interface with a small project page at: https://github.com/hiker/PSyclone/projects/1?fullscreen=true

rupertford commented 6 years ago

1) Either would be OK, but I would go for email if it is a general question. In theory the wiki should give information on coding style, so if something is unclear we should make sure we capture it there if and when it is agreed.

2) I think you could use the config module. If I remember correctly we no longer have testing problems since we started using monkeypatch to modify classes and variables in tests.

3) The project option looks nice. Andy has used this but I have not yet.

Note, I am not (yet) convinced by the proposed profile API!

hiker commented 6 years ago

Just an intermediate update:

I have created a Profile node which can be inserted into the graph (https://github.com/hiker/PSyclone/commit/1c468b792c8588218c9a7797d45e4042673dc675), though I guess slightly differently from what you had in mind (I only now noticed that you recommended inserting profile_start/stop nodes): I insert the Profile node as the parent of KernelCall (in psyGen), which as far as I can see means it will work for all APIs. It looks like this in dynamo:

Schedule[invoke='invoke_gcr_iterloop_group3' dm=True]
    [Profile]
        Loop[type='dofs',field_space='any_space_1',it_space='dofs', upper_bound='ndofs']
            Call inc_a_times_x(const,v)
    [Profile]
        Loop[type='dofs',field_space='any_space_1',it_space='dofs', upper_bound='ndofs']
            Call inc_a_times_x(const,s)

or in gocean:

GOSchedule[invoke='invoke_1',Constant loop bounds=True]
    [Profile]
        Loop[type='outer',field_space='cu',it_space='internal_pts']
            Loop[type='inner',field_space='cu',it_space='internal_pts']
                KernCall compute_unew_code(unew_fld,uold_fld,z_fld,cv_fld,h_fld,tdt,dy) [module_inline=False]
    [Profile]
        Loop[type='outer',field_space='cv',it_space='internal_pts']
            Loop[type='inner',field_space='cv',it_space='internal_pts']
                KernCall compute_vnew_code(vnew_fld,vold_fld,z_fld,cu_fld,h_fld,tdt,dy) [module_inline=False]

The ProfileNode's gen_code() will then declare the variable and add the profile_start/end calls, e.g.:

      TYPE(ProfilerData), save :: profile
...
      CALL profile_start("compute_cu_code", "psy_shallow", profile)
      DO j=2,jstop
        DO i=2,istop+1
          CALL compute_cu_code(i, j, cu_fld%data, p_fld%data, u_fld%data)
        END DO 
      END DO 
      CALL profile_end("compute_cu_code", "psy_shallow", profile)

Compared with creating separate start/end nodes, the advantage is that this will always produce properly nested profile_start/stop calls - if we had separate profile_start and profile_stop nodes, the profile_stop could be moved to the wrong place (e.g. inside the loop), resulting in a mismatch between start and stop calls. Then again, you know PSyclone better than I do ... perhaps I am missing something?

I still need to create a transform that can insert the Profile node into the graph, e.g. from a custom script (at the moment I only support a command line option, --profile kernels).

There is one problem though:

    [Profile]
        HaloExchange[field='grad_p', type='region', depth=1, check_dirty=False]
        HaloExchange[field='p', type='region', depth=1, check_dirty=True]
        HaloExchange[field='div_star', type='region', depth=1, check_dirty=True]
        HaloExchange[field='hb_inv', type='region', depth=1, check_dirty=True]
        Loop[type='',field_space='any_space_1',it_space='cells', upper_bound='cell_halo(1)']
            KernCall scaled_matrix_vector_code(grad_p,p,div_star,hb_inv) [module_inline=False]

The halo exchanges are created inside the profile region :( I need to study the code a bit further to see what to do about this - suggestions welcome :)

I noticed one minor additional problem: I used the name of the (first, in the case of several kernel calls merged into one loop) kernel as the subroutine name for the profiling API - but it is of course entirely possible for the same kernel name to be used more than once:

  call invoke(set_zero(a), set_zero(b))

So I added the name of the kernel to the NameSpace object; in the case above, the first profile would be called 'set_zero' (as the subroutine-name parameter) and the second becomes 'set_zero_1'. I hope that's acceptable.

I also have the impression that I might hit a snag when I try to implement profiling calls for whole invokes: as far as I can see, I would need to work at the level of Invokes (or PSy, which sits on top) - and neither of those classes has Node as a base class (e.g. they work on a self.invoke_list, not on a list of children). I need a bit more time, but any insight is welcome :)

rupertford commented 6 years ago

Hi Joerg,

1) Adding a single Profile node, rather than start/stop nodes, is the correct way to do it.

2) A transformation will allow you to add the Profile node where you like. I'm guessing that, at the moment, you are adding the node too early, i.e. before halo exchanges are added? You need to do it after halo exchanges have been added. Actually, you need to do it after any transformations have been applied.

3) Using the namespace manager sounds a little dodgy, as it is a singleton used for variable names, and I guess you are using it to create a unique string.

4) We are limited to profiling within a single invoke. This should not be too bad, as all the work is done within invokes. If profiling is required over multiple invokes, we could use the same name and accumulate values, I guess, but the underlying library would have to support that somehow. In particular, there is no obvious way to tell the profiling system to output its accumulated info, unless it always does this or we add another node to do it?

hiker commented 6 years ago

@rupertford, @arporter: can I have a very, very brief review of the basic implementation of the profiling interface? Before I finish the implementation and check that the API works with an actual Fortran library etc., I want to make sure that my graph modifications are what you expect them to be.

Here are some examples from example/dynamo/eg8:

$ psyclone --profile kernels --profile invokes -oalg alg -opsy psy -api dynamo0.3 ./helmholtz_solver_alg_mod.x90
...
      TYPE(ProfilerData), save :: profile_1
      TYPE(ProfilerData), save :: profile
...
      CALL profile_start("scaled_matrix_vector_code", "helmholtz_solver_alg_mod_psy", profile)
      CALL profile_start("unknown-kernel", "helmholtz_solver_alg_mod_psy", profile_1)
      DO df=1,grad_p_proxy%vspace%get_last_dof_owned()
        grad_p_proxy%data(df) = 0.0_r_def
      END DO 
      CALL grad_p_proxy%set_dirty()
      CALL profile_end(profile_1)
      CALL grad_p_proxy%halo_exchange(depth=1)
      IF (p_proxy%is_dirty(depth=1)) THEN
        CALL p_proxy%halo_exchange(depth=1)
      END IF 
      IF (div_star_proxy%is_dirty(depth=1)) THEN
        CALL div_star_proxy%halo_exchange(depth=1)
      END IF 
      IF (hb_inv_proxy%is_dirty(depth=1)) THEN
        CALL hb_inv_proxy%halo_exchange(depth=1)
      END IF 
      CALL profile_start("scaled_matrix_vector_code_1", "helmholtz_solver_alg_mod_psy", profile_2)
      DO cell=1,mesh%get_last_halo_cell(1)
        CALL scaled_matrix_vector_code(nlayers, grad_p_proxy%data, ...)
      END DO 
      CALL grad_p_proxy%set_dirty()
      CALL profile_end(profile_2)
      CALL grad_p_proxy%halo_exchange(depth=1)
      CALL profile_start("enforce_bc_code", "helmholtz_solver_alg_mod_psy", profile_3)
...
      CALL profile_end(profile_4)
      CALL profile_end(profile)
    END SUBROUTINE invoke_0

As discussed elsewhere, the set_dirty/clean calls are inside the profile regions (since they don't have a separate entry in the graph; they are part of the actual loop), but that's likely not a problem since the calls are cheap. The two command line options control whether the whole invoke and/or all kernels are instrumented. Some details are still missing (e.g. a 'use profile_module, only: profile_start, profile_end'; handling of builtins; ...). The schedule (when enabling both kernel and invoke regions) looks like:

Schedule[invoke='invoke_0' dm=True]
    [Profile]
        [Profile]
            Loop[type='dofs',field_space='any_space_1',it_space='dofs', upper_bound='ndofs']
                Call setval_c(grad_p,0.0_r_def)
        HaloExchange[field='grad_p', type='region', depth=1, check_dirty=False]
        HaloExchange[field='p', type='region', depth=1, check_dirty=True]
        HaloExchange[field='div_star', type='region', depth=1, check_dirty=True]
        HaloExchange[field='hb_inv', type='region', depth=1, check_dirty=True]
        [Profile]
            Loop[type='',field_space='any_space_1',it_space='cells', upper_bound='cell_halo(1)']
                KernCall scaled_matrix_vector_code(grad_p,p,div_star,hb_inv) [module_inline=False]
        HaloExchange[field='grad_p', type='region', depth=1, check_dirty=False]

I store the command line options to PSyclone in the Profiler class: https://github.com/hiker/PSyclone/blob/profiling-interface/src/psyclone/profiler.py - I am using some static methods/variables; is that OK? I have not noticed you doing this, but I prefer that the functions live in a namespace of their own, so you have to write Profiler.add_profile_nodes, which (imho) makes it more obvious where the functions come from.

class Profiler(object):
....
    @staticmethod
    def add_profile_nodes(schedule, loop_class):

This function uses a transform to add the profiling calls, see https://github.com/hiker/PSyclone/blob/profiling-interface/src/psyclone/transformations.py, line 1482. It should also be possible to use this transform in user scripts, without the command line options.

Code is created from the ProfileNode instances that have been added to the graph.

And the various APIs call add_profile_nodes to add the nodes if requested by the command line options, e.g. (https://github.com/hiker/PSyclone/blob/profiling-interface/src/psyclone/dynamo0p3.py):

class DynInvoke(Invoke):
    ...
    def __init__(self, alg_invocation, idx):
        ...
        # At the end of __init__:
        from psyclone.profiler import Profiler
        Profiler.add_profile_nodes(self.schedule, DynLoop)

I.e. the Invoke class constructor calls add_profile_nodes, which (if requested) adds an 'outer' ProfileNode, using a transform, for measuring all the kernels in one invoke, and (if requested) a ProfileNode around each kernel call.

Is this how you imagined it being implemented? I.e. are the ProfileNodes where you expect them to be, and is calling from the API-specific Invoke constructor OK as well?

Where do you want the transform to live? I would actually prefer to have the Profile transform in profiler.py - that way all the profiler-related code is in one file (except for each API's single call to add_profile_nodes) - but my guess is that you would prefer to have all the transforms in the one file.

Please don't do a full review - the code is not ready for that (I will start with a clean branch, since this branch has a lot of trial and error committed, mainly because I needed backups and had to keep different platforms in sync ;) ) - I just want to know whether I have missed any fundamentals :)

hiker commented 6 years ago

I've started a clean branch: https://github.com/hiker/PSyclone/tree/monitoring-hooks

Could you provide some feedback on my testing methods, e.g. https://github.com/hiker/PSyclone/blob/monitoring-hooks/src/psyclone/tests/profile_test.py

    # Convert the invoke to code and remove all newlines, to make regex matching easier
    code = str(invoke.gen()).replace("\n", "")
    correct_re = ("subroutine invoke.*"
                  "call profile_start.*"
                  "do j.*"
                  "do i.*"
                  "call.*"
                  "end.*"
                  "end.*"
                  "call profile_end")
    assert re.search(correct_re, code, re.I) is not None

This is a lot easier to maintain than comparing against the full Fortran code (the test for the use of the profile module is not there yet ... I know ;) ).

1) Do you consider this sufficient? Or do you prefer comparing against the full Fortran code (which imho would be more fragile to changes in output format)?

2) How do you want the tests organised? At the moment I intend to have them all in one file, profile_test.py (all = for all APIs). Would you prefer the profile tests in API-specific files, e.g. dynamo0p3_profile_test.py etc.? Or some other way?

arporter commented 6 years ago

Hi @hiker, I think that, in general, those regex checks are OK. It might be better to check the AST rather than the Fortran, and to have some separate tests that check that the Fortran obtained from a given AST is correct. I'd also want to see checks (in at least one case) that the correct arguments are generated for the call(s) to the profiler API.

Test arrangement is difficult because of the interdependencies between different functionalities. I think for profiling I'd have:

tests/
    profiling/
        dynamo0p3_profile_test.py
        gocean1p0_profile_test.py

Having said that, hopefully there are minimal differences between the use of profiling in the different APIs, and therefore it is probably not worth breaking the tests into separate files. We should include tests for the case where the AST is first modified by transformations.

arporter commented 6 years ago

Apologies also for not responding to your request for a "mini review" - I deferred to Rupert and then forgot all about it. I've just had a quick look and I think it's looking good. The use of static members seems sensible. The only issue I've spotted is that you add the profiling to the AST of the Invoke when the Invoke is created. This could cause problems when/if the Invoke is subsequently transformed. Therefore, I think this auto-instrumentation needs to be performed immediately prior to code generation.

For consistency, I think the profiling transform needs to be available from the transformations module. However, if Rupert agrees, it could live in the profiling module and just be made available in the transformations module through an import (similar to fparser making Fortran2003.py available through api.py).

arporter commented 6 years ago

Regarding the problem of initialisation: Rupert and I were thinking about that again the other day (in some context I can't now recall) and realised that we could introduce a specially-named Kernel that PSyclone recognises. It would then be up to the user to make sure they included an Invoke of that Kernel at the right place in the application.

hiker commented 6 years ago

Thanks for your feedback.

1) OK, I will add tests for the AST (and/or maybe replace some of the regexp ones ... though I admit I like them ;) ).

2) There is a reason why I have not added tests for the arguments: I am not entirely sure (yet) how to create the region names, since we need to avoid duplicated names. For the transform I will probably add an explicit argument and leave it to the user, unless I can think of something better. I'll post more details once I have made a bit more progress.

3) I'll add tests for transformed regions to which the profile transform is applied.

4) The command line options are meant as a 'first cut' only, i.e. for when you start looking at a code. After that, I expect the user to add profile transforms explicitly (I don't see an easy way to automatically 'recognise' what should be instrumented after arbitrary transforms have been applied - e.g. the transforms would need to be smart enough to exclude halo exchanges but not care about set_dirty). Note that I originally had the code created at code-generation time (though without inserting the node ... it actually feels wrong to modify the tree at code-generation time, tbh).

5) Interesting idea with the special invoke. I admit I am not that worried about the initialisation - even if we have a special kernel that is invoked, that still means a manual code change (putting the special invoke in or not) - unless this special kernel is ignored by default (and a command line option triggers PSyclone to create the right call). That seems a lot of work - what about automatically adding a 'call Profile_init()' at the beginning of each subroutine that is being instrumented? This function could do the actual initialisation of the runtime library exactly once (and would still be less overhead than doing it in every profile_start() call)?

hiker commented 6 years ago

@rupertford @arporter I have a first dummy profiling library working (not committed yet), i.e. compilation and linking work, and it prints the module and region name at start and end - hooray ;)
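
(For reference, such a dummy wrapper might look like the sketch below; the names follow the generated code shown above - profile_start takes the two names plus a ProfilerData handle, profile_end just the handle - but the bodies are purely illustrative:)

module profile_module
  implicit none
  type :: ProfilerData
    character(:), allocatable :: region_name, module_name
  end type ProfilerData
contains
  subroutine profile_start(region_name, module_name, profile)
    character(len=*), intent(in) :: region_name, module_name
    type(ProfilerData), intent(inout) :: profile
    ! Remember the names so that profile_end can report them.
    profile%region_name = region_name
    profile%module_name = module_name
    print *, "start profiling ", module_name, "::", region_name
  end subroutine profile_start

  subroutine profile_end(profile)
    type(ProfilerData), intent(in) :: profile
    print *, "end profiling ", profile%module_name, "::", profile%region_name
  end subroutine profile_end
end module profile_module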

1) Do we want to offer some compilation/linking support with psyclone? E.g.

psy-config --cflags
psy-config --libs

to get the include (mod) path and libraries to link with (which can then be used in a (c)make file or so)? Or should we leave this all to the user?

2) Do you require any testing of those libraries (e.g. dummy, TAU, Dr Hook, ...)? That would be a bit of work if we need to use different compilers - can we just use gfortran or so, to reduce the complexity of testing?

rupertford commented 6 years ago

Hi Joerg,

Sounds like good progress :-)

1) To my shame, I'm not sure what you mean by psy-config. Is it a standard thing? Does it provide the required includes and libraries as a text string, or does it somehow interface with cmake? Are there other tools that support this functionality, or would it be bespoke to PSyclone?

2) My initial reaction is no - we limit testing to the generation of the profiling hooks with a dummy library (if we are compiling as part of our tests). I would expect the actual linking to libraries such as TAU to be tested separately.

At the moment we compile tests where appropriate for the dynamo api but not for the gocean api. In that case we only test one compiler, which defaults to gfortran.

Regards

Rupert


arporter commented 6 years ago

Rupert beat me to it but Whoop!

  1. I was about to say we should leave this to the user, but then I looked back through this ticket and realised that we had agreed that the profiling interface library be included with PSyclone (at least for the moment). Therefore we will know where it is and can tell the user - so yes, it's a good idea. I'd leave it for a separate ticket though.
  2. I agree with Rupert - we only worry about the dummy lib, as we don't want to get into a world where we depend on e.g. TAU being installed.

hiker commented 6 years ago

I was thinking of something similar to pkg-config (though now that I've written that down, maybe we could even use pkg-config itself to provide the lines - I assume that should be reasonably easy). I had just thought of a simple look-alike, PSyclone-specific command that prints the flags to stdout (which should be usable from things like make/cmake): --cflags gives you the compiler flags (-I somewhere/contrib/dummy or so); --libs the linker flags (-L somewhere/contrib/dummy -l dummy_profile_lib). OK, if we add a separate ticket then I won't add this here for now and will leave it for later. Thanks!

hiker commented 6 years ago

I've decided to also add a simple, non-threadsafe timing implementation as an example (quite simple, nearly done), but I noticed one issue: I need a ProfileFinish() (or similar) call to print out the results at the end. This could potentially be simulated using C++ (declare a global static object; its destructor will be called when the program ends), but that is hacky and complicated, and I'm not even sure it works: afaik the destructor of a static object is called after stdout is closed, so we couldn't simply print data in the destructor anymore, and the Fortran variables/results are likely no longer available either (bad if your profiling library stores pointers to Fortran data structures for timing, in which case we would need to allocate all of this ourselves).

So, going back to @arporter's and @rupertford's idea of a special kernel invoke, I think that would provide a nice and flexible way of supporting both initialisation and finalisation:

 call invoke( ProfileInit() )
...
 call invoke( ProfileFinalize() )

which PSyclone then replaces with either nothing (if profiling is disabled, those statements simply disappear) or calls to ProfileInit/ProfileFinalize (if any kind of profiling is enabled). The disadvantage is of course that those names can then no longer be used as kernel names in a user program - but I think that's acceptable. Would that approach be OK?

And what about the spelling: ProfileFinalize() or ProfileFinalise()? Afaik 'z' is acceptable in all kinds of English, right?

rupertford commented 6 years ago

Andy has an existing profiling library if you would like to use that instead. We plan to provide an interface to it when the profile hooks are on master.

I think it would be preferable to have transformations which add profileinit and profilefinalise at appropriate positions, rather than the algorithm writer having to add them. This would avoid polluting the algorithm code and give flexibility in how the profiling is used. We could also potentially (in the future) extend transformations to add code to the algorithm layer outside of invokes, e.g. add profileinit at the start of algorithm file x and at the end of algorithm file y, or before invoke x and after invoke y. Having said that, I'm not strongly against having the option of the algorithm writer adding them as well as having a transformation, but I think a transformation is the place to start.

As for names, I go for s rather than z.

rupertford commented 6 years ago

Andy and I have just had a discussion about this and we think the following

The main problem seems to be the issue of printing the results, which should probably be kept separate from finalise.

The solution we would like to aim for in the long term is being able to add code to the algorithm layer (including files that do not contain invokes). We could then have an algorithm transformation which adds in the init, finalise and/or print calls.

The shorter-term solution is either a) to get users to add these calls manually outside of invokes, i.e. PSyclone is not involved (this implies the profiling API is public), or b) to have transformations which add them inside invokes. Options a) and b) are not mutually exclusive; it should be possible to do a mixture of the two.

What we think is less of a good idea is to add builtin profileinit, finalise and print names to invokes: existing invokes can add them via transformations, and if we are going to add a new invoke just for this case we might as well call the profiling library directly at that point.

hiker commented 6 years ago

Can you send me the URL for dl_timer? I can't find it. I want to add a wrapper for that library as well, to make sure I have all the corner cases handled. My timer library is extremely simple - more like a tutorial on how to use the PSyclone profiler API to create your own wrapper.

I am not sure it's worth distinguishing between finalising and printing the results, but then again I don't care too much whether it's two or three functions (I would go for two, so that the user (or the transform) only needs to add two calls (init/finalise) instead of three (init/finalise/printResults), but that's minor).

Re adding those calls: it appears that, for now, we accept that the user adds the profiler API calls manually, and we improve on this later in a separate issue. Works for me :)

arporter commented 6 years ago

All sounds good to me - I'd go for two functions too (with s's :-)). dl_timer is at: https://bitbucket.org/apeg/dl_timer

hiker commented 6 years ago

OK, I have a dummy profiling lib (printing 'start profiling' etc. in each function) and a simple timing library (https://github.com/hiker/PSyclone/tree/monitoring-hooks/contrib/simple_timing), which produces:

===========================================
 module::region                                         count           sum                     min             average                 max
 psy_inputoutput::eliminate_one_node_islands_code           1     0.128906250             0.128906250             0.128906250             0.128906250    
 psy_time_step_mod::swlon_adjust_code                      11      1.19921875             0.105468750             0.109019883             0.113281250    
 psy_time_step_mod::swlon_code                             11      4.38281250             0.394531250             0.398437500             0.406250000    
 psy_time_step_mod::swlon_update_code                      11      1.86718750             0.167968750             0.169744313             0.171875000    
 psy_time_step_mod::swlat_adjust_code                      11      1.23828125             0.109375000             0.112571023             0.117187500    
 psy_time_step_mod::swlat_code                             11      4.87890625             0.437500000             0.443536937             0.445312500    
 psy_time_step_mod::swlat_update_code                      11      1.87500000             0.167968750             0.170454547             0.179687500    
 ===========================================

and an interface to dltimer (https://github.com/hiker/PSyclone/tree/monitoring-hooks/contrib/dl_timer):

=============================== Timing report ===============================
Timed using POSIX timer. Units are seconds.
Reported resolution =  0.1000E-08 (s)
Effective clock granularity =  0.25997E-07 (s)
Measured systematic error in dl_timer API =  0.37790E-07 +/- 0.789E-09 (s)
Measured overhead in calling start/stop =  0.9411E-07 (s)
Measured overhead in calling start/stop for registered timer =  0.4725E-07 (s)
-----------------------------------------------------------------------------
Region                          Counts     Total       Average*     Std Err
-----------------------------------------------------------------------------
psy_inputoutput:eliminate_one_no     1  0.12603E+00   0.12603E+00  0.00E+00
psy_time_step_mod:swlon_adjust_c    11  0.12201E+01   0.11092E+00  0.28E-02
psy_time_step_mod:swlon_code        11  0.44050E+01   0.40046E+00  0.25E-02
psy_time_step_mod:swlon_update_c    11  0.18761E+01   0.17056E+00  0.45E-03
psy_time_step_mod:swlat_adjust_c    11  0.12325E+01   0.11204E+00  0.53E-03
psy_time_step_mod:swlat_code        11  0.50031E+01   0.45483E+00  0.26E-02
psy_time_step_mod:swlat_update_c    11  0.19000E+01   0.17272E+00  0.24E-02
-----------------------------------------------------------------------------
* corrected for systematic error
=============================================================================

To finalise the scope for this ticket and reach a state where I can create the pull request:

1) I'll add support for TAU.
2) I'll add support for Dr Hook.
3) Documentation will be added, likely in a separate chapter; the READMEs for the various libraries need to be updated.
4) More test cases will be added, as requested previously (esp. using the transform instead of the command line options).
5) Done: command line options to automatically instrument invokes and/or kernel calls. :heavy_check_mark:
6) Done: a transform that can be applied in a script to manually instrument regions. :heavy_check_mark:

I won't do any of the following, which can be handled later as separate tickets:

1) Support for automatically adding the ProfileInit and ProfileFinalise calls; at this stage the user has to add these manually.
2) Support for installation (of the wrapper libraries).
3) Improved compilation support, to avoid having to manually add include and link command line options (e.g. a pkg-config-like function that reports the required compilation and link options).

@rupertford and @arporter - let me know if you want anything else added to this ticket.

arporter commented 6 years ago

Hi @hiker, that sounds like great progress. If I were you I wouldn't bother with TAU and Dr Hook for this ticket - they can be done in a separate one.

hiker commented 6 years ago

OK, I am nearly done with this - the documentation is there, and nearly all the test cases. Two questions:

1) I had severe problems testing some of the error/exception handling, since with 'normal' usage nodes further up the tree would catch those problems earlier. So I had to implement a rather hacky approach: https://github.com/hiker/PSyclone/blob/8d955e9c877fb48a14652a4096de1cb18af33fff/src/psyclone/tests/profile_test.py#L160 - I create a dummy AST to pass directly to gen_code(). Is that acceptable to you? If not ... suggestions for testing welcome ;)

2) At the moment I have the profiling libs in a contrib subdirectory (https://github.com/hiker/PSyclone/tree/monitoring-hooks/contrib ... doc for this still coming), since this was suggested earlier. I wonder if this is a good location. I would suggest:

Basically, adding all the profiling libs under one subdirectory, which would be useful if we ever add some other supporting library to PSyclone (e.g. visualisation support, debugging support, ...).

I hope to have the pull requests in a few days.

rupertford commented 6 years ago

Hi Joerg,

  1. Yes, providing a dummy AST is fine if that is the simplest way to force the exception. As I'm sure you know, we tend to use monkeypatch where possible. However, forcing internal errors can become a dark art.

  2. I have no strong view here so would be happy for you to choose. If I had to pick, lib/profiling sounds good.

hiker commented 6 years ago

Could someone point me to the Dr Hook repository? It seems a bit embarrassing, but I can't find it - only some random/unspecified version that BOM people use in some of their codes, and some older links (walesnix.earthsci.unimelb.edu.au/svn/drhook/tags/um/8.0) that no longer seem to work.

MatthewHambley commented 6 years ago

As far as I am aware there is no Dr Hook repository. It is developed internally by ECMWF and they roll a tarball from time to time if you ask them nicely.

arporter commented 6 years ago

#171 has been merged, closing this issue.