metomi / fab

Flexible build system for scientific software
https://metomi.github.io/fab/

Streamlining list of files for some stages #300

Open hiker opened 2 months ago

hiker commented 2 months ago

Once #280 is merged, I want to get rid of the many different places from which files for compilation etc. are picked up. When I needed to introduce a new step before running PSyclone, I found it very hard to identify which artefact stores to pick files up from :(

First issue: preprocessing Fortran files copies .f90 files into the build directory, but does not add them to any artefact store. So a new step has no easy way of finding all Fortran files!

Follow-up example (analysis):

DEFAULT_SOURCE_GETTER = CollectionConcat([
    SuffixFilter('all_source', '.f90'),
    'preprocessed_c',
    'preprocessed_fortran',

    # todo: this is lfric stuff so might be better placed elsewhere
    SuffixFilter('psyclone_output', '.f90'),
    'preprocessed_psyclone',  # todo: this is no longer a collection, remove
    'configurator_output',
])

So this picks up a mixture of files from all_source (the aforementioned .f90 files that are copied but not added to an artefact), then the two preprocessed collections, then PSyclone output files, and configurator output. Imho these files should all be in ONE artefact store.

I don't necessarily mean abandoning the stores we have; it might make sense to have all PSyclone output files in one artefact (and ditto for the configurator etc.), but there should be a place where all .f90 and all C files can be found - which means in some cases adding files to more than one artefact store.
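For illustration, a minimal sketch of what "files in more than one artefact store" could look like, using a plain dict and made-up key names:

# Minimal sketch (key names are hypothetical): the same output file is added
# both to a tool-specific store and to a combined per-language store.
from pathlib import Path

artefact_store = {
    'psyclone_output': set(),
    'all_fortran_build_files': set(),
}

output_file = Path('build_output/some_kernel.f90')
artefact_store['psyclone_output'].add(output_file)          # tool-specific
artefact_store['all_fortran_build_files'].add(output_file)  # combined Fortran set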

There are two obvious ways of implementing this:

  1. We just use some new custom artefact names, with no further code change. This is easy; what I don't like here is that we then need to import and use the names of these new stores (see the sketch after this list).
  2. We supply custom functions (which will internally use a custom name, but that name does not need to be exposed).
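A minimal sketch of option 1, assuming a dictionary-like ArtefactStore; the key name below is made up:

# Option 1 sketch: a new custom artefact name shared as a module-level constant.
# Every step that wants these files has to import and use the exact name.
FORTRAN_BUILD_FILES = 'fortran_build_files'

def my_custom_step(artefact_store):
    fortran_files = artefact_store[FORTRAN_BUILD_FILES]
    # ... work on fortran_files ...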

For now I have implemented the second way, but if you prefer the first, or something entirely different, I won't mind. Example:

from typing import List, Set, Union

class ArtefactStore:
    ...
    def add_fortran_build_files(self, files: Union[str, List[str], Set[str]]):
        self._add_files_to_artefact(self.FORTRAN_BUILD_FILES, files)

    def get_fortran_build_files(self):
        return self[self.FORTRAN_BUILD_FILES]

    def add_c_build_files(self, files: Union[str, List[str], Set[str]]):
        self._add_files_to_artefact(self.C_BUILD_FILES, files)

    def add_x90_build_files(self, files: Union[str, List[str], Set[str]]):
        self._add_files_to_artefact(self.X90_BUILD_FILES, files)
    ...

So if, for example, the .f90 files are copied into the build tree, they will be added using add_fortran_build_files, and the analysis and compilation stages can just use get_fortran_build_files to get all files, as can custom steps (which could e.g. remove and add files depending on what they do). There is no more need to pick files from three or four different places.
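As an illustration of that last point, a custom step could look roughly like this (rewrite_file is a hypothetical helper, and the accessor names are the ones proposed above; the returned collection is assumed to behave like a set):

# Sketch only: a custom step that rewrites some Fortran files and keeps the
# single Fortran artefact set up to date.
def my_rewriting_step(artefact_store, build_dir):
    for path in list(artefact_store.get_fortran_build_files()):
        new_path = rewrite_file(path, build_dir)  # hypothetical helper
        if new_path != path:
            artefact_store.get_fortran_build_files().remove(path)
            artefact_store.add_fortran_build_files(new_path)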

PSyclone would add its output files to the fortran build files as well.

Additionally, as indicated in the comment in the analysis step above, the PSyclone handling is very LFRic-specific, so separating it out will make the code much easier to understand.

hiker commented 2 months ago

@MatthewHambley , it would be great to have some feedback here, so I can get this ready next.

MatthewHambley commented 2 months ago

I've had a think about this and, as far as I can see, there are fairly straightforward pros and cons to the two approaches suggested above.

Keying artefact sets off strings

Easily extensible: just use a new string. But easy to get wrong; there is no validation of the key and no protection of "constants."

Set accessor methods

Means you can't get the access wrong, and validation is not needed, nor are "constants." Conversely, adding a new set requires the addition of a new accessor, which is less than ideal.

Set Objects

It occurred to me that there is a third option. If the sets themselves are managed as objects, the ArtefactStore becomes a collection of these ArtefactSet objects. If they are completely generic, then the store can have a new_set("name") method which acts as a factory returning a new set object.

Alternatively, if sets may be customised by inheritance, the store could have a register_set(artefact_set) method which allows it to take responsibility for a set.
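A rough sketch of this third option (class and method signatures are illustrative only, not an agreed API):

from pathlib import Path
from typing import Dict, Set


class ArtefactSet:
    """A named, generic collection of file paths."""
    def __init__(self, name: str):
        self.name = name
        self.files: Set[Path] = set()


class ArtefactStore:
    """A collection of ArtefactSet objects, keyed by their name."""
    def __init__(self):
        self._sets: Dict[str, ArtefactSet] = {}

    def new_set(self, name: str) -> ArtefactSet:
        # factory: create, register and hand back a fresh, generic set
        artefact_set = ArtefactSet(name)
        self._sets[name] = artefact_set
        return artefact_set

    def register_set(self, artefact_set: ArtefactSet):
        # alternative: take responsibility for a (possibly subclassed) set
        self._sets[artefact_set.name] = artefact_set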

hiker commented 2 months ago

Set accessor methods

Means you can't get the access wrong, and validation is not needed, nor are "constants." Conversely, adding a new set requires the addition of a new accessor, which is less than ideal.

Not necessarily; the ArtefactStore could provide custom functions for a fixed subset of categories. My main issue was:

DEFAULT_SOURCE_GETTER = CollectionConcat([
    SuffixFilter('all_source', '.f90'),
    'preprocessed_c',
    'preprocessed_fortran',

    # todo: this is lfric stuff so might be better placed elsewhere
    SuffixFilter('psyclone_output', '.f90'),
    'preprocessed_psyclone',  # todo: this is no longer a collection, remove
    'configurator_output',
])

So that's why I think it might be a good idea to have the compile steps use fixed names, so the analysis step just needs to use the C and Fortran file artefacts (and e.g. PSyclone output files should be added to the Fortran files etc.).

I need to run a custom step before running PSyclone (for some of my non-standard work), and I found it hard to even know where all the Fortran source files are ... since some of them are in all_source.

Set Objects

It occurred to me that there is a third option. If the sets themselves are managed as objects, the ArtefactStore becomes a collection of these ArtefactSet objects. If they are completely generic, then the store can have a new_set("name") method which acts as a factory returning a new set object.

Are we sure artefacts are always sets (assuming you mean Python sets, i.e. unordered)? While I can't think of an example, maybe there could be a need to store them sorted, or as a dictionary, ...?

Alternatively, if sets may be customised by inheritance, the store could have a register_set(artefact_set) method which allows it to take responsibility for a set.

Sure to both of these options, but as far as I can see that does not change the way we handle them in the ArtefactManager (i.e. do we get these ArtefactSets using a fixed method, or via a key/string/enum/... ?).

FWIW, I have started to implement the 'mix' approach: a set of functions for Fortran, C, and PSyclone source files; everything else just uses a string as a dictionary key, the way it is right now. Before cleaning this up, I thought I'd better check with you what you prefer ;)

My reason for introducing ArtefactStore as an object was to allow these functions, with some convenience support for adding an artefact as a string (single artefact), list, or set. I am otherwise happy either way; I can easily override __setitem__ to make sure that whatever we add (string, list, set, ...) is added the expected way.
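A sketch of the kind of __setitem__ override meant here, assuming each collection is stored as a set:

from typing import Iterable, Union


class ArtefactStore(dict):
    # Sketch only: normalise whatever is assigned - a single string, a list
    # or a set - into a set, so every collection has a uniform type.
    def __setitem__(self, key, value: Union[str, Iterable[str]]):
        if isinstance(value, str):
            value = {value}
        super().__setitem__(key, set(value))

With that in place, assigning a single string, a list, or a set to a key would always end up stored as a set.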

If we don't want to use hard-coded method names, I would suggest that we at least move all pre-defined keys as constants into the ArtefactStore class (instead of the file constants.py), since as far as I can see these names should only be used in connection with the ArtefactStore.
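For example (extending the same sketch), the pre-defined keys could simply live on the class; the values below reuse names that already appear in this issue:

class ArtefactStore(dict):
    # Pre-defined collection names kept with the class instead of constants.py
    PREPROCESSED_FORTRAN = 'preprocessed_fortran'
    PREPROCESSED_C = 'preprocessed_c'
    PSYCLONE_OUTPUT = 'psyclone_output'
    ...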

hiker commented 3 weeks ago

OK, since there was no decision, I have started to implement it as follows:

I will have a look at adding some artefacts to remove the need to combine source files from various sources. For example, analysis uses:

SuffixFilter('all_source', '.f90'),
'preprocessed_c',
'preprocessed_fortran',

# todo: this is lfric stuff so might be better placed elsewhere
SuffixFilter('psyclone_output', '.f90'),
'preprocessed_psyclone',  # todo: this is no longer a collection, remove
'configurator_output',

Pre-processed Fortran files are picked up from the build directory, while the .f90 files are still in all_source, etc.

So if I add another step that changes some Fortran files, I need to pick the files up from two different sources, and potentially create yet another artefact group (?? maybe).

Instead, my idea is to add (say) an 'all_fortran_sources' artefact set, which will include all .F90 and .f90 files. Pre-processing will then replace the .F90 files (from all_source) in that set with the preprocessed .f90 files in the build directory. If I add another step, I will do the same: loop over the files in all_fortran_sources, and if I modify a file, remove the old file and add the replacement instead.
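A sketch of that replacement logic (the set name is the one proposed above; the exact API may end up different):

# Sketch: after preprocessing, swap each .F90 entry in 'all_fortran_sources'
# for the preprocessed .f90 file written to the build directory.
def update_fortran_sources(artefact_store, build_dir):
    all_fortran = artefact_store['all_fortran_sources']
    for source in list(all_fortran):
        if source.suffix == '.F90':
            preprocessed = build_dir / source.with_suffix('.f90').name
            all_fortran.remove(source)
            all_fortran.add(preprocessed)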

Similarly, files created by other tools (PSyclone, configurator) will also be added. So hopefully, in the end, we will have only one artefact set with all the Fortran files to compile. I will (at least initially) not touch any existing artefact sets, since there might be other plans, dependencies, or use cases that I am not aware of; instead I will add new ones. We can then evaluate later whether some artefact sets can be removed.