populse / capsul

Collaborative Analysis Platform : Simple, Unifying, Lean

Defining process and pipeline schemas is too complex #293

Open denisri opened 10 months ago

denisri commented 10 months ago

The definition of completion schemas is not simple enough at the moment, and it suffers from several problems:

  1. ProcessMetadata is instantiated by the end user or application. Its constructor needs to be given a datasets parameter which specifies which dataset is assigned to each process parameter (input, output, shared...). In most situations the user doesn't know about these assignments and doesn't want to specify them. Moreover, a generic application cannot hold the list of all parameters of all processes for this purpose, so processes should at least provide a default for it.
  2. Some parameters in a process do not use all the schema metadata attributes in their filenames. Admittedly this is bad, and a "good" schema should use them all, but the current BrainVisa/Morphologist schema does not, and we need some compatibility with it. The filename generation function is agnostic to the parameter and thus does not bother about this problem: it always produces filenames with all attributes that have a value. So the ProcessSchema subclasses need to specify all this parameter by parameter, sometimes have to erase attribute values, and, even worse, sometimes have to set values back after they have been erased for other parameters. For example, in capsul.brainvisa.schemas we need to write things like:

      class ScalpMeshBrainVISA(ProcessSchema, schema='brainvisa',
                               process=ScalpMesh):
          _ = {
              '*': {'seg_directory': 'segmentation'},
          }
          head_mesh = {'seg_directory': 'segmentation/mesh', 'suffix': 'head',
                       'prefix': None, 'extension': 'gii'}
          # ...
    
      class SplitBrainBrainVISA(ProcessSchema, schema='brainvisa',
                                process=SplitBrain):
          _ = {
              '*': {'seg_directory': 'segmentation'},
          }
          split_brain = {'prefix': 'voronoi',
                         'analysis': lambda **kwargs:
                             f'{kwargs["initial_meta"].analysis}'}
          # ...

    In the 1st class here we need to erase prefix, while in the second we need to restore (by a very inconvenient trick) analysis, which has been erased earlier. We need to find a simpler syntax or system, maybe allowing us to say that the head_mesh parameter in ScalpMesh does not use the prefix attribute, rather than modifying it in an unintuitive way which has side effects later. This information would also be useful to simplify GUIs, so that attributes which are not used, or which are only defined and used internally (like the side attribute inside a Morphologist pipeline), are not displayed to users.

  3. Maybe for this reason, metadata defaults don't work the way they should: for instance the sulci_recognition_session attribute in the BrainVisa schema is only used by the identified sulci graph parameters in Morphologist and other processes, and not elsewhere. So it cannot have a default value (otherwise it would appear in the filenames of all parameters). Right now its value is set, forced, in the ProcessSchema for the labeled graph parameters, with the consequence that the user (or program) cannot change it at all (see the sketch after this list).
  4. Similarly, formats and extensions are forced here and there in process and pipeline parameter schemas, and the user cannot choose a format for outputs, except for volumes if the default is left untouched. We need to develop something to manage formats.
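
To make point 3 concrete, here is a hedged sketch of the kind of forced value currently needed; the SulciLabelling process and class names are made up, only the pattern follows the real capsul.brainvisa.schemas code:

class SulciLabellingBrainVISA(ProcessSchema, schema='brainvisa',
                              process=SulciLabelling):
    # Illustrative only: forcing the value here is currently the only
    # way to get the attribute into this parameter's filename, but it
    # also prevents the user (or program) from ever changing it
    labelled_graph = {'sulci_recognition_session': 'default_session'}
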
denisri commented 10 months ago

For 1., I see (reading the code) that a dataset name can be assigned to a parameter in its field metadata. I don't know if it's a complete solution, but at least we have something to do it with.
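
A minimal sketch of what this could look like, in the same field() annotation style as the rest of Capsul v3; the process and parameter names are made up, and whether field() accepts a dataset key this way is an assumption:

class MyProcess(Process):
    # 'dataset' in the field metadata tells completion which dataset
    # (input, output, shared...) the parameter belongs to
    input_image: field(type_=File, dataset='input')
    template: field(type_=File, dataset='shared')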

sapetnioc commented 10 months ago

There was a huge change in the process completion paradigm, and two changes make backward compatibility difficult:

We may have to reintroduce a kind of data type that would not capture all the diversity of the data but could define a metadata attribute usage model. Each model could contain some forced metadata values (allowing undefined values to be forced for attributes that are not to be used). In my opinion, these model names should be known by both pipelines and data schemas. In pipelines, they would be simple strings defined in fields. They would define a partition of parameters according to their metadata needs for path generation. Each dataset schema would contain an actual definition of the attribute usage rules for each model name.
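
A purely illustrative sketch of this proposal; none of these names exist in Capsul, and the shape of the rules is an assumption:

# In a process, a parameter would only name its usage model, e.g.:
#     labelled_graph: field(type_=File, metadata_model='labelled_graph')

# Each dataset schema would then map model names to attribute usage rules:
usage_models = {
    'labelled_graph': {
        # attributes this model uses in path generation
        'used': ['sulci_graph_version', 'sulci_recognition_session'],
    },
    'raw_volume': {
        # forced undefined value for an attribute that must not be used
        'forced': {'sulci_recognition_session': None},
    },
}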

For point 1, the default dataset of a parameter is either input or output. My idea was that the pipeline creator would specify the dataset name for other values (for instance spm for SPM templates, or shared for values taken from shared data). I do not know if there are cases that are not covered by that.

And point 4 is not implemented yet. We need to put formats back into Capsul v3 and link them to the configuration to get default values.

denisri commented 10 months ago

Point 1 should be resolved via fields metadata, yes; I'm currently trying this solution. For the other points (2 and 3), we could also provide the schema with a list of used or unused metadata attributes for a given parameter; it could be a relatively cheap solution. Would this be OK? Data types may be another approach, but we decided long ago to drop this notion, and Capsul < 3 already did not have it and was working anyway, so it should be possible not to reintroduce it.

sapetnioc commented 10 months ago

Used/unused attributes per parameter is OK; this is what I was talking about (as well as forced attribute values). I was just thinking that this definition could be repetitive for the many parameters sharing the same metadata attribute usage, so I wonder if it could be interesting to store some metadata usage profiles behind a single name to ease their reuse across several parameters. But I do not have a concrete example such as the Morphologist pipeline in front of me, so it may be useless to go beyond an explicit definition of metadata usage for each parameter.

denisri commented 10 months ago

I can try something, and will tell you if it's too tedious...

denisri commented 10 months ago

It's more or less working, but it's still somewhat painful to write the rules, and even worse to read them afterwards. I have not pushed my changes yet because I'm not very happy with them (and I have not finished everything). I'd like to find an easier way to write the rules, even if the underlying system is unchanged. The Morphologist rules are about 1000 lines of code (and they are still not complete), which suggests that this is not optimal. We cannot ask programmers to write that...

denisri commented 10 months ago

Point 1 is OK, I have implemented it for the Morphologist pipeline and subprocesses.

For points 2/3, I have implemented the used/unused metadata per parameter system. A ProcessSchema subclass can declare a variable metadata_per_parameter which specifies, for parameters matched by wildcard patterns, which metadata should or should not be used.
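
A hedged sketch of what such a declaration might look like; the exact structure of metadata_per_parameter is an assumption based on the description above:

class MorphologistBrainVISA(ProcessSchema, schema='brainvisa',
                            process=Morphologist):
    # wildcard parameter patterns mapped to the metadata attributes
    # they do or do not use (hypothetical structure)
    metadata_per_parameter = {
        '*': {'unused': ['sulci_recognition_session']},
        '*labelled_graph*': {'used': ['sulci_recognition_session']},
    }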

However there are advantages and drawbacks:

Advantages:

Drawbacks:

sapetnioc commented 9 months ago

I've got an idea that could simplify the definition of schemas. It's still in its infancy, and we'll have to work on it to see if it's viable. In my opinion, the difficulty in defining ProcessSchema derived classes comes from the large number of possibilities that we want to offer the pipeline developer. So rather than inventing some kind of language encoded in dictionaries, we could simply ask the developer to provide a single method that modifies the metadata of all the pipeline parameters. The simplest methods, which merely set some metadata values, could look like this:

class MyProcessBids(ProcessSchema, 
                    schema='bids',
                    process=MyProcess):
    @staticmethod
    def set_metadata(metadata):
        metadata.a_parameter.suffix = 'a_suffix'
        metadata.another_parameter.suffix = 'another_suffix'

In that case, metadata would be a special object that eases writing metadata modifications. For instance, we could allow wildcards, or allow declaring that some metadata field is not used on some parameters, with the following syntax:

        metadata['*'].process = 'MyProcess'
        metadata['image_*'].extension = 'nii'

        metadata.a_parameter.normalization.ignore = True

We could also imagine that these methods receive existing metadata (for instance, metadata computed by the inner processes of a pipeline) and use it to propagate metadata values from an input parameter to an output parameter:

        metadata.output_parameter.side = metadata.input_parameter.side

I think we should try to write what this method could be for the most complex cases in Morphologist, to see whether it is an interesting idea or not.

sapetnioc commented 8 months ago

I am working on the simplification of process schema definition and have a question. Do we agree that it should not be possible to customize path generation inside a pipeline? If one links a node output to another node input without exporting it, it is supposed to be an internal temporary value, in which case path generation is useless. Therefore, a pipeline should only be allowed to customize the path generation of its external inputs and outputs, but not of its internal plugs.

denisri commented 8 months ago

I globally agree. A pipeline should be a black box which doesn't leave behind files not described in its parameters. However, if I follow your meaning, do you plan to use completion only at the pipeline top level? A process inside a pipeline may have its own completion schema, which is useful for the pipeline if outputs are correctly exported, so "inside" completion should probably run anyway; but if it does, what prevents it from assigning output filenames to non-exported parameters? Well, maybe the generate_temporary property will override it. The drawback then is that path generation will run even when not needed.

sapetnioc commented 8 months ago

I plan to call the process schema function on inner nodes first, so that a pipeline can use the result for its own parameters. The schema only provides metadata, not file names. At the end of the process, I plan to consider only the outer pipeline parameters when performing path generation. I think inner parameters that are not exported to the main pipeline should either be temporary or not written by a process, because those files will be out of scope for any further processing by Capsul (no metadata, no history, no visualization, etc.).
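
An illustrative sketch of this two-phase ordering; none of these function names are real Capsul API, they just mirror the plan above:

def complete(pipeline, schema):
    # phase 1: apply process schemas to inner nodes, producing
    # metadata only, never file names
    for node in pipeline.all_nodes():              # hypothetical call
        schema.apply_process_schema(node)
    # phase 2: generate paths for outer pipeline parameters only;
    # non-exported inner parameters stay temporary or unwritten
    for param in pipeline.exported_parameters():   # hypothetical call
        param.path = schema.generate_path(param.metadata)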

sapetnioc commented 8 months ago

In the new schema definition branch, I added the possibility to define a boolean used metadata attribute for each schema field. It is used as the default value when deciding whether to ignore the field in path generation. This makes it possible to declare fields that have a default value but are ignored by default (unless explicitly marked as used in a process schema). For instance, with the following declarations there is no need to do anything about sulci_graph_version or sulci_recognition_session in process schemas unless we want to use these fields:

class BrainVISASchema(MetadataSchema):
    ...
    sulci_graph_version: field(type_=str, default="3.1", used=False)
    sulci_recognition_session: field(type_=str, default="default_session", used=False)
    ...
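
To connect this back to point 3, a process schema that does want these attributes would then mark them as used; a hedged sketch, since the exact override syntax is an assumption (reusing the metadata_per_parameter idea from above, with made-up names):

class SulciLabellingBrainVISA(ProcessSchema, schema='brainvisa',
                              process=SulciLabelling):
    # hypothetical: re-enable the field for this parameter only, while
    # its default value ("default_session") remains user-modifiable
    metadata_per_parameter = {
        '*labelled_graph*': {'used': ['sulci_recognition_session']},
    }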