denisri opened 10 months ago
For 1., I see (reading the code) that a dataset name can be assigned to a parameter in its field metadata. I don't know if it's a complete solution, but at least it gives us something to work with.
There was a huge change in the process completion paradigm, and two changes make backward compatibility difficult:
We may have to reintroduce a kind of data type that would not capture all the diversity of the data but could define a metadata attribute usage model. Each model could contain some forced metadata values (allowing to force undefined values for attributes that are not to be used). In my opinion, these model names should be known by both pipelines and data schemas. In pipelines, they would be simple strings defined in fields. They would define a partition of parameters according to their metadata needs for path generation. Each dataset schema would contain an actual definition of attribute usage rules for each model name.
For point 1, the default dataset of a parameter is either `input` or `output`. My idea was that the pipeline creator would specify the dataset name for other values (for instance `spm` for SPM templates, or `shared` for values taken from shared data). I do not know if there are cases that are not covered by that.
And point 4 is not implemented yet. We need to put formats back into Capsul v3 and link them to the configuration to get default values.
Point 1 should be resolved in fields metadata, yes; I'm currently trying this solution. For the other points (2 and 3), we could also provide the schema with a list of used or unused metadata attributes for a given parameter. It could be a relatively cheap solution; would this be OK? Data types may be another approach, but we decided long ago to drop this notion, and Capsul < 3 did not have it either and was working anyway, so it may be possible not to reintroduce it.
Used/unused attributes per parameter is OK; this is what I was talking about (as well as forced attribute values). I was just thinking that this definition could be repetitive for many parameters sharing the same metadata attribute usage. So I wonder whether it would be interesting to store some metadata usage profiles behind a single name, to ease their reuse across several parameters. But I do not have a concrete example such as the Morphologist pipeline in front of me, so it may be useless to go beyond an explicit definition of metadata usage for each parameter.
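The "profiles behind a single name" idea could be as simple as an indirection table. A sketch, with invented profile and parameter names (none of this is existing Capsul code):

```python
# Hypothetical named metadata-usage profiles, so that several parameters
# can share one usage definition instead of repeating it.
METADATA_PROFILES = {
    'subject_image': {'unused': ['sulci_graph_version',
                                 'sulci_recognition_session']},
    'sulci_graph': {'unused': ['sulci_recognition_session']},
    'labelled_graph': {'unused': []},
}

# Parameters then reference a profile by name:
parameter_profiles = {
    't1mri': 'subject_image',
    'left_graph': 'sulci_graph',
    'left_labelled_graph': 'labelled_graph',
}

def unused_metadata(parameter):
    """Look up the unused metadata list of a parameter via its profile."""
    return METADATA_PROFILES[parameter_profiles[parameter]]['unused']
```

Whether the indirection pays off depends on how many parameters actually share a profile, which is exactly the open question above.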
I can try something, and tell you if it's too tedious...
It's more or less working, but it's still somewhat painful to write the rules, and even worse to read them afterwards. I have not pushed my changes yet because I'm not very happy with them (and I have not finished everything). I'd like to find an easier way to write them, even if the underlying system is not changed. The Morphologist rules are about 1000 lines of code (and they are still not complete), which suggests that this is not optimal. We cannot ask programmers to write that...
Point 1 is OK, I have implemented it for the Morphologist pipeline and subprocesses.
For points 2/3, I have implemented the used/unused metadata per parameter system. A `ProcessSchema` subclass can declare a variable `metadata_per_parameter` which specifies, for parameters (with wildcard patterns), metadata which should or should not be used.
However there are advantages and drawbacks.

Advantages:

- some metadata can be used or not on a per-parameter basis (ex: `sulci_graph_version`, which is only used in sulci graphs, not in other outputs)
- writing `ProcessSchema` parameters metadata is easier for this reason

Drawbacks:

- `ProcessSchema` subclasses must declare a set of unused metadata applied to all parameters, then re-declare specializations for parameters which use more metadata. For example, in Morphologist most parameters do not use `sulci_graph_version` and `sulci_recognition_session`. These metadata now have a default value, so all parameters except graphs should add them to their unused metadata:
```python
metadata_per_parameter = {
    '*': {'unused': ['subject_only', 'sulci_graph_version',
                     'sulci_recognition_session']},
    '*_graph': {'unused': ['subject_only',
                           'sulci_recognition_session']},
    '*_labelled_graph': {'unused': ['subject_only']},
}
```
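For readers unfamiliar with the wildcard patterns above: they could be resolved per parameter with something like `fnmatch`. Below is a sketch of one plausible resolution rule (most specific, i.e. longest, matching pattern wins); that rule is my assumption for illustration, not necessarily what Capsul actually does.

```python
from fnmatch import fnmatch

metadata_per_parameter = {
    '*': {'unused': ['subject_only', 'sulci_graph_version',
                     'sulci_recognition_session']},
    '*_graph': {'unused': ['subject_only', 'sulci_recognition_session']},
    '*_labelled_graph': {'unused': ['subject_only']},
}

def unused_for(parameter):
    """Return the 'unused' list of the most specific matching pattern.
    'Most specific = longest pattern' is an assumption for this sketch."""
    matches = [p for p in metadata_per_parameter if fnmatch(parameter, p)]
    best = max(matches, key=len)
    return metadata_per_parameter[best]['unused']
```

So a plain parameter falls back to the `'*'` entry, while `left_labelled_graph` matches all three patterns and gets the most specific one.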
- this set of unused metadata must be repeated in every `ProcessSchema` subclass defined for this schema. In other words, it is painful and non-incremental: we must modify all existing `ProcessSchema`s when modifying the schema, we cannot transparently extend it.
- all of this has to be declared in a `ProcessSchema`, which is not very easy for inexperienced pipeline developers.

I've got an idea that could simplify the definition of schemas. It's still in its infancy and we'll have to work on it to see if it's viable. In my opinion, the difficulty in defining `ProcessSchema`-derived classes comes from the large number of possibilities that we want to offer the pipeline developer. So rather than inventing some kind of language encoded in dictionaries, we could simply ask the developer to provide a single method that would modify the metadata of all the pipeline parameters. The simplest methods, which simply set some metadata values, could look like this:
```python
class MyProcessBids(ProcessSchema,
                    schema='bids',
                    process=MyProcess):
    @staticmethod
    def set_metadata(metadata):
        metadata.a_parameter.suffix = 'a_suffix'
        metadata.another_parameter.suffix = 'another_suffix'
```
In that case `metadata` would be a special object designed to ease the writing of metadata modifications. For instance, we could allow wildcards, or allow declaring that some metadata fields are not used on some parameters, with the following syntax:
```python
metadata['*'].process = 'MyProcess'
metadata['image_*'].extension = 'nii'
metadata.a_parameter.normalization.ignore = True
```
We could also imagine that these methods receive metadata (for instance computed by inner processes of a pipeline) and can use it to propagate metadata values from an input parameter to an output parameter:
```python
metadata.output_parameter.side = metadata.input_parameter.side
```
I think we should try to write what this method could look like for the most complex cases in Morphologist, to see whether it is an interesting idea or not.
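As a feasibility check, the "special object" described above can be sketched in a few lines of plain Python: attribute assignment through a wildcard lookup fans out to all matching parameters. This is my own toy illustration of the proposed syntax, not Capsul code.

```python
import fnmatch

class _ParamMetadata:
    """Plain attribute bag holding one parameter's metadata."""
    pass

class _Broadcast:
    """Assigning an attribute here assigns it on every matched parameter."""
    def __init__(self, targets):
        object.__setattr__(self, '_targets', targets)

    def __setattr__(self, name, value):
        for t in self._targets:
            setattr(t, name, value)

class MetadataProxy:
    """Toy version of the 'metadata' object proposed above: supports both
    metadata.some_parameter.attr = v and metadata['pattern'].attr = v."""
    def __init__(self, parameters):
        self._params = {name: _ParamMetadata() for name in parameters}

    def __getitem__(self, pattern):
        matching = [m for n, m in self._params.items()
                    if fnmatch.fnmatch(n, pattern)]
        return _Broadcast(matching)

    def __getattr__(self, name):
        # Direct access by exact parameter name.
        return self._params[name]

# Proposed syntax in action:
metadata = MetadataProxy(['image_t1', 'image_t2', 'mesh'])
metadata['image_*'].extension = 'nii'   # both image parameters get it
metadata.mesh.suffix = 'head'           # single-parameter assignment
```

The wildcard and propagation syntaxes from the comment above both fit this shape, which suggests the object itself is not the hard part; the Morphologist test would tell whether the method bodies stay readable.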
I am working on the simplification of process schema definition and have a question. Do we agree that it should not be possible to customize path generation inside a pipeline? If one links a node output to another node input without exporting it, this is supposed to be an internal temporary value. In that case path generation is useless. Therefore, a pipeline should only be allowed to customize path generation of its external inputs and outputs, but not of its internal plugs.
I globally agree: a pipeline should be a black box which doesn't leave files not described in its parameters.
However, if I follow your meaning, do you plan to use completion only at the pipeline's top level?
A process inside a pipeline may have its own completion schema, which is useful for the pipeline if outputs are correctly exported, so "inside" completion should probably run anyway. But if it does, what prevents it from assigning output filenames to non-exported parameters? Well, maybe the `generate_temporary` property will override it. The drawback then is that path generation will run even when not needed.
I plan to call the process schema function on inner nodes first, so that a pipeline can use the result for its own parameters. The schema only provides metadata, not file names. At the end of the process I plan to consider only the outer pipeline parameters to perform path generation. I think inner parameters that are not exported to the main pipeline should either be temporary or not written by a process, because these files will be out of scope for any further processing by Capsul (no metadata, no history, no visualization, etc.).
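The planned ordering can be summarized in a small runnable sketch. Everything here (the dict-based pipeline, the toy `generate_path`) is my own illustration of the described order, not the actual Capsul implementation.

```python
# Toy sketch of the completion order described above:
# 1. inner node schema functions run first,
# 2. the pipeline schema can reuse their metadata,
# 3. paths are generated only for exported (outer) parameters.

def generate_path(metadata):
    """Toy path generation: join metadata values with underscores."""
    return '_'.join(str(v) for v in metadata.values()) + '.nii'

def complete(pipeline):
    # Step 1: inner metadata first.
    inner_metadata = {name: schema_fn()
                      for name, schema_fn in pipeline['inner_nodes'].items()}
    # Step 2: pipeline-level metadata, possibly derived from inner results.
    params = pipeline['schema_fn'](inner_metadata)
    # Step 3: only exported parameters get a path; the rest stay temporary.
    return {p: generate_path(m) for p, m in params.items()
            if p in pipeline['exported']}

pipeline = {
    'inner_nodes': {'seg': lambda: {'side': 'left'}},
    'schema_fn': lambda inner: {
        'out_graph': {'subject': 'subj01', 'side': inner['seg']['side']},
        'tmp': {'subject': 'subj01'},   # internal, not exported
    },
    'exported': {'out_graph'},
}
```

The non-exported `tmp` value gets no path at all, matching the idea that such values should be temporary.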
In the new schema definition branch, I added the possibility to define a boolean `used` metadata for each schema field. It is used as the default value to decide whether to ignore this field in path generation. This allows declaring fields that have a default value but are ignored by default (unless explicitly marked as used in a process schema). For instance, with the following declarations there is no need to do anything about `sulci_graph_version` or `sulci_recognition_session` in process schemas unless we want to use these fields:
```python
class BrainVISASchema(MetadataSchema):
    ...
    sulci_graph_version: field(type_=str, default='3.1', used=False)
    sulci_recognition_session: field(type_=str, default='default_session',
                                     used=False)
    ...
```
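To illustrate the effect of the `used` flag on path generation, here is a minimal sketch (my own toy code, not the branch's actual implementation): a field with `used=False` keeps its default value but is skipped when building a name, unless a process schema explicitly re-enables it.

```python
# Toy illustration of 'used' defaults in path generation (not Capsul code).
from dataclasses import dataclass

@dataclass
class SchemaField:
    default: str
    used: bool = True

SCHEMA = {
    'subject': SchemaField(default=''),
    'sulci_graph_version': SchemaField(default='3.1', used=False),
    'sulci_recognition_session': SchemaField(default='default_session',
                                             used=False),
}

def build_name(values, used_overrides=()):
    """Join field values, skipping fields whose 'used' flag is False
    unless a process schema explicitly marked them as used."""
    parts = []
    for name, f in SCHEMA.items():
        if f.used or name in used_overrides:
            parts.append(values.get(name, f.default))
    return '_'.join(p for p in parts if p)
```

With this rule, most parameters never mention the graph fields, and only graph parameters opt back in via an override.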
Completion schema definition is not simple enough at the moment, and it suffers from several problems:
- `ProcessMetadata` is instantiated by the end user or application. Its constructor needs to be given a `datasets` parameter which specifies which dataset is assigned to each process parameter (`input`, `output`, `shared`...). In most situations the user doesn't know about it and doesn't want to specify it. Moreover, a generic application cannot have the list of all parameters of all processes for that, so processes should provide at least a default for this.
- Some parameters in a process do not use all the schema metadata attributes in their filenames. OK, this is bad and a "good" schema should use them all, but the current BrainVisa/Morphologist schema does not, and we need some compatibility with it. The filename generation function is agnostic to the parameter and thus doesn't bother about this problem: it always produces filenames with all attributes if they have a value. So the `ProcessSchema` subclasses need to specify all this, parameter by parameter, and sometimes have to erase attribute values, and even worse, sometimes have to set values back after they have been erased for other parameters. For example, in `capsul.brainvisa.schemas` we need to write things like the following: in the first class we need to erase `prefix`, while in the second we need to restore (by a very inconvenient trick) `analysis`, which has been erased earlier. We need to find a simpler syntax or system, maybe allowing to tell that the `head_mesh` parameter in `ScalpMesh` does not use the `prefix` attribute, rather than modifying it in an unintuitive way, which has side effects later. This info would also be useful to simplify GUIs and avoid displaying to the users attributes which are not used, or are only internally defined and used (like the `side` attribute inside a `Morphologist` pipeline).
- The `sulci_recognition_session` attribute in the BrainVisa schema is only used in identified sulci graph parameters in Morphologist and other processes, not in others. So it cannot have a default value (otherwise it would appear in all filenames of all parameters). Right now its value is set, forced, in the `ProcessSchema` for the labeled graph parameters, but then the user (or program) cannot change it at all.