populse / capsul

Collaborative Analysis Platform : Simple, Unifying, Lean

Is it really necessary to deactivate a node when not all outputs are exported? #252

Closed servoz closed 1 year ago

servoz commented 1 year ago

As I said in the last meeting, unless I am mistaken, currently if the user does not export the output of a brick/node, that brick/node is automatically disabled (carpetparcellation).

Screenshot from 2022-10-28 14-24-08

This is not the case if there are two outputs and the user exports at least one of the outputs (tsnr).

Screenshot from 2022-10-28 14-26-33

I am not sure this is correct. Indeed, in the case of large pipelines, the user may want the brick/node to be run (so that its output is created) without that output being visible among the outputs of the pipeline encompassing the brick/node, in order to avoid having too many inputs and outputs on the global pipeline.

In my opinion, a brick/node should only be disabled when a mandatory input is not exported (i.e. data needed for the computation would be missing), regardless of its outputs.

The question is: should we expose all input and output parameters, at the risk of having unnecessarily heavy global bricks/nodes for very large pipelines?

denisri commented 1 year ago

Well, this is really the behaviour we expected (and designed). The logic here is that if a process does not produce any output which is used (either by a downstream node, or in the main pipeline outputs), then it does not need to run. The mechanism which disables it is exactly the mechanism which allows disabling unused branches when a switch is used, so if we change it, switches will not work any longer.

On the other hand, I understand your point, but if the output is not connected, then it's not possible for the user of the whole pipeline, when it is used as a node in a larger pipeline, to set (or even read) this node parameter, which is not visible. So either the process will not write anything (and thus should actually not need to run: the current behaviour), or it will write a file with a hard-coded name, or a filename decided internally by the process, which the user will not know about. This is:

1. not clean, as files will be written in places we don't really know about;
2. not compatible with any kind of data workflow tracking: database indexing will not index this "invisible" file [well, this is perhaps done anyway in MIA because all sub-processes are parsed for data indexation], data transfers will not transfer it, it will not have any provenance or history tracking, etc.

So the clean solution is: really export it. It makes pipelines somewhat heavier, yes, that's true, but when it references actually produced and existing files, it is the right thing. One rule we have implicitly adopted in Capsul's design (and even said explicitly, but maybe not written ;) ) is that a process or pipeline should not write files that are not referenced in its main parameters.
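To make the export rule concrete, here is a minimal, hypothetical sketch of a Capsul v2 `pipeline_definition` (the process module paths and plug names are invented for illustration): exporting at least one output of each node keeps it active, while a node with no connected and no exported output would be deactivated by this mechanism.

```python
from capsul.api import Pipeline


class ExamplePipeline(Pipeline):
    """Hypothetical two-node pipeline illustrating the export rule."""

    def pipeline_definition(self):
        # module paths below are placeholders, not real process paths
        self.add_process('tsnr', 'mypackage.processes.TSNR')
        self.add_process('carpet', 'mypackage.processes.CarpetParcellation')

        # internal link between the two bricks
        self.add_link('tsnr.out_file->carpet.in_file')

        # exporting at least one output keeps each node active; a node
        # whose outputs are all unconnected and unexported is disabled
        self.export_parameter('carpet', 'out_file', 'carpet_out')
        self.export_parameter('tsnr', 'stddev_file', 'tsnr_stddev')
```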

servoz commented 1 year ago

OK.

That's one point of view. There are quite a few processes that generate additional files beyond their declared outputs. And the fact that we can leave an output unexported, as long as at least one other output is exported, means the rule is not immutable ...

We already live with that.

Honestly, I am working on a pipeline where even I am frightened when I look at the corresponding global node with all parameters exposed... I'm afraid that users won't understand anything at first, especially since the purpose of the pipeline is to take a single input and generate a single output (all the other data are not the purpose of this pipeline; they are mostly just generated along the way to create the final data).

sapetnioc commented 1 year ago

No, you do not have to expose all files created by a process. Exposing only one "useful" output is legitimate. But remember that pipelines are supposed to be executable on a remote machine with a different file system (using a single machine should be seen as a special case). Therefore all internal files written by a process must be temporary, and the process should always clean them up before exiting (even in case of exception). As a consequence, for Capsul, the only things produced by a process are its exposed outputs. Therefore, if none of the outputs are used (either connected to an active node or exported as an outer pipeline output), Capsul considers that the process does not have to be executed.
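As an illustration of this cleanup discipline (this is not code from Capsul itself; `compute_intermediate` and `write_final_output` are hypothetical helpers standing in for the real computation), a process working in a local scratch directory could look like:

```python
import shutil
import tempfile


def run_my_process(in_file, out_file):
    # internal scratch space: nothing in here is declared to Capsul,
    # so it must not survive the process, even on failure
    tmpdir = tempfile.mkdtemp(prefix='my_process_')
    try:
        intermediate = compute_intermediate(in_file, tmpdir)
        write_final_output(intermediate, out_file)
    finally:
        # always clean internal files before exit, even on exception
        shutil.rmtree(tmpdir, ignore_errors=True)
```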

But if you want to cheat a little bit, you do not really need to export all of your process parameters to the pipeline. Only one is enough. However, as Denis said, hidden files will never be taken into account by any databasing or transfer system.

That said, your comment about processes with many irrelevant outputs has to be taken into account. You say it is painful for the developer, who must declare many parameters, and for the user, who will be lost among useless parameters.

For the user, we could deal with that as in BrainVISA. A user level is associated with each parameter as metadata (for instance, 0 = basic, 1 = advanced, 2 = expert, 3 = debug), and the user selects a user level in the application configuration. The GUI then shows only parameters whose user level is less than or equal to the user's level. But this is not implemented in MIA. A sketch of the filtering rule follows.
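For illustration only (as said above, this is not what MIA implements), a minimal sketch of such a filtering rule, assuming a Capsul v2 `Controller`-like process with `user_traits()` and a `userlevel` trait metadata:

```python
USER_LEVELS = {0: 'basic', 1: 'advanced', 2: 'expert', 3: 'debug'}


def visible_parameters(process, configured_level):
    """Names of the parameters a user at `configured_level` should see."""
    visible = []
    for name in process.user_traits():
        trait = process.trait(name)
        # traits without an explicit userlevel are treated as basic (0)
        level = getattr(trait, 'userlevel', None) or 0
        if level <= configured_level:
            visible.append(name)
    return visible
```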

For the developers, we could use a directory parameter that would be set (possibly as an output by the process itself in Capsul v3) to a directory containing all unexposed outputs. Or we could introduce processes with side effects that would never be deactivated, and maybe restrict their usage to some contexts (such as a local machine).
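A hedged sketch of what the first option could look like (this is a design proposal, not existing Capsul behaviour; the class and trait names are invented):

```python
from traits.api import Directory, File

from capsul.api import Process


class ProcessWithSideOutputs(Process):
    """Hypothetical process exposing one directory for all secondary files."""

    def __init__(self):
        super(ProcessWithSideOutputs, self).__init__()
        self.add_trait('in_file', File())
        # the single "useful" output
        self.add_trait('out_file', File(output=True))
        # one declared parameter standing for every unexposed output,
        # so databasing/transfer can still see where they are written
        self.add_trait('extra_outputs_dir', Directory(output=True))
```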

servoz commented 1 year ago

It's true that this cleaning thing is a bit foreign to me at the moment because I work exclusively locally.

But given the time needed for the pipeline I am currently working on, it will certainly be necessary to move to a large server if we want to do the calculation on several patients!

So it's not possible to retrieve the non-exported data if we do the computation remotely? That's perhaps a pity. Wouldn't it be a good idea to retrieve everything into the output_directory?

Well, it's true that I haven't looked at the machinery in the case of a remote calculation yet. It will be necessary to do it ASAP!!!

Yes, I thought about the user_level parameter, but in Mia it's hard-coded in the process constructor: if we use it there, the parameter will always be hidden (whereas here we want the parameter hidden only in some cases).

Anyway, I have already managed to reduce the number of exposed parameters and the pipeline should not scare users too much anymore, now :-))

However I think this issue is real. We can indeed live with it, but I think it is real.

denisri commented 1 year ago

Actually, if at least one output is exported, then the process needs to run in order to produce this output, and thus will not be disabled. And, yes, non-exported output parameters (files) will be written, provided they have a value. If we go back on this behaviour and allow processes with all outputs unconnected to remain enabled, then the whole node activation system will be broken, and switches and other node-disabling features will not work any longer. So I don't think it is a really good option. The pipeline developer may also use parameter groups to "hide" less important parameters from users.

An alternative would be to allow disabling only processes with no output value (not just unconnected ones), but the activation system currently implemented uses connections, not values, because values are set later, once the pipeline structure (the set of activated nodes) is established. And then a process with a default output value would never be disabled, even as part of a branch which should otherwise be disabled (via a switch or another means): this would be a substantial redesign of Capsul features. Moreover, a pipeline could no longer be seen as a black box with all its inputs/outputs visible. And, once again, indexation and transfer would become, at the least, more complex.

Of course all this can be discussed (it would not be the first, nor the last, redesign of Capsul... ;) ), but I think having a way to know what a brick or pipeline actually writes is not a bad thing, so these parameters should actually be parameters, though maybe hidden at first sight until we ask to see them. And @servoz is right: if some outputs are connected, but not all of them, and the unconnected ones still get a value, then there are actually files which get written outside of user control, so the logic we have described is already defeated...

servoz commented 1 year ago

Thank you for all your answers and information! I need to think about all that you have written :-)) I'll reopen this ticket and come back a little later.

I'm sure it won't be for another week or two, as I'm on vacation next week and the following week will be very busy.

But I will come back here because I think it is an important and interesting topic!

servoz commented 1 year ago

I apologize, I really thought I'd come back to this ticket earlier ...

I've just reread the various comments.

To summarize:

I fully understand the technical difficulties and the reluctance to change things (when something works, it's better to think carefully before changing it, because the improvement may end up being a global regression!).

However, I'm still convinced that it can be interesting for a process to generate a file without it being exposed in the global pipeline. This is already the case for some files with SPM/nipype (and it's even worse there, because they are generated totally hidden, without the output even being exposed in the process!). What interests me is simply to reduce the number of pipeline outputs (which really makes sense for very large pipelines) while keeping the possibility for the user to see what is going on inside, for example by editing the sub-pipeline (right click on the pipeline node).

Well, maybe one solution to avoid changing things too much is to use the userlevel metadata. We already use it in Mia in a general way (for example, for the matlab command that we don't want to expose) and in mia_processes when we don't want a parameter to be exposed in the controller (we define it in the process constructor). Perhaps the pipeline developer could change the userlevel of the corresponding trait when defining the pipeline in the pipeline_definition method? That sounds like a pretty interesting way ... I'll have to give it a try, as I've never used this metadata like this before. I'm still having a bit of trouble figuring out what it will do in the end... I'll test it and let you know.

denisri commented 1 year ago

userlevel just hides parameters from the GUI, and does not change any activation mechanism or runtime behaviour. So it might be the right solution.

servoz commented 1 year ago

This is indeed the best solution.

I just tested it: in the pipeline class containing the process whose output we want to hide, we can use something like `self.nodes["process_name"].process.trait('out_plug_name').userlevel = 1`.
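In context, a hypothetical `pipeline_definition` using this (the process path and plug names are placeholders):

```python
from capsul.api import Pipeline


class MyPipeline(Pipeline):
    def pipeline_definition(self):
        # 'mypackage.MyProcess' and the plug names are placeholders
        self.add_process('process_name', 'mypackage.MyProcess')
        # hide this output from the GUI controller without disconnecting
        # it: activation and runtime behaviour are unchanged, only display
        self.nodes['process_name'].process.trait('out_plug_name').userlevel = 1
```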

OK, at first I was looking for a solution for users; this is more a solution for developers (you have to look at the code), but I'm fine with it. Anyway, if this need proves really useful, we can imagine adding a GUI option for it. But I don't think that's needed right now.

It does the job perfectly with Mia's V2 controller.

The only issue I can see is that with the V1 controller, the output remains visible in the GUI controller (the parameter is only hidden in the node view).

But that's another problem, and we'll have to fix it in populse_mia (V1 is specific to populse_mia, while V2 is delegated to capsul).

For me it's fine. I think we can close this ticket!