servoz opened this issue 1 month ago
I think this answer is also related to #11.
In my opinion, the data related to datasets and the data used during execution are two different objects. The former is permanent and should contain all information about what happened to existing files. The execution database contains an up-to-date processing execution status; it is temporary and is erased once execution is done and the execution data has been included in the datasets data. This last transfer step between execution data and datasets data is what you called post-processing.
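Something like this, as a purely illustrative sketch (none of these classes or functions exist in Capsul or MIA, the names are made up to show the separation and the transfer step):

```python
# Hypothetical sketch: two separate stores and the post-processing transfer.

class DatasetsDB:
    """Permanent store: everything that happened to existing files."""

    def __init__(self):
        self.metadata = {}  # path -> metadata dict

    def record(self, path, meta):
        self.metadata.setdefault(path, {}).update(meta)


class ExecutionDB:
    """Temporary store: up-to-date status of one execution, erased afterwards."""

    def __init__(self):
        self.status = "not_started"
        self.produced = {}  # path -> metadata produced during execution

    def erase(self):
        self.status = "disposed"
        self.produced.clear()


def post_process(execution_db, datasets_db):
    """Transfer execution results into the permanent datasets store,
    then erase the temporary execution data."""
    assert execution_db.status == "done"
    for path, meta in execution_db.produced.items():
        datasets_db.record(path, meta)
    execution_db.erase()
```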
In Capsul 3 there is already a temporary execution database (internally it used to be Redis; now it is a populse_db Storage, hence an SQLite file) that can be accessed by both the client and the workers. Cleaning the execution data requires two conditions: the client must have called dispose() on the execution, saying that it does not need the data anymore, and the execution must be finished. This cleaning is done either in worker code, if dispose() was called before the end of execution, or in user code, if the user called dispose() after the end of the process execution.
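A rough sketch of this two-condition cleanup, assuming a simple in-memory flag model rather than the real populse_db Storage schema; whichever side (client or worker) fulfils the second condition performs the cleanup:

```python
import threading


class ExecutionRecord:
    """Illustrative only: the real Capsul 3 execution database is a
    populse_db Storage (an SQLite file), not this in-memory object."""

    def __init__(self):
        self._lock = threading.Lock()
        self.disposed = False   # client called dispose()
        self.finished = False   # workers finished the execution
        self.cleaned = False

    def dispose(self):
        """Called by the client: it no longer needs the execution data."""
        with self._lock:
            self.disposed = True
            self._maybe_clean()

    def mark_finished(self):
        """Called by worker code when the execution ends."""
        with self._lock:
            self.finished = True
            self._maybe_clean()

    def _maybe_clean(self):
        # Cleanup happens only when both conditions hold; whichever side
        # (client or worker) fulfils the second one does the work.
        if self.disposed and self.finished and not self.cleaned:
            self.cleaned = True
            # ... here the temporary execution data would be removed
```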
We can imagine the same kind of user API for the post-processing of databases. We would consider that all data produced during execution are not in the datasets (even if data files will probably be created at their final location, without having to copy them in post-processing). During the execution, all metadata would be managed only in the execution database. To get access to the data and metadata, the user would call a terminate() method that extracts information from the execution database and puts it in the datasets metadata (and copies data files from a temporary remote location if necessary). Only advanced users would have to call terminate() explicitly; most users would use it via a higher-level method such as run(...) or via a graphical interface.
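A sketch of what this user API could look like; terminate() and run() are the names proposed above, while Execution, wait(), the database objects and their methods are placeholders, not existing Capsul or MIA code:

```python
class Execution:
    """Placeholder handle on a submitted workflow (hypothetical API)."""

    def __init__(self, execution_db, datasets_db):
        self.execution_db = execution_db
        self.datasets_db = datasets_db

    def wait(self):
        """Block until the workflow has finished (stub)."""
        pass

    def terminate(self):
        """Advanced API: wait for the end of execution, then move
        metadata from the execution database into the datasets metadata
        (and copy data files from a temporary remote location if needed)."""
        self.wait()
        for path, meta in self.execution_db.pop_produced_metadata():
            self.datasets_db.record(path, meta)


def run(pipeline, execution_db, datasets_db):
    """Higher-level entry point most users would call instead of
    dealing with terminate() themselves (hypothetical signature)."""
    execution = Execution(execution_db, datasets_db)
    # ... submit `pipeline` to the workers here ...
    execution.terminate()
    return execution
```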
With this separation between datasets and execution, it is easier to imagine user interfaces. There could be a first GUI dedicated only to datasets and pipelines. This GUI would allow the user to do whatever they want with the data, and to select pipelines and prepare them for execution (i.e. select parameters). When a pipeline is validated or executed, it becomes a workflow that goes into the execution database. A second GUI would allow following and managing the execution of workflows. The execution would not change anything in the datasets until terminate() is called on a workflow; this would wait for the end of the workflow and update the datasets according to the workflow result. Only after this step would the produced data be accessible to the user via the first GUI.
Within the context of clarifying what MIA should do to run pipelines, and linked to #167 and #180, I am opening a separate issue so as not to add confusion to the existing ones, and rather to discuss what should be done.
We already have an initialization function, and a postprocess function (in #167 at least). I have questions about what should be done, and when:
- initialization
- postprocess
- manage_brick_after_run (if the process has this method, then it is called)
- which pipeline(s) to postprocess
For init, this is rather clear: init a pipeline, then run it (although there is the multiple-init issue mentioned above). For postprocessing it is less clear. In a "synchronous" run, the client waits for the pipeline execution to end, then postprocesses it - this is the "simple" case. But in a remote / asynchronous execution, the execution is handled in a soma-workflow server, and the client (MIA) can be closed while the processing goes on and eventually finishes; the client may be reconnected later. Several runs may be launched. So the postprocessing procedure should not be limited to a single, "current" pipeline run, but should recover all runs that have been started. Moreover, a pipeline run may have been stopped (by user interruption or a technical problem) or failed, and restarted after a first postprocessing. In my opinion a pipeline run is still living as long as its workflow is still in the soma-workflow database (or at least until it completes entirely without error). How do we know/decide which pipelines have to be postprocessed? How do we remember them after a client quit/restart? Do we index runs in the MIA database? Or in soma-workflow?
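One possible direction, purely as a sketch and not an existing API: index runs in the MIA database, keyed by the soma-workflow workflow id, and on client (re)start scan the index for runs that are finished but not yet postprocessed. The table name, columns and the is_workflow_done() callable below are all hypothetical:

```python
import sqlite3


def ensure_run_index(conn):
    """Create a hypothetical run index table in the MIA database."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS pipeline_runs (
            workflow_id   TEXT PRIMARY KEY,   -- soma-workflow workflow id
            pipeline_name TEXT,
            status        TEXT,               -- 'running', 'done', 'failed'
            postprocessed INTEGER DEFAULT 0   -- 0 until postprocess ran
        )""")


def register_run(conn, workflow_id, pipeline_name):
    """Record a run when it is submitted to soma-workflow."""
    conn.execute(
        "INSERT OR REPLACE INTO pipeline_runs VALUES (?, ?, 'running', 0)",
        (workflow_id, pipeline_name))


def runs_to_postprocess(conn, is_workflow_done):
    """On client (re)start: return workflow ids that finished but were
    never postprocessed.  `is_workflow_done` is a hypothetical callable
    that would query soma-workflow for the workflow status."""
    pending = []
    for (workflow_id,) in conn.execute(
            "SELECT workflow_id FROM pipeline_runs WHERE postprocessed = 0"):
        if is_workflow_done(workflow_id):
            pending.append(workflow_id)
    return pending


def mark_postprocessed(conn, workflow_id):
    """Flag a run as postprocessed so it is not picked up again."""
    conn.execute(
        "UPDATE pipeline_runs SET status = 'done', postprocessed = 1 "
        "WHERE workflow_id = ?", (workflow_id,))


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # stand-in for the MIA database file
    ensure_run_index(conn)
    register_run(conn, "wf-42", "preprocessing")
    print(runs_to_postprocess(conn, is_workflow_done=lambda wf_id: True))
```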