servoz opened this issue 1 month ago
I think this answer is also related to #11.
In my opinion, the data related to datasets and the data used during execution are two different objects. The former is permanent and should contain all information about what happened to existing files. The execution database contains an up-to-date processing execution status; it is temporary and is erased once execution is done and the execution data has been included in the datasets data. This last transfer step between execution data and datasets data is what you called post-processing.
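Something like this, as a purely illustrative sketch (none of these classes or functions exist in Capsul or MIA, the names are made up to show the separation and the transfer step):

```python
# Hypothetical sketch: two separate stores and the post-processing transfer.

class DatasetsDB:
    """Permanent store: everything that happened to existing files."""

    def __init__(self):
        self.metadata = {}  # path -> metadata dict

    def record(self, path, meta):
        self.metadata.setdefault(path, {}).update(meta)


class ExecutionDB:
    """Temporary store: up-to-date status of one execution, erased afterwards."""

    def __init__(self):
        self.status = "not_started"
        self.produced = {}  # path -> metadata produced during execution

    def erase(self):
        self.status = "disposed"
        self.produced.clear()


def post_process(execution_db, datasets_db):
    """Transfer execution results into the permanent datasets store,
    then erase the temporary execution data."""
    assert execution_db.status == "done"
    for path, meta in execution_db.produced.items():
        datasets_db.record(path, meta)
    execution_db.erase()
```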
In Capsul 3 there is already a temporary execution database (internally it used to be Redis; now it is a populse_db Storage, hence an SQLite file) that can be accessed by both the client and the workers. Cleaning the execution data requires two conditions: the client must have called dispose() on the execution, saying that it does not need the data anymore, and the execution must be finished. This cleaning is done either in worker code, if dispose() was called before the end of execution, or in user code, if the user called dispose() after the end of the process execution.
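A rough sketch of this two-condition cleanup, assuming a simple in-memory flag model rather than the real populse_db Storage schema; whichever side (client or worker) fulfils the second condition performs the cleanup:

```python
import threading


class ExecutionRecord:
    """Illustrative only: the real Capsul 3 execution database is a
    populse_db Storage (an SQLite file), not this in-memory object."""

    def __init__(self):
        self._lock = threading.Lock()
        self.disposed = False   # client called dispose()
        self.finished = False   # workers finished the execution
        self.cleaned = False

    def dispose(self):
        """Called by the client: it no longer needs the execution data."""
        with self._lock:
            self.disposed = True
            self._maybe_clean()

    def mark_finished(self):
        """Called by worker code when the execution ends."""
        with self._lock:
            self.finished = True
            self._maybe_clean()

    def _maybe_clean(self):
        # Cleanup happens only when both conditions hold; whichever side
        # (client or worker) fulfils the second one does the work.
        if self.disposed and self.finished and not self.cleaned:
            self.cleaned = True
            # ... here the temporary execution data would be removed
```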
We can imagine the same kind of user API for the post-processing of databases. We would consider that all data produced during execution are not in the datasets (even if data files will probably be created at their final location, without having to copy them in post-processing). During the execution, all metadata would be managed only in the execution database. To get access to the data and metadata, the user would call a terminate() method that extracts information from the execution database and puts it in the datasets metadata (and copies data files from a temporary remote location if necessary). Only advanced users would have to call terminate() explicitly; most users would use it via a higher-level method such as run(...) or via a graphical interface.
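A sketch of what this user API could look like; terminate() and run() are the names proposed above, while Execution, wait(), the database objects and their methods are placeholders, not existing Capsul or MIA code:

```python
class Execution:
    """Placeholder handle on a submitted workflow (hypothetical API)."""

    def __init__(self, execution_db, datasets_db):
        self.execution_db = execution_db
        self.datasets_db = datasets_db

    def wait(self):
        """Block until the workflow has finished (stub)."""
        pass

    def terminate(self):
        """Advanced API: wait for the end of execution, then move
        metadata from the execution database into the datasets metadata
        (and copy data files from a temporary remote location if needed)."""
        self.wait()
        for path, meta in self.execution_db.pop_produced_metadata():
            self.datasets_db.record(path, meta)


def run(pipeline, execution_db, datasets_db):
    """Higher-level entry point most users would call instead of
    dealing with terminate() themselves (hypothetical signature)."""
    execution = Execution(execution_db, datasets_db)
    # ... submit `pipeline` to the workers here ...
    execution.terminate()
    return execution
```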
With this separation between datasets and execution, it is easier to imagine user interfaces. There could be a first GUI dedicated only to datasets and pipelines. This GUI would allow the user to do whatever they want with the data, and to select pipelines and prepare them for execution (i.e. select parameters). When a pipeline is validated or executed, it becomes a workflow that goes into the execution database. A second GUI would allow following and managing the execution of workflows. The execution would not change anything in the datasets until terminate() is called on a workflow; this would wait for the end of the workflow and update the datasets according to the workflow result. Only after this step would the produced data be accessible to the user via the first GUI.
Within the context of clarifying what MIA should do to run pipelines, and linked to #167 and #180, I am opening a separate issue so as not to add confusion to the existing ones, and rather to discuss what should be done.
We already have an initialization function, and a postprocess function (in #167 at least). I have questions about what should be done, and when:
- initialization
- postprocess
- manage_brick_after_run (if the process has this method, then it is called)
- which pipeline(s) to postprocess
For init, this is rather clear: init a pipeline, then run it (although there is the multiple-init issue mentioned above). For postprocessing it is less clear. In a "synchronous" run, the client waits for the pipeline execution to end, then postprocesses it - this is the "simple" case. But in a remote / asynchronous execution, the execution is handled in a soma-workflow server, and the client (MIA) can be closed while the processing goes on and eventually finishes; the client may be reconnected later. Several runs may be launched. So the postprocessing procedure should not be limited to a single, "current" pipeline run, but should recover all runs that have been started. Moreover, a pipeline run may have been stopped (by user interruption or a technical problem) or failed, and restarted after a first postprocessing. In my opinion a pipeline run is still living as long as its workflow is still in the soma-workflow database (or at least until it completes entirely without error). How do we know/decide which pipelines have to be postprocessed? How do we remember them after a client quit/restart? Do we index runs in the MIA database? Or in soma-workflow?
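One possible direction, purely as a sketch and not an existing API: index runs in the MIA database, keyed by the soma-workflow workflow id, and on client (re)start scan the index for runs that are finished but not yet postprocessed. The table name, columns and the is_workflow_done() callable below are all hypothetical:

```python
import sqlite3


def ensure_run_index(conn):
    """Create a hypothetical run index table in the MIA database."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS pipeline_runs (
            workflow_id   TEXT PRIMARY KEY,   -- soma-workflow workflow id
            pipeline_name TEXT,
            status        TEXT,               -- 'running', 'done', 'failed'
            postprocessed INTEGER DEFAULT 0   -- 0 until postprocess ran
        )""")


def register_run(conn, workflow_id, pipeline_name):
    """Record a run when it is submitted to soma-workflow."""
    conn.execute(
        "INSERT OR REPLACE INTO pipeline_runs VALUES (?, ?, 'running', 0)",
        (workflow_id, pipeline_name))


def runs_to_postprocess(conn, is_workflow_done):
    """On client (re)start: return workflow ids that finished but were
    never postprocessed.  `is_workflow_done` is a hypothetical callable
    that would query soma-workflow for the workflow status."""
    pending = []
    for (workflow_id,) in conn.execute(
            "SELECT workflow_id FROM pipeline_runs WHERE postprocessed = 0"):
        if is_workflow_done(workflow_id):
            pending.append(workflow_id)
    return pending


def mark_postprocessed(conn, workflow_id):
    """Flag a run as postprocessed so it is not picked up again."""
    conn.execute(
        "UPDATE pipeline_runs SET status = 'done', postprocessed = 1 "
        "WHERE workflow_id = ?", (workflow_id,))


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # stand-in for the MIA database file
    ensure_run_index(conn)
    register_run(conn, "wf-42", "preprocessing")
    print(runs_to_postprocess(conn, is_workflow_done=lambda wf_id: True))
```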