thesofakillers / nowcastlib

🧙‍♂️🔧 Utils that can be reused and shared across and beyond the ESO Nowcast project
https://giuliostarace.com/nowcastlib
GNU General Public License v3.0

Get rid of cascade-like nature of pipeline #7

Open thesofakillers opened 3 years ago

thesofakillers commented 3 years ago

Currently, because the pipeline assumes an order of operations, running an individual process (e.g. postprocessing) will also run all the individual processes leading up to it.

For example, suppose the user wants to run postprocessing. The pipeline will run preprocessing, synchronization and postprocessing in that order.

At the moment, the best way to keep a process truly independent of the previous ones is to keep the configuration for those previous processes to a minimum, so that minimal processing is performed.

This is, however, a bit cumbersome, as the user needs to open, edit and maintain different configuration files for different processes, which defeats the purpose of having a single configuration schema (the DataSet config struct).

The reason the pipeline works this way is that the output of a given process serves as the input to the next process, and the only input the user can specify in the configuration is the input to the first step of the pipeline, i.e. preprocessing. Therefore, if a user wishes to run a process, all the processes leading up to it need to run so that it receives the right input.
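
To illustrate the cascade, here is a toy sketch. It is not the actual nowcastlib code: the function names, the `calls` list and the `"input"` config key are made up for illustration. Each step obtains its input by running the step before it, so asking for the last step runs everything.

```python
calls = []  # hypothetical log of which steps actually ran


def preprocess(config):
    # The only step whose input comes from the user's configuration.
    calls.append("preprocess")
    return config["input"]


def synchronize(config):
    # Has no input of its own, so it must run preprocessing first.
    data = preprocess(config)
    calls.append("synchronize")
    return data


def postprocess(config):
    # Likewise depends on the output of synchronization.
    data = synchronize(config)
    calls.append("postprocess")
    return data


postprocess({"input": [1, 2, 3]})
print(calls)  # ['preprocess', 'synchronize', 'postprocess']
```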


Ideally, the user should be able to have a very complete configuration (if they wish) but choose to run only a part of the pipeline by using the right CLI command and providing the necessary input themselves.

So, if the user wanted to postprocess a synchronized dataset that they already have, they would call nowcastlib postprocess with the relevant configuration and the path to the file they wish to postprocess.

Ideally, this would tell the pipeline to only perform postprocessing, rather than the current behaviour, in which preprocessing and synchronization are performed beforehand.


Each subprocess CLI command should therefore take at least one additional (optional) argument, -i or --input, where the user can specify the path to an input file to use, so as to be able to skip all the previous steps.
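
A rough sketch of how this could look, assuming the CLI is built on argparse subcommands. Only `nowcastlib postprocess` and `-i`/`--input` come from the proposal above; the other subcommand names, the `-c`/`--config` flag, and the `build_parser` and `steps_to_run` helpers are hypothetical.

```python
import argparse

# Hypothetical subcommand names, in pipeline order.
STEPS = ("preprocess", "sync", "postprocess")


def build_parser():
    parser = argparse.ArgumentParser(prog="nowcastlib")
    subparsers = parser.add_subparsers(dest="command", required=True)
    for step in STEPS:
        sub = subparsers.add_parser(step)
        # Hypothetical flag pointing at the single DataSet config file.
        sub.add_argument("-c", "--config", required=True)
        # Proposed optional flag: a ready-made input for this step.
        sub.add_argument("-i", "--input", default=None)
    return parser


def steps_to_run(command, input_path):
    """Return the steps to execute for the requested command."""
    if input_path is not None:
        # The user supplied the input, so the earlier steps can be skipped.
        return [command]
    # No input given: fall back to the current cascading behaviour.
    return list(STEPS[: STEPS.index(command) + 1])


args = build_parser().parse_args(
    ["postprocess", "-c", "config.json", "-i", "synced.csv"]
)
print(steps_to_run(args.command, args.input))  # ['postprocess']
```

Keeping the cascade as the fallback when `-i` is omitted would leave existing invocations working unchanged.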