A list of potential requirements for workflows. This is just a collection of everything we can think of, to serve as a starting point, not a list of actual requirements. This must be curated:
`sc.transform_coords` supports this via `sc.show_graph`.
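For readers less familiar with this feature, a minimal sketch of declaring a coordinate-transformation graph and visualizing it; the `speed`/`distance`/`time` names are placeholders, not taken from an actual reduction workflow:

```python
import scipp as sc

# Hypothetical coordinate-transformation graph; names and formula are placeholders.
def speed(distance, time):
    return distance / time

graph = {'speed': speed}

# Render the graph (requires graphviz) without executing it.
sc.show_graph(graph)

# Apply it to a data array that has 'distance' and 'time' coordinates.
da = sc.DataArray(
    sc.ones(dims=['event'], shape=[3]),
    coords={
        'distance': sc.array(dims=['event'], values=[1.0, 2.0, 3.0], unit='m'),
        'time': sc.array(dims=['event'], values=[2.0, 2.0, 2.0], unit='s'),
    },
)
da = da.transform_coords('speed', graph=graph)
```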
Found out that orsopy can store some logs, and this is used in parts of the Amor reduction. In general, we need to see how to integrate this kind of functionality with other packages.
The logs that orsopy takes are specific to the expectations of the ORSO file format. At the moment this is essentially hardcoded in the Amor reduction workflow (i.e., if step x is run, y is added to the ORSO data object). I suspect there is a more elegant way of achieving this, though.
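To make the "hardcoded" pattern concrete, a rough sketch; the step name and metadata field are invented, and a plain dict stands in for the actual orsopy data object:

```python
# Hypothetical reduction step: if this step runs, a corresponding entry is
# appended to the ORSO metadata by hand, coupling the step tightly to the
# file format.
def normalize_by_monitor(data, monitor, orso_metadata: dict):
    result = data / monitor
    orso_metadata.setdefault('corrections', []).append('normalisation by monitor')
    return result
```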
Incomplete, but this was the result of yesterday's discussion (color-coded diagram omitted).
Another practical challenge is how to define input parameters and how to channel them to the correct workflow step. Consider, for example, Mantid's DgsReduction. It has tens of input parameters that are required at various steps of the reduction: (truncated)
Maybe a useful way of considering this is by looking at Blender's node editor:
Our workflows are similar to those pipelines. Many parameters are needed at various stages. Many of them have default values; others are optional. In the graphical node editor it is very obvious to the user what is used where, and they can provide the values directly. Our current approach (Mantid or the existing workflows in this repository) attempts to "flatten" this representation into a single big list of input parameters, which has to rely on naming, descriptions, and comments to convey meaning. Is there a better way? Should we at the very least preserve structure in the parameter map? But even that does not feel intuitive.
A naive way of encoding params that includes the location where they are used is a dict from workflow step name to parameter. This would work for simple workflows but not for complex ones where some steps appear multiple times, e.g. there are multiple monitor-normalisations.
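Roughly what this naive encoding could look like (step and parameter names are invented); the clash for repeated steps is immediately visible:

```python
# Maps workflow-step name -> parameters for that step.
params = {
    'convert_to_wavelength': {'n_bins': 100},
    'normalize_by_monitor': {'threshold': 0.1},
    # Problem: if monitor normalisation happens twice (e.g. for sample and
    # direct-beam runs), a single key cannot distinguish the two occurrences:
    # 'normalize_by_monitor': {'threshold': 0.3},  # would overwrite the entry above
}
```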
I think the only really clear way is to use a structure that directly mirrors the workflow, e.g. a graph, or simply a Jupyter notebook where params are passed to the functions directly. But the only way to make this really apparent and easy to understand would be a graphical tool as in Blender. Doing this in code (Python, YAML, ...) will always have to flatten the structure to some extent because code is inherently linear.
> I think the only really clear way is to use a structure that directly mirrors the workflow.
Yes, that is my current thinking as well. We'd have some sort of nested structure of dataclasses (or similar) with parameters.
This still has a problem. Such a structure is tree-like. So if a parameter is shared between multiple steps (e.g. the number of wavelength bins), that parameter needs to be duplicated. And when changing it, the user has to change it in all places.
If we want to use something like dataclasses, this could be solved by having a shared mutable structure in the dataclasses. But that can be messy.
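A sketch of the nested-dataclass idea (all names invented), which also illustrates the duplication problem for a shared parameter such as the number of wavelength bins:

```python
from dataclasses import dataclass

@dataclass
class ConversionParams:
    n_wavelength_bins: int

@dataclass
class MonitorNormalizationParams:
    n_wavelength_bins: int
    threshold: float

@dataclass
class IofQParams:
    # The nesting mirrors the workflow structure ...
    convert: ConversionParams
    normalize_sample: MonitorNormalizationParams
    normalize_direct_beam: MonitorNormalizationParams

# ... but the shared wavelength binning is repeated in three places and has
# to be kept in sync by the user.
params = IofQParams(
    convert=ConversionParams(n_wavelength_bins=100),
    normalize_sample=MonitorNormalizationParams(n_wavelength_bins=100, threshold=0.1),
    normalize_direct_beam=MonitorNormalizationParams(n_wavelength_bins=100, threshold=0.1),
)
```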
What I was more referring to was an actual graph that allows for (undirected) cycles. E.g. the number of wavelength bins could itself be a vertex which all interested steps depend on. (Like the two 'Value' boxes in your Blender example.)
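One minimal way to picture this (vertex names invented): a mapping from each vertex to the vertices it depends on, with the shared parameter as a vertex of its own, so it only needs to be set once.

```python
# Each key is a vertex (a parameter, an input, or a step); the value lists its inputs.
workflow_graph = {
    'n_wavelength_bins': [],            # shared parameter vertex
    'sample_monitor': [],               # input data vertices
    'direct_beam_monitor': [],
    'normalize_sample_monitor': ['n_wavelength_bins', 'sample_monitor'],
    'normalize_direct_beam_monitor': ['n_wavelength_bins', 'direct_beam_monitor'],
}
```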
> What I was more referring to was an actual graph that allows for (undirected) cycles. E.g. the number of wavelength bins could itself be a vertex which all interested steps depend on. (Like the two 'Value' boxes in your Blender example.)
Isn't that equivalent to the flat list that, e.g., the Mantid workflows use, with all the problems that come from not having structure?
How so?
> How so?
Can you give a code example of how a user would set parameters in this graph?
Like I said, code is linear. So no matter what structure we use, it will always be flat (except for a potential tree-structure like YAML / nested dicts).
```python
I_of_q = make_workflow()
# find_wavelength_bin_param needs to be implemented by searching through the graph
n_bins = find_wavelength_bin_param(I_of_q)
n_bins.value = 100
result = I_of_q.compute()
```
So as I said above: your "graph" suggestion is the "same" as the flat structure used by Mantid workflows. That is why I, too, suggest using nested structures of parameters (dataclasses, or, as you mention, plain nested dicts) to avoid that.
I disagree. The graph is not flat but nested, and it may even contain cycles. The access is flat because it is done in Python.
But we should discuss this in person instead of with hand-wavy arguments here.
> I disagree. The graph is not flat but nested, and it may even contain cycles. The access is flat because it is done in Python.
Users usually do not see the graph when setting parameters. From this point of view, it is irrelevant what structure is used underneath.
We are now at a point where Sciline has all the basics. We can thus go back to the requirements table and see which boxes we can tick off. For what is left, evaluate whether it should extend Sciline or whether it will go elsewhere.
Write a design document, leading to the implementation of Sciline.
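For reference, a small sketch of how such a graph is expressed in Sciline at the time of writing; the domain types and bodies are invented placeholders, but the pattern (parameters and intermediate results are both graph nodes, identified by type) is the one Sciline uses:

```python
from typing import NewType

import sciline

Filename = NewType('Filename', str)
NWavelengthBins = NewType('NWavelengthBins', int)
RawData = NewType('RawData', list)
IofQ = NewType('IofQ', list)

def load(filename: Filename) -> RawData:
    return RawData([1.0, 2.0, 3.0])  # placeholder for actual file loading

def reduce(data: RawData, n_bins: NWavelengthBins) -> IofQ:
    return IofQ(data[:n_bins])  # placeholder for the actual reduction

pipeline = sciline.Pipeline(
    [load, reduce],
    params={Filename: Filename('sample.nxs'), NWavelengthBins: NWavelengthBins(2)},
)
result = pipeline.compute(IofQ)
```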
Overview
There is a significant number of cross-technique and cross-workflow requirements. We need to avoid developing too many flavors of the same thing, as it will complicate understanding, testing, and maintaining the code. A template or design to follow would provide guidance for the development of new workflows and would avoid the overhead of having to make the same decisions over and over again.
Examples:
Notes
`scipp.DataGroup` will be available soon. It may play a central role in this.
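For context, a brief sketch (assuming the dict-like, entry-broadcasting interface DataGroup was announced with) of how grouped workflow data could be carried and transformed as a single object:

```python
import scipp as sc

run = sc.DataGroup({
    'sample': sc.arange('tof', 5, unit='counts'),
    'monitor': sc.arange('tof', 5, unit='counts'),
})

# Operations apply to every entry, so a workflow step can treat the group as
# one value instead of passing sample and monitor around separately.
scaled = run * 2.0
print(list(scaled.keys()))  # ['sample', 'monitor']
```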