scipp / sciline

Build scientific pipelines for your data
https://scipp.github.io/sciline/
BSD 3-Clause "New" or "Revised" License

save pipeline structure with parameters for reproduction #92

Open dschick opened 8 months ago

dschick commented 8 months ago

Hi sciline team,

First of all, many thanks for this great package. We had been thinking of designing a similar system for pipeline data processing based on xarray data containers, and luckily found your work before writing a line of code.

In some of our data analysis tasks, we have some rather expensive producers (e.g. phase retrieval methods for X-ray holography), and in addition to the actual result of a pipeline, we would also like to save how we got there.

Obviously, we could save the list of producers and parameters on our own, but how about a dedicated method of the pipeline class, similar to visualize(), that returns not only the structure of the graph but also the actual parameter values?

In addition to the names of the producers, one could also think of saving their source code and/or a hash of it for full reproducibility. This feature could be extremely helpful during beamtimes, when code is often changed during online analysis.
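The source-hashing idea could be sketched with the standard library alone. The following is a hypothetical helper (not part of sciline; the provider `retrieve_phase` is made up) that fingerprints a provider by hashing its source code with SHA-256:

```python
# Sketch of a provenance fingerprint for a provider: grab its source with
# inspect and hash it. Hypothetical helper, not a sciline API; requires the
# function to be defined in an importable source file.
import hashlib
import inspect


def provider_fingerprint(func) -> dict:
    """Return the provider's name and a SHA-256 hash of its source code."""
    source = inspect.getsource(func)
    return {
        "name": func.__qualname__,
        "sha256": hashlib.sha256(source.encode()).hexdigest(),
    }


def retrieve_phase(hologram: list) -> list:
    """Stand-in for an expensive producer such as phase retrieval."""
    return hologram


fp = provider_fingerprint(retrieve_phase)
```

Re-running the analysis later, one could compare the stored hash against the current source to detect code changes made during online analysis.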

Best

Daniel

jl-wynen commented 8 months ago

This was on our list of requirements when designing Sciline, but it has had low priority lately. So thanks for the reminder!

As you said, we will likely have to write out the graph and parameter values. There are some open questions, though:

In addition to the names of the producers, one could also think of saving their source code and/or a hash of it for full reproducibility.

This is an interesting idea. But it would be an incomplete solution, because providers typically call additional functions, and we couldn't reasonably record their source code or hashes as well.

In our case, we expect to have a script or Jupyter notebook that defines the graph and possibly some specialised providers, as well as one or more packages that define most providers. My assumption was that we would at least record the precise versions of all relevant packages (or a full pip freeze or conda list) and, at least for code controlled by us, the full script or notebook that defines the pipeline. Those files can then be archived in SciCat together with the processed data. But I admit that this requires some work from the pipeline author and only really works when we can associate all files with each other with a catalogue like SciCat.
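Recording the package environment next to the processed data could look like this. A minimal sketch, assuming pip is available; the function name and file name are arbitrary:

```python
# Sketch: snapshot the exact package environment alongside pipeline output.
# Assumes pip is installed in the running interpreter's environment.
import subprocess
import sys


def snapshot_environment(path: str = "requirements.lock") -> None:
    """Write the output of `pip freeze` to a file for later archiving."""
    frozen = subprocess.run(
        [sys.executable, "-m", "pip", "freeze"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    with open(path, "w") as f:
        f.write(frozen)
```

The resulting lock file could then be archived in SciCat together with the script or notebook and the processed data.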

dschick commented 8 months ago

It's great that you have a similar interest here :) And obviously, your thoughts are already more advanced than mine.

I'll be happy to discuss this further at any point. Feel free to close this issue for now.

Best

Daniel

jl-wynen commented 8 months ago

I'll keep it open as a reminder.

I'd be happy to hear your insights into how you and your users want to handle provenance and what requirements you have!

SimonHeybrock commented 8 months ago

My idea so far was to store the graph in a Sciline-independent manner. "Producers" and "Parameters" are strictly speaking an implementation detail of Sciline, so one would not want to rely on this for long-term archiving of data, FAIR data, ...

The computational graph is hopefully more meaningful (when combined with input parameters). So we should look into how this can be stored in a generic manner. I don't know if studying, e.g., how Snakemake handles this can provide some guidance.

SimonHeybrock commented 7 months ago

Conclusion for now:

  1. Understanding how to store parameters (they may be large; should we store hashes? ...) will require more thought.
  2. As a first step, implement a simple way of serializing/storing the graph; this should not be blocked by item 1.
  3. Second, think about and implement a "good enough for now" solution for parameters; maybe large parameters are uncommon and we can simply ignore that problem for now.

jl-wynen commented 7 months ago

To get an overview of some formats used in practice, take a look at https://networkx.org/documentation/stable/reference/readwrite/index.html

From this list, I'd prefer JSON, or possibly the adjacency list / multiline adjacency list formats. JSON in particular, because it makes it easy to also store parameter values without inventing a new format.
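To make the suggestion concrete, here is what a node-link style JSON serialization of a task graph could look like. This mirrors the layout of networkx's node-link format but uses only the standard library; the node names, kinds, and parameter values are made up for illustration:

```python
# Sketch: a node-link style JSON representation of a small task graph.
# Structure modeled on networkx's node_link_data output; all node names
# and fields are hypothetical.
import json

graph = {
    "directed": True,
    "nodes": [
        {"id": "Filename", "kind": "parameter", "value": "run_001.h5"},
        {"id": "RawData", "kind": "provider", "provider": "load"},
        {"id": "Result", "kind": "provider", "provider": "reduce"},
    ],
    "links": [
        {"source": "Filename", "target": "RawData"},
        {"source": "RawData", "target": "Result"},
    ],
}

serialized = json.dumps(graph, indent=2)  # ready to archive as a text file
restored = json.loads(serialized)         # round-trips without a custom parser
```

One advantage of this layout is that parameter values can sit directly on the parameter nodes, so graph structure and inputs live in a single file.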

jl-wynen commented 7 months ago
  1. Understanding how to store parameters (They may be large? Should we store hashes? ...) will require more thought.

Can you explain why parameters might be large? I thought they would only be single/few numbers or strings. All large data would be read from a file.

SimonHeybrock commented 7 months ago
  1. Understanding how to store parameters (They may be large? Should we store hashes? ...) will require more thought.

Can you explain why parameters might be large? I thought they would only be single/few numbers or strings. All large data would be read from a file.

A parameter can be anything. For example, you can process an intermediate result, set it as a parameter, and create a new task graph. Parameters can thus have arbitrary size, and they might not be serializable at all.
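A "good enough for now" fallback along the lines of item 3 above might store small serializable values directly and only a hash and type name for everything else. A rough sketch (hypothetical helper, not a sciline API; note that hashing `repr` is lossy and not stable for objects whose repr includes memory addresses):

```python
# Sketch: encode parameters for storage, falling back to a hash of the repr
# when a value is not JSON-serializable. Hypothetical helper for illustration.
import hashlib
import json


def encode_param(value) -> dict:
    """Store JSON-serializable values directly; hash everything else."""
    try:
        json.dumps(value)
        return {"value": value}
    except TypeError:
        digest = hashlib.sha256(repr(value).encode()).hexdigest()
        return {"repr_sha256": digest, "type": type(value).__name__}


params = {"threshold": 0.5, "mask": object()}  # object() is not serializable
encoded = {name: encode_param(v) for name, v in params.items()}
```

This loses the actual value of large or exotic parameters, but at least records that they existed and what type they had.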

jl-wynen commented 7 months ago

Are there any objections to using the json format described by networkx? If not, I'll implement that.

SimonHeybrock commented 7 months ago

JSON sounds good!

jl-wynen commented 6 months ago

First part done in #124. Now we need to figure out how to handle parameters.