pyiron / pyiron_workflow

Graph-and-node based workflows
BSD 3-Clause "New" or "Revised" License

💡Some ideas regarding storage (DB) concepts #126

Open JNmpi opened 8 months ago

JNmpi commented 8 months ago

@liamhuber, as we have just discussed, please find below a sketch of the envisioned storage concept. The main idea is to have two databases/stores: one for the nodes, macros, and workflows (left side), and one for the data resulting from running those nodes, workflows, etc.

[figure: sketch of the envisioned two-database storage concept]
liamhuber commented 7 months ago

Follow up content from a conversation between me, @pmrv, and @JNmpi:

One can imagine two fundamentally different types of table: one for storing classes, i.e. templates to use in your new work, and one for storing instances, i.e. coupling a particular node with particular input (hash & storage location) to particular output (storage location). This is already reflected in the NodeDB and InstanceDB above.

For NodeDB we would imagine queries where a node class is given, and related node classes are returned -- e.g. alternate versions of the same node, or macros that contain that node somewhere in their graph.
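A quick sketch of what such NodeDB queries could look like, using an in-memory SQLite table with purely illustrative class names and a hypothetical `contains` relation recording which macros hold which nodes in their graphs:

```python
import sqlite3

# Hypothetical NodeDB: a table of node classes plus a containment relation
# (which macro classes contain which node classes somewhere in their graph).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE node_classes (package TEXT, class TEXT, version TEXT);
CREATE TABLE contains (macro_class TEXT, child_class TEXT);
""")
con.executemany(
    "INSERT INTO node_classes VALUES (?, ?, ?)",
    [("pyiron_nodes", "Relax", "0.1"),
     ("pyiron_nodes", "Relax", "0.2"),
     ("pyiron_nodes", "BulkModulus", "0.1")],
)
con.execute("INSERT INTO contains VALUES ('BulkModulus', 'Relax')")

# Query: alternate versions of a given node class
versions = [v for (v,) in con.execute(
    "SELECT version FROM node_classes WHERE class = 'Relax' ORDER BY version")]
print(versions)  # ['0.1', '0.2']

# Query: macros that contain that node somewhere in their graph
macros = [m for (m,) in con.execute(
    "SELECT macro_class FROM contains WHERE child_class = 'Relax'")]
print(macros)  # ['BulkModulus']
```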

For InstanceDB, the fundamental table might be something like "node specifier" (package + class + version), "input hash", "storage location". So each time a node is run and saved, it registers these things with the DB. (One can imagine refactoring this database into nested levels, e.g. (package, class, subtable) and (version, subsubtable), where the subsubtable is the one holding node_specifier, input_hash, and location, or what have you.) Some ideas of the sorts of queries we might make:

The last case shows how such a DB could be used to accelerate workflows where the compute cost is high relative to the storage cost; conversely, if the compute cost is trivial we might never want to store it in the instance DB!
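To make the register/look-up cycle concrete, here is a minimal sketch of such an InstanceDB as a single flat SQLite table. The node specifier string, path, and hashing of a JSON-serialisable input dict are all assumptions for illustration; a cache hit means the stored result can be loaded instead of recomputing:

```python
import hashlib
import json
import sqlite3

# Hypothetical minimal InstanceDB: one flat table keyed by node specifier
# (package + class + version) and a deterministic hash of the input.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE instances ("
    "node_specifier TEXT, input_hash TEXT, storage_location TEXT, "
    "PRIMARY KEY (node_specifier, input_hash))"
)

def input_hash(inputs):
    # Hash a JSON-serialisable input dict; sort_keys makes it deterministic.
    return hashlib.sha256(
        json.dumps(inputs, sort_keys=True).encode()
    ).hexdigest()

def register(specifier, inputs, location):
    # Called each time a node is run and saved.
    con.execute(
        "INSERT OR REPLACE INTO instances VALUES (?, ?, ?)",
        (specifier, input_hash(inputs), location),
    )

def lookup(specifier, inputs):
    # Return the storage location of a previous identical run, or None.
    row = con.execute(
        "SELECT storage_location FROM instances "
        "WHERE node_specifier = ? AND input_hash = ?",
        (specifier, input_hash(inputs)),
    ).fetchone()
    return row[0] if row else None

register("pyiron_nodes.atomistic.Relax@0.1", {"element": "Al"}, "/store/abc123")
print(lookup("pyiron_nodes.atomistic.Relax@0.1", {"element": "Al"}))  # /store/abc123
print(lookup("pyiron_nodes.atomistic.Relax@0.1", {"element": "Cu"}))  # None
```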

Note also that if the node output is stochastic, responsibility then falls to the node designer to make sure the random seed is also included in the IO. This wouldn't be perfect because I guess there is some hardware- and cosmic-ray dependence, but it would let us store multiple copies of output for the same nominal input -- particularly important for, e.g., molecular dynamics.
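For the stochastic case, folding the seed into the hashed input is enough to keep runs with the same nominal input distinct in the instance DB. A tiny sketch (the `__seed__` key and the MD-style input dict are invented for illustration):

```python
import hashlib
import json

def hashed_io(inputs, seed=None):
    # For stochastic nodes, fold the random seed into the hashed input so
    # runs with different seeds register as distinct instances.
    payload = dict(inputs)
    if seed is not None:
        payload["__seed__"] = seed
    return hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode()
    ).hexdigest()

md_input = {"structure": "Al_fcc_108", "temperature": 300}
h1 = hashed_io(md_input, seed=1)
h2 = hashed_io(md_input, seed=2)
print(h1 != h2)  # True: same nominal input, two separately stored outputs
```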

These thoughts are neither complete nor, relative to the sketched figure above, particularly novel -- I just wanted to add some detail, as I understood it, following our very nice chat.

liamhuber commented 1 month ago

@JNmpi, @jan-janssen and I had a nice meeting on this topic this morning, and Joerg has content already pushed in the pyiron_nodes repo.

Here are my takeaways for the target spec:

@jan-janssen, @JNmpi did I miss any key points?

jan-janssen commented 1 month ago

When I read through the posts in this issue I wonder: are they connected? Compatible? In principle, the idea of a class database and an instance database seems to be missing from the recent summary. Maybe a concrete example can help: what is the minimal database implementation that would already benefit a specific use case? Take the fitting of machine learning potentials with different hyperparameters (e.g. number of basis functions), different training sets, and validation with tools like calculated energy-volume curves. That is basically a three-component workflow, and by hashing the different components the connections become clearer.
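To sketch how hashing could chain those three components together (all names and hyperparameters here are invented for illustration): each component's input includes the hash of the component it depends on, so changing a hyperparameter in the fit automatically changes the key under which the validation is cached.

```python
import hashlib
import json

def h(obj):
    # Short deterministic hash of a JSON-serialisable component description.
    return hashlib.sha256(
        json.dumps(obj, sort_keys=True).encode()
    ).hexdigest()[:12]

# Component 1: training set, identified by its own hash
training_set = {"name": "Al_dft_frames", "version": 3}

# Component 2: potential fit, whose input includes the training-set hash
fit_input = {"training_set": h(training_set), "n_basis": 12, "cutoff": 5.0}

# Component 3: validation (energy-volume curve), keyed by the fit's hash
validation_input = {"potential": h(fit_input), "task": "energy_volume_curve"}

# Changing a hyperparameter changes the fit hash, which propagates to the
# validation key -- cached validation results are never reused for a
# different potential.
fit_input_b = dict(fit_input, n_basis=16)
print(h(fit_input) == h(fit_input_b))  # False
```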

JNmpi commented 1 month ago

Thanks @liamhuber for nicely summarizing our discussion. A simple approach to marking which input is regarded as human-readable and can serve as an endpoint would be to mark certain dataclass definitions as compatible (in addition to simple data types such as strings, ints, etc.). An example would be a VASP INCAR datatype, whereas a dataclass storing atomistic structures or trajectories would be rejected. We should thus provide support for pyiron-extended dataclass objects that carry such internal (ontological) information and hints.
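One way this marking could look, sketched with an invented `hash_compatible` decorator and placeholder dataclasses (nothing here is existing pyiron API): simple types always pass, while dataclasses are accepted only if explicitly flagged.

```python
from dataclasses import dataclass, field

def hash_compatible(cls):
    # Hypothetical marker: flag a dataclass as a human-readable endpoint
    # whose fields may enter the instance-DB input hash.
    cls._hash_compatible = True
    return cls

def is_hash_compatible(obj):
    # Simple data types are always acceptable; dataclasses only if marked.
    if isinstance(obj, (str, int, float, bool)):
        return True
    return getattr(obj, "_hash_compatible", False)

@hash_compatible
@dataclass
class Incar:  # e.g. a VASP-INCAR-like settings container: human-readable
    encut: float = 520.0
    ismear: int = 0

@dataclass
class Trajectory:  # bulky structure/trajectory data: not an endpoint
    frames: list = field(default_factory=list)

print(is_hash_compatible(Incar()))       # True
print(is_hash_compatible(Trajectory()))  # False
```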

I also feel that in most practical use cases the complexity of node usage decreases with increasing level. For example, a top-level macro node will have only input and output, no cyclic workflows, etc. So I am pretty confident that in most practical cases we will only need hash-based node storage for "simple" workflows connecting high-level macros, which themselves can be highly complex. We will thus not have to worry about cyclicity, multiple outputs going into a single input, etc. I agree with @jan-janssen that real-world examples are crucial, and this is what I intended with the examples in the Jupyter notebook. They helped me a lot to better understand what structures/concepts we need.