pyiron / pyiron_workflow

Graph-and-node based workflows
BSD 3-Clause "New" or "Revised" License

💡Some ideas regarding storage (DB) concepts #126

Open JNmpi opened 8 months ago

JNmpi commented 8 months ago

@liamhuber, as we have just discussed, please find below a sketch of the envisioned storage concept. The main idea is to have two databases/stores: one for the nodes, macros, and workflows (left side), and one for the data resulting from running those nodes, workflows, etc.

[figure: sketch of the envisioned two-database storage concept]
liamhuber commented 7 months ago

Follow up content from a conversation between me, @pmrv, and @JNmpi:

One can imagine two fundamentally different types of table: one for storing classes, i.e. templates to use in your new work, and one for storing instances, i.e. coupling a particular node with particular input (hash & storage location) to particular output (storage location). This is already reflected in the NodeDB and InstanceDB above.

For NodeDB we would imagine queries where a node class is given, and related node classes are returned -- e.g. alternate versions of the same node, or macros that contain that node somewhere in their graph.
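A quick sketch of what such NodeDB queries could look like, using an in-memory SQLite table with purely illustrative class names and a hypothetical `contains` relation recording which macros hold which nodes in their graphs:

```python
import sqlite3

# Hypothetical NodeDB: a table of node classes plus a containment relation
# (which macro classes contain which node classes somewhere in their graph).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE node_classes (package TEXT, class TEXT, version TEXT);
CREATE TABLE contains (macro_class TEXT, child_class TEXT);
""")
con.executemany(
    "INSERT INTO node_classes VALUES (?, ?, ?)",
    [("pyiron_nodes", "Relax", "0.1"),
     ("pyiron_nodes", "Relax", "0.2"),
     ("pyiron_nodes", "BulkModulus", "0.1")],
)
con.execute("INSERT INTO contains VALUES ('BulkModulus', 'Relax')")

# Query: alternate versions of a given node class
versions = [v for (v,) in con.execute(
    "SELECT version FROM node_classes WHERE class = 'Relax' ORDER BY version")]
print(versions)  # ['0.1', '0.2']

# Query: macros that contain that node somewhere in their graph
macros = [m for (m,) in con.execute(
    "SELECT macro_class FROM contains WHERE child_class = 'Relax'")]
print(macros)  # ['BulkModulus']
```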

For InstanceDB, the fundamental table might be something like "node specifier" (package + class + version), "input hash", "storage location". So each time a node is run and saved, it registers these things with the DB. (One can imagine refactoring this database into nested levels, e.g. (package, class, subtable) and (version, subsubtable), where the subsubtable is the one holding node_specifier, input_hash, and location, or what have you.) Some ideas of the sorts of queries we might make:

The last case shows how such a DB could be used to accelerate workflows where the compute cost is high relative to the storage cost; conversely, if the compute cost is trivial we might never want to store it in the instance DB!
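To make the register/look-up cycle concrete, here is a minimal sketch of such an InstanceDB as a single flat SQLite table. The node specifier string, path, and hashing of a JSON-serialisable input dict are all assumptions for illustration; a cache hit means the stored result can be loaded instead of recomputing:

```python
import hashlib
import json
import sqlite3

# Hypothetical minimal InstanceDB: one flat table keyed by node specifier
# (package + class + version) and a deterministic hash of the input.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE instances ("
    "node_specifier TEXT, input_hash TEXT, storage_location TEXT, "
    "PRIMARY KEY (node_specifier, input_hash))"
)

def input_hash(inputs):
    # Hash a JSON-serialisable input dict; sort_keys makes it deterministic.
    return hashlib.sha256(
        json.dumps(inputs, sort_keys=True).encode()
    ).hexdigest()

def register(specifier, inputs, location):
    # Called each time a node is run and saved.
    con.execute(
        "INSERT OR REPLACE INTO instances VALUES (?, ?, ?)",
        (specifier, input_hash(inputs), location),
    )

def lookup(specifier, inputs):
    # Return the storage location of a previous identical run, or None.
    row = con.execute(
        "SELECT storage_location FROM instances "
        "WHERE node_specifier = ? AND input_hash = ?",
        (specifier, input_hash(inputs)),
    ).fetchone()
    return row[0] if row else None

register("pyiron_nodes.atomistic.Relax@0.1", {"element": "Al"}, "/store/abc123")
print(lookup("pyiron_nodes.atomistic.Relax@0.1", {"element": "Al"}))  # /store/abc123
print(lookup("pyiron_nodes.atomistic.Relax@0.1", {"element": "Cu"}))  # None
```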

Note also that if the node output is stochastic, responsibility then falls to the node designer to make sure the random seed is also included in the IO. This wouldn't be perfect because I guess there is some hardware- and cosmic-ray dependence, but it would let us store multiple copies of output for the same nominal input -- particularly important for, e.g., molecular dynamics.
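For the stochastic case, folding the seed into the hashed input is enough to keep runs with the same nominal input distinct in the instance DB. A tiny sketch (the `__seed__` key and the MD-style input dict are invented for illustration):

```python
import hashlib
import json

def hashed_io(inputs, seed=None):
    # For stochastic nodes, fold the random seed into the hashed input so
    # runs with different seeds register as distinct instances.
    payload = dict(inputs)
    if seed is not None:
        payload["__seed__"] = seed
    return hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode()
    ).hexdigest()

md_input = {"structure": "Al_fcc_108", "temperature": 300}
h1 = hashed_io(md_input, seed=1)
h2 = hashed_io(md_input, seed=2)
print(h1 != h2)  # True: same nominal input, two separately stored outputs
```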

These thoughts are neither complete nor, relative to the sketched figure above, particularly novel -- I just wanted to add some detail, as I understood it, following our very nice chat.

liamhuber commented 1 month ago

@JNmpi, @jan-janssen and I had a nice meeting on this topic this morning, and Joerg has content already pushed in the pyiron_nodes repo.

Here are my takeaways for the target spec:

@jan-janssen, @JNmpi did I miss any key points?

jan-janssen commented 1 month ago

When I read through the posts in this issue I wonder: are they connected? Compatible? In principle, the idea of a class database and an instance database seems to be missing from the recent summary. Maybe a concrete example can help: what is the minimal database implementation that would already benefit a specific use case? Take the fitting of machine learning potentials with different hyperparameters (e.g. number of basis functions), different training sets, and validation with tools like calculated energy-volume curves. That is basically a three-component workflow, and by hashing the different components the connections become clearer.
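To sketch how hashing could chain those three components together (all names and hyperparameters here are invented for illustration): each component's input includes the hash of the component it depends on, so changing a hyperparameter in the fit automatically changes the key under which the validation is cached.

```python
import hashlib
import json

def h(obj):
    # Short deterministic hash of a JSON-serialisable component description.
    return hashlib.sha256(
        json.dumps(obj, sort_keys=True).encode()
    ).hexdigest()[:12]

# Component 1: training set, identified by its own hash
training_set = {"name": "Al_dft_frames", "version": 3}

# Component 2: potential fit, whose input includes the training-set hash
fit_input = {"training_set": h(training_set), "n_basis": 12, "cutoff": 5.0}

# Component 3: validation (energy-volume curve), keyed by the fit's hash
validation_input = {"potential": h(fit_input), "task": "energy_volume_curve"}

# Changing a hyperparameter changes the fit hash, which propagates to the
# validation key -- cached validation results are never reused for a
# different potential.
fit_input_b = dict(fit_input, n_basis=16)
print(h(fit_input) == h(fit_input_b))  # False
```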

JNmpi commented 1 month ago

Thanks @liamhuber for nicely summarizing our discussion. A simple approach to marking which input is regarded as human-readable and can serve as an endpoint would be to mark certain dataclass definitions as compatible (in addition to simple data types such as strings, ints, etc.). An example would be a VASP INCAR datatype, whereas a dataclass storing atomistic structures or trajectories would be rejected. We should thus provide support for pyiron-extended dataclass objects that carry such internal (ontological) information and hints.
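One way this marking could look, sketched with an invented `hash_compatible` decorator and placeholder dataclasses (nothing here is existing pyiron API): simple types always pass, while dataclasses are accepted only if explicitly flagged.

```python
from dataclasses import dataclass, field

def hash_compatible(cls):
    # Hypothetical marker: flag a dataclass as a human-readable endpoint
    # whose fields may enter the instance-DB input hash.
    cls._hash_compatible = True
    return cls

def is_hash_compatible(obj):
    # Simple data types are always acceptable; dataclasses only if marked.
    if isinstance(obj, (str, int, float, bool)):
        return True
    return getattr(obj, "_hash_compatible", False)

@hash_compatible
@dataclass
class Incar:  # e.g. a VASP-INCAR-like settings container: human-readable
    encut: float = 520.0
    ismear: int = 0

@dataclass
class Trajectory:  # bulky structure/trajectory data: not an endpoint
    frames: list = field(default_factory=list)

print(is_hash_compatible(Incar()))       # True
print(is_hash_compatible(Trajectory()))  # False
```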

I also feel that in most practical use cases the complexity of node usage decreases with increasing level. For example, a top-level macro node will have only input and output, no cyclic workflows, etc. So I am pretty confident that in most practical cases we will only need hash-based node storage for "simple" workflows connecting high-level macros, which themselves can be highly complex. We will thus not have to worry about cyclicity, multiple outputs going into a single input, etc. I agree with @jan-janssen that real-world examples are crucial, and this is what I intended with the examples in the Jupyter notebook. They helped me a lot to better understand what structures/concepts we need.