pyiron / pyiron_contrib

User developments to extend and modify core pyiron functionality
https://pyiron.org
BSD 3-Clause "New" or "Revised" License

Node granularity - When do we consider a node to be macro? #764

Closed mbruns91 closed 1 year ago

mbruns91 commented 1 year ago

After our discussion on Monday, I was thinking more about the concept of node granularity, and I actually couldn't come up with a 'natural' way to define something I'd call an 'atomic node'. For me, an atomic node is a node which cannot be decomposed any further into sub-nodes. Taking the example from @JNmpi of a numpy linspace node (if I recall correctly),

import numpy as np

@node  # schematic: stands in for whatever node-defining decorator we settle on
def linspace(start: float = 0.0, stop: float = 1.0, num: int = 50) -> np.ndarray:
    return np.linspace(start, stop, num)

I'd suppose that's as atomic as we should reasonably go. One could, however, decompose this further by looking into numpy's internals, seeing how linspace is actually executed, creating a node for each of these steps and voilà, the numpy linspace node could also be constructed as a composite node (which is imho a better term than "macro node"). Thus we have to come up with definitions for the levels of node granularity and, in this context, for node types.
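(To make that concrete, a toy decomposition in plain numpy -- purely illustrative, not a proposal for actual node boundaries:)

import numpy as np

# Purely illustrative: the "atomic" linspace node re-expressed as a chain of
# even finer-grained steps (equivalent to np.linspace(start, stop, num) for num > 1)
def step_size(start: float, stop: float, num: int) -> float:
    return (stop - start) / (num - 1)

def indices(num: int) -> np.ndarray:
    return np.arange(num)

def scale_and_shift(idx: np.ndarray, step: float, start: float) -> np.ndarray:
    return start + idx * step

def linspace_as_composite(start: float, stop: float, num: int) -> np.ndarray:
    return scale_and_shift(indices(num), step_size(start, stop, num), start)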
I'd start with the terms I already used:

atomic_node: a node that cannot (or should not) be decomposed any further into sub-nodes
composite_node: a node that is itself composed of other nodes

The above example also illustrates that this will be important for node developers (and maybe node-store maintainers?), as it can easily happen that nodes which were considered atomic during development turn out to be composite. This can also change over time: when, e.g., more atomic nodes are developed, older, more complex 'atomic' nodes effectively become composite, or, in other terms, it becomes possible to express them as such. Perhaps we even need more than two node types, or more types of macro nodes (of which composite_node would then be a subclass)? What do you guys think about this?

niklassiemer commented 1 year ago

I absolutely agree!

I think we will end up with multiple levels. E.g., a Lammps node should probably feel atomic to nearly everyone, but internally it will have, at the very least, a Lammps 'input writer', a 'call Lammps executable', and an 'output parser'.

liamhuber commented 1 year ago

This is a good question! Right now we err on the side of extreme pragmatism: a node is a macro (i.e. Composite) iff it has its own internal graph structure, i.e. it owns a collection of other Node objects and executing it means executing these. Meanwhile an atomic node, or Function(Node), is one that has no such internal structure, but executes arbitrary python code.

So we have no concept of an atomic_node that is "unable" to be further decomposed, nor is the composite_node restricted to only be "composed of atomic nodes". Rather, by fiat a Function node executes python code, and a Composite node executes other nodes. (This includes nesting Composite nodes inside other Composite nodes, which facilitates powerful abstraction.)

So the question of "when do we consider a node to be macro" becomes extremely simple and has a technical answer, and the responsibility for deciding "should I implement this as a functional node or a macro node?" gets devolved to individual node developers. This also allows, from a node user perspective, for a particular node to easily switch from functional to composite under the hood without changing its public interface. I'm certainly open to criticism of this paradigm or alternatives, but am so far very happy with it.
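(A plain-Python caricature of this split -- not the actual workflow classes, just the idea:)

class FunctionNode:
    """Atomic flavour: wraps arbitrary python code, no internal graph."""
    def __init__(self, fn):
        self.fn = fn

    def run(self, *args, **kwargs):
        return self.fn(*args, **kwargs)

class CompositeNode:
    """Macro flavour: owns other nodes; running it means running its children."""
    def __init__(self, *children):
        self.children = list(children)  # children may themselves be composites

    def run(self, value):
        for child in self.children:  # a trivial linear "graph" for illustration
            value = child.run(value)
        return value

# A composite built from functions, nested inside another composite:
double = FunctionNode(lambda x: 2 * x)
plus_one = FunctionNode(lambda x: x + 1)
inner = CompositeNode(double, plus_one)
outer = CompositeNode(inner, double)
print(outer.run(3))  # ((3 * 2) + 1) * 2 = 14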

@niklassiemer's example is then spot-on: a first implementation of a "Lammps" node basically looks like a Function wrapper for a Lammps job. But we can readily imagine decomposing this into a macro that treats parsing and execution separately. That might be useful if you want to, e.g., extend the parser to treat an extra fix, so that you only need to interact with a minimal component of the overall lammps flow.

If you look at pyiron_contrib.workflow.node_library.atomistics you can already see a first crack at this, where we've decomposed the Lammps engine (a structure and potential) from execution (either MD or static) -- this contrasts with the existing pyiron implementation where we call job.calc_static() or similar right on the engine object. This is an imperfect split, as it's a bit awkward to have the structure already belonging to the engine (where it is used to restrict and facilitate a choice of potential), when the calculation (static or MD) can use any potential-species-compliant structure.

How to decompose an idea into the most useful Function components so that they give us the most power and clarity for recombination into Composite tools will be an art that takes practice, just like usual architecture choices.

JNmpi commented 1 year ago

Thanks for this interesting discussion. I fully agree that this is a fundamental issue for our new node-based workflow architecture. While the internal structure of an atomic and a composite node will be different, their user-facing behavior should be identical. Thus, in a very pragmatic way, in most cases the user should see absolutely no difference between using an atomic and a composite node. The main difference is that from a composite node a workflow can be obtained, i.e. via wf = my_composite_node.get_workflow(), whereas an atomic node would return None. Since the node behavior is identical, upgrading an atomic node to a composite one would have no impact on existing workflows. Thus, the main (only) impact of this definition will be on the construction side. There, again, we can be super pragmatic. A composite node would be defined directly via a workflow, i.e., define a workflow with nodes, connections, default parameters etc. and convert it into a composite node: my_composite_node = my_workflow.to_node(path='my_node_library').
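(A toy mock of that contract in plain Python -- none of these are the real pyiron classes, it just pins down the proposed behavior:)

class AtomicNode:
    def get_workflow(self):
        return None  # atomic: nothing to decompose

class CompositeNode:
    def __init__(self, workflow):
        self._workflow = workflow  # the graph it was built from

    def get_workflow(self):
        return self._workflow  # composite: hands back its internal workflow

class Workflow:
    def __init__(self, *nodes):
        self.nodes = list(nodes)

    def to_node(self, path=None):
        # 'path' hints at where the new composite would be registered
        return CompositeNode(self)

wf = Workflow(AtomicNode(), AtomicNode())
my_composite_node = wf.to_node(path='my_node_library')
assert my_composite_node.get_workflow() is wf
assert AtomicNode().get_workflow() is None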

mbruns91 commented 1 year ago

Thanks for your detailed replies to my reasoning. I had already anticipated that answering the opening question ("When do we consider a node to be macro?") boils down to a simple technical statement ("any node that is not atomic, i.e. not a Function(Node)"). I believe this to be a strength, as it spares us from having to come up with definitions for such concepts, which would just end in more or less philosophical discussions about higher-level semantics.

This also allows, from a node user perspective, for a particular node to easily switch from functional to composite under the hood without changing its public interface. I'm certainly open to criticism of this paradigm or alternatives, but am so far very happy with it.

This is also a strength in my opinion. @JNmpi has detailed this in his reply, and @niklassiemer and @liamhuber already hinted at it: in the end, we can represent whole, complex workflows as single nodes, and for the user it will be as simple as using any Function(Node); it may just need (maybe a lot) more input than a typical atomic node.
There is just one issue I see coming: when we actually have composite nodes created from composite nodes, which are in turn created from composite nodes, and so on, things can easily become quite opaque and complicated cross-dependencies can arise. When an important atomic_node is changed, this can trigger an avalanche of failing composite_nodes.

Of course we could say that it is the responsibility of workflow and node developers to avoid such things. But the concept of coarse-graining nodes is super attractive because it (as @liamhuber wrote)

facilitates powerful abstraction

and it might be tempting to abstract all ugly code off into nirvana.

So my follow-up question would be: How can we ensure that highly abstract composite_nodes do not become inscrutable black boxes? I guess an easy way would be to require detailed docstrings (or something like that). An even better solution would be to provide a composite_node.decompose(file='decomposed_node.py', **kwargs) method, which outputs the very same node (but decomposed!) to a python file, which could then be loaded and used just like the composite_node itself, or viewed to better understand what's going on under the hood. One could even come up with a keyword argument granularity_level (or however you want to call it) which can be used to control the level of decomposition (e.g. granularity_level='next' just decomposes down one level while granularity_level='atomic' decomposes everything down to the scale of atomic_nodes). This way, we could stick to the possibility of arbitrary abstraction while still ensuring transparency.

And even more: when a decomposed representation is exported to some .py file, a user can super easily try to customize complex composite_nodes, which could come in handy.
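(A rough, hypothetical sketch of the granularity-control part of this idea -- it returns a nested view instead of writing a .py file, and none of these names exist in the codebase:)

class Node:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)  # empty for atomic nodes

    def decompose(self, granularity_level='atomic'):
        """Expose internal structure: 'next' expands one level,
        'atomic' expands fully, an integer expands that many levels."""
        if granularity_level == 'next':
            depth = 1
        elif granularity_level == 'atomic':
            depth = -1  # no limit
        else:
            depth = int(granularity_level)
        return self._expand(depth)

    def _expand(self, depth):
        if depth == 0 or not self.children:
            return self.name
        return {self.name: [c._expand(depth - 1) for c in self.children]}

lammps = Node("lammps", [Node("write_input"), Node("run_executable"), Node("parse_output")])
print(lammps.decompose('next'))
# {'lammps': ['write_input', 'run_executable', 'parse_output']}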

liamhuber commented 1 year ago

There is just one issue I see coming: when we actually have composite nodes created from composite nodes, which are in turn created from composite nodes, and so on, things can easily become quite opaque and complicated cross-dependencies can arise. When an important atomic_node is changed, this can trigger an avalanche of failing composite_nodes.

This is absolutely a concern. For the "wild west" of node packages, I think there's really not much we can do and it's simply up to individuals.

For any sort of official "node store", Joerg and I (and others?) have talked about having per-node version control. Inside the scope of a "store" that we manage, strict versioning could be a requirement. In this way, downstream workflows can specify the version of the upstream nodes they use, to prevent upstream updates from irrevocably or accidentally breaking things. Of course, staying up-to-date with the latest upstream nodes still becomes the node developers' responsibility, but at least it would provide an environment where stable things stay stable and updates are methodical.

We don't have any particular technical implementation in mind for this version control yet. One nice feature would be if node developers could even specify dependencies on multiple versions of the same package, i.e. I could have in my macro both the Foo node from foobar v0.1.3 and the Foo node from foobar v1.2.0! This will probably not be a trivial or usual thing, since it's possible that these foobar packages have incommensurate conda requirements. However, tools like Snakemake allow different envs for different nodes, so it's not impossible.
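(Purely as a sketch of what such per-node pinning could look like inside a macro definition -- nothing like this exists yet and all names are invented:)

# Two versions of the same upstream package pinned inside one macro,
# each resolved in its own environment (Snakemake-style per-node envs).
MACRO_NODE_REQUIREMENTS = {
    "coarse_relax": {"package": "foobar", "node": "Foo",
                     "version": "0.1.3", "env": "foobar-0.1.3"},
    "final_relax":  {"package": "foobar", "node": "Foo",
                     "version": "1.2.0", "env": "foobar-1.2.0"},
}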

So my follow-up question would be: How can we ensure that highly abstract composite_nodes do not become inscrutable black boxes?

To some degree you're right: the more deeply we nest workflows and abstract things, the more black-box the workflows will become. As we move to support more experimentalists and even industry, I suspect we'll get fewer and fewer complaints about that... For instance, look at the prevailing (though not universal) acceptance of deep neural networks as wonderful black-box solutions. (Not everyone feels that this need be the case, so I will take this opportunity to plug Christopher Olah's excellent blog, e.g. this article.)

Of course for those who do care to understand the innards, we need to offer useful tools.

I guess an easy way would be to require detailed docstrings (or something like that). An even better solution would be to provide a composite_node.decompose(file='decomposed_node.py', **kwargs) method, which outputs the very same node (but decomposed!) to a python file, which could then be loaded and used just like the composite_node itself, or viewed to better understand what's going on under the hood.

Absolutely. With an actual Workflow instance, the person who instantiated it and added nodes to it should already know what's going on, but they have the wf.nodes list to look at and we'll give wf.visualize soon.

Joerg mentioned Workflow.to_node above -- it's very much my vision that such a method produces human-readable python code, namely a sub-class of Macro(Composite) that defines which child nodes should be instantiated, their connections, and their initial data values. I'm not sure this is exactly what you have in mind, but it sounds similar.

In this way, whenever someone is working with a Composite(Node) it is...

One could even come up with a keyword argument granularity_level (or however you want to call it) which can be used to control the level of decomposition (e.g. granularity_level='next' just decomposes down one level while granularity_level='atomic' decomposes everything down to the scale of atomic_nodes). This way, we could stick to the possibility of arbitrary abstraction while still ensuring transparency.

I really like this idea as a kwarg for the visualize method -- you could keep (potentially nested) macros condensed, or expand them out -- maybe encapsulated by a box or something. Of course one could always look at wf.nodes.my_macro_node.node.sub_macro_node.visualize() to peek at the innards of a macro two levels down, but maybe it's easier to just write wf.visualize(granularity=2)!

And even more: when a decomposed representation is exported to some .py file, a user can super easily try to customize complex composite_nodes, which could come in handy.

Absolutely. I can imagine a couple of routes here. One could do it live in a jupyter notebook by converting a macro to a workflow and modifying it:

wf = SomeMacroClass().as_workflow()
wf.replacement_node = MyReplacementNode()
# Then make connections -- 
# if my replacement has the same IO labels I might just write
for label, old_input in wf.node_i_dont_like.inputs.items():
    for con in old_input.connections:
        wf.replacement_node.inputs[label].connect(con)

for label, old_output in wf.node_i_dont_like.outputs.items():
    for con in old_output.connections:
        wf.replacement_node.outputs[label].connect(con)

# Then remove the node I'm replacing
wf.node_i_dont_like.remove()  # Disconnects and deletes

# Finally, save the updated guy as a new macro
SomeBetterMacroClass = wf.to_node("SomeBetterMacroClass")

Alternatively, I could look at the source code for the macro class, copy it to a .py file (e.g. with an export method), and modify the nodes/connections/default data right there in my IDE, then import the new modified macro.

JNmpi commented 1 year ago

One more thought regarding (fully automated) testing of whether a new version of a node will break things. Having a node store and a database that contains not only the input and output of the workflows but also those of the individual nodes they are composed of will allow us to extract input and output parameters of nodes and run (automated) tests to check that the new version does not break them. The idea is that if the new version does not break the input-output relation for existing applications, it will also work for new applications. This approach can also be extended to run and test entire workflows (which, in contrast to the Jupyter notebooks that contain the workflow in our present pyiron implementation, are part of the database). The new node-based structure offers really nice opportunities to make our lives easier, also in terms of testing etc.
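(A rough sketch of what such an automated input/output regression check could look like -- the helper names are invented and the node is assumed to expose a run(**inputs) call:)

import math

def check_new_node_version(node_factory, stored_cases, rtol=1e-8):
    """Replay stored (inputs, expected_outputs) pairs, e.g. pulled from the
    database, against a freshly built node of the new version."""
    failures = []
    for inputs, expected in stored_cases:
        outputs = node_factory().run(**inputs)  # assumed node interface
        for key, expected_value in expected.items():
            got = outputs[key]
            # simplified comparison; real outputs may need array-aware checks
            ok = (math.isclose(got, expected_value, rel_tol=rtol)
                  if isinstance(got, float) else got == expected_value)
            if not ok:
                failures.append((inputs, key, got, expected_value))
    return failures  # empty list -> new version reproduces old behaviour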

srmnitc commented 1 year ago

The simplest solution for a node store could be to ship nodes as conda packages. That way we would get version control, dependencies, and so on, just as we have for packages currently. We would need to use a custom channel, and then we could simply have an env file or install nodes like conda install -c pyiron-node-store custom-node. We would be reinventing something like the conda-forge channel a bit...

mbruns91 commented 1 year ago

Decomposition of complex nodes

I really like this idea as a kwarg for the visualize method -- you could keep (potentially nested) macros condensed, or expand them out -- maybe encapsulated by a box or something. Of course one could always look at wf.nodes.my_macro_node.node.sub_macro_node.visualize() to peek at the innards of a macro two levels down, but maybe it's easier to just write wf.visualize(granularity=2)!

That would be super nice. Having a method to visually decompose complex nodes is user friendly and - sufficient experience assumed - makes it super easy to understand qualitatively what is happening with the input. Translating the visual decomposition into human-readable python code would then be the icing on the cake.

Node deployment and version-control

I really like the idea of encapsulating nodes. After all the discussion here, it conceptually just makes sense not to differentiate between a node and a whole workflow, so both should be able to run individually (when provided with "correct" input) in a given environment. Using conda and a dedicated channel would be one solution, I guess.

Another one that comes to my mind is to containerize nodes. A workflow would then be a network of containers, and the corresponding graph basically shows the container network topology. With container images we can ensure a high level of version control for workflow developers. The nodes, a database storing inputs and outputs, and any other "service" we want to run alongside an executing workflow could then be set up within this container network. There is also a way to allow multi-host networking for a Docker network; this way, a node which requires certain resources could run directly on a machine providing them (I'm thinking of HPC). So far I'm only really familiar with Docker. There are also solutions like Singularity, which doesn't require sudo privileges or membership in a group associated with the container software. Maybe Kubernetes clusters are also suitable for our purpose, but I need to take a closer look.

niklassiemer commented 1 year ago

For (larger) composite nodes, a container version could be reasonable :) For small or atomic nodes this sounds like overkill (spawning a container for a linspace or a structure creation)... As such, I think we will probably need other means of version control for nodes in general.