pyiron / pyiron_workflow

Graph-and-node based workflows

💡 IO and for-loop roadmap #268

Closed liamhuber closed 1 month ago

liamhuber commented 3 months ago

As I work more with @JNmpi's #33 and the iter (and dataclass) functionality therein, I've been collecting some ideas about how to modify the way IO is defined on a node. I'll break these ideas into separate issues (or link existing issues), but I also wanted a sort of high-level summary to help clarify the way the ideas support each other. Concretely, I'd like to make the following changes:

-- BREAK (merge above, then proceed below, maybe after a pause for efficiency/scaling work) --

output_labels modification exclusively on classes

This is fairly simple. Right now, you can modify the output labels on each instance, e.g.

from pyiron_workflow import Workflow

renamed = Workflow.create.standard.UserInput(output_labels="my_label")
default = Workflow.create.standard.UserInput()
print(renamed.outputs.labels, default.outputs.labels)
>>> ['my_label'] ['user_input']

This has only changed the name of the output label for the instance renamed and hasn't changed the expected class IO signature for UserInput at all.

As of #266, for function nodes (children of AbstractFunction), it's no longer possible to do this -- output_labels can only be set when defining the class, and the interface naming scheme is then static for all instances of that class.

That means you can freely set them when using the Workflow.wrap_as.function_node(*output_labels) decorator or Workflow.create.Function(..., output_labels=None) class-creation interfaces, but then they're fixed.
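For example, both routes pin the labels at class-creation time. Here's a sketch -- Double/double are just illustrative names, and I'm assuming Workflow.create.Function takes the function plus output_labels as described above:

from pyiron_workflow import Workflow

# Decorator interface: labels fixed when the class is defined
@Workflow.wrap_as.function_node("doubled")
def Double(x: int) -> int:
    return 2 * x

# Class-creation interface: same idea without the decorator
def double(x: int) -> int:
    return 2 * x

AlsoDouble = Workflow.create.Function(double, output_labels="doubled")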

The advantage to this is that we can already peek at the IO at the class level:

from pyiron_workflow import Workflow

@Workflow.wrap_as.function_node("xplus1", "xminus1")
def PlusMinusBound0(x: int) -> tuple[int, int | None]:
    return x + 1, None if x - 1 < 0 else x - 1

print(PlusMinusBound0.preview_output_channels())
>>> {'xplus1': int, 'xminus1': int | None}

print(PlusMinusBound0.preview_input_channels())
>>> {'x': (<class 'int'>, NOT_DATA)}

This is critical for guided workflow design (à la ironflow), and it also helped simplify some code under the hood.

I would like to make a similar change to AbstractMacro.

No more maps

@samwaseda, when we talked after the pyiron meeting this week, I expressed my sadness at the unavoidability of the inputs_map and outputs_map for allowing power-users to modify existing macros. After giving it more thought, I'm pretty sure that we can get rid of them after all!

Since #265, AbstractMacro.graph_creator is a @classmethod (as is AbstractFunction.node_function). Combined with the idea above of guaranteeing that output_labels are strictly class and not instance features, this means a power-user can modify an existing macro by defining a new macro class that leverages the base class's .graph_creator. Concretely, on #265 I can now do this:

from pyiron_workflow import Workflow

@Workflow.wrap_as.macro_node("original")
def MyWorkflow(macro, x, y):
    macro.n1 = x + y
    macro.n2 = macro.n1 ** 2
    return macro.n2

@Workflow.wrap_as.macro_node("renamed", "new")
def ModifiedWorkflow(macro, x, y, z):
    # First, create the graph you already like
    MyWorkflow.graph_creator(macro, x, y)

    # Then modify it how you want
    macro.n1.disconnect_all()
    macro.remove_child(macro.n1)
    macro.n1 = x - y
    macro.n2.inputs.obj = macro.n1

    macro.n3 = macro.n2 * z

    return macro.n2, macro.n3

m = ModifiedWorkflow(x=1, y=2, z=3)
m()
>>> {'renamed': 1, 'new': 3}

This isn't quite ideal yet, but with a few more changes I am confident I can get it down to

@Workflow.wrap_as.macro_node("renamed", "new")
def ModifiedWorkflow(macro, x, y, z):
    MyWorkflow.graph_creator(macro, x, y)

    macro.replace_child(macro.n1, x - y)
    macro.n3 = macro.n2 * z

    return macro.n2, macro.n3

This doesn't offer identical functionality to being able to set inputs_map and outputs_map, but IMO it offers equivalent functionality in a more transparent and robust way.

Get rid of the maps entirely

At the same time, I'd like to get rid of the maps completely by removing them from Workflow too! This just means that you can't define shortcuts to IO at the workflow level and always need to use the fully-scoped name, e.g. wf(some_child__some_channel=42) instead of first setting wf.inputs_map = {"some_child__some_channel": "kurz"} and then calling wf(kurz=42). This is a price I'm willing to pay to remove the complexity from both the code and the user's head, but I'm not married to this part of the idea.
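In code, the difference for the user is just:

# Today, with maps: define a shortcut, then call with it
wf.inputs_map = {"some_child__some_channel": "kurz"}
wf(kurz=42)

# Proposed, without maps: always use the fully-scoped name
wf(some_child__some_channel=42)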

Don't auto-populate macro IO

Finally, the default right now is that if you don't use the function-like definition or output_labels for your macro, you get IO based on the unconnected children, i.e.

from pyiron_workflow import Workflow

@Workflow.wrap_as.macro_node()
def AutomaticMacro(macro):
    macro.n1 = Workflow.create.standard.UserInput(user_input=0)
    macro.n2 = Workflow.create.standard.UserInput(user_input=macro.n1)

auto = AutomaticMacro()
print(auto.inputs.labels, auto.outputs.labels)
>>> ['n1__user_input'] ['n2__user_input']

is equivalent to

from pyiron_workflow import Workflow

@Workflow.wrap_as.macro_node("n2__user_input")
def ExplicitMacro(macro, n1__user_input=0):
    macro.n1 = Workflow.create.standard.UserInput(user_input=n1__user_input)
    macro.n2 = Workflow.create.standard.UserInput(user_input=macro.n1)
    return macro.n2

explicit = ExplicitMacro()
print(explicit.inputs.labels, explicit.outputs.labels)
>>> ['n1__user_input'] ['n2__user_input']

I'd like to stop auto-populating things and force the macro definition to be explicit.

Cons:

Pros:

An aside on efficiency

Right now, when a macro has input arguments in its signature beyond the first macro: AbstractMacro item, we prepopulate the graph with a UserInput node for each signature item when building it. This works fine, and is necessary when that input gets bifurcated for use in multiple child nodes -- but if we require the function-signature approach to graph definition, there will be times when the input is used in only one place, and it's downright inefficient to stick an intermediate UserInput node in the way! The macro-level input can simply "value link" to the child node's input directly.
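As a sketch of the pruning idea (every attribute access below is a guess for illustration, not the actual implementation on my branch):

from pyiron_workflow import Workflow

def prune_single_use_inputs(macro):
    # Hypothetical pass: drop each UserInput child whose output feeds exactly
    # one downstream channel, and value-link the macro-level input directly
    for child in list(macro.children.values()):
        if isinstance(child, Workflow.create.standard.UserInput):
            connections = child.outputs.user_input.connections
            if len(connections) == 1:
                macro.inputs[child.label].value_receiver = connections[0]
                macro.remove_child(child)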

I already made a branch yesterday that takes care of this and purges such useless nodes at the end of the graph creation, so there's no big concern about efficiency. Unfortunately, while it went 99% smoothly, this feature interacts poorly with the combination of input maps and storage, so just a couple of tests fail where a workflow owns and reloads a macro. I am confident that adding this efficiency change back in will be possible after output_labels are class properties and inputs_map is gone.

Stop supporting self for function nodes

@JNmpi, when we had to stop ourselves from hijacking the pyiron meeting on Monday to talk about pyiron_workflow, you seemed to me to be expressing the idea that function nodes should be really stateless, and if someone wants state they should just write a function node to handle it and put the whole thing in a macro. I am 100% on board with this perspective -- let's really encourage function nodes to be functional!

To do this, I'd like to just get rid of support for self showing up in AbstractFunction.node_function functions entirely. It already breaks in some places that we need to work around, so it will feel good to remove it.

From an ideological and UX perspective, I really like this, because now at this point in the todo list function nodes are stateless and always wrap a function like def foo(x, y, z), and macro nodes are stateful and always wrap a function that basically has self in it like def bar(macro, a, b, c).
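The resulting pair, as a sketch (Add/AddMacro are invented examples):

from pyiron_workflow import Workflow

@Workflow.wrap_as.function_node("total")
def Add(x: int, y: int) -> int:
    # Stateless: a pure function, no self anywhere
    return x + y

@Workflow.wrap_as.macro_node("total")
def AddMacro(macro, x, y):
    # Stateful: `macro` plays the role of self
    macro.sum = x + y
    return macro.sum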

Data nodes

IMO, the one real downside to forcing users to explicitly define their node IO as part of the function signature/output labels is that it might get a bit verbose for nodes with lots of input -- this is especially true for macros.

@JNmpi in #33 has already been working with dataclasses to package together sets of related input. This allows sensible defaults to be provided, and lets developers build up input/output by composition using multiple inheritance. All great stuff. In the context of nodes, I see this making it more succinct to write the IO like this:

@Workflow.wrap_as.macro_node("result")
def MyNewMacro(macro, input_collection: Workflow.create.some_package.SomeDataNode.dataclass):
    macro.n1 = Workflow.create.some_package.SomeNode(input=input_collection)
    macro.n2 = Workflow.create.some_package.SomeOtherNode(
        macro.n1.output_collection.foo  # Leveraging node injection to grab a property off the output class
    )
    macro.n3 = input_collection.bar - macro.n2
    # Again, `input_collection` is actually the input node, and we use node injection to grab its `bar` attribute
    return macro.n3

Then even if the dataclass has lots of fields, we don't need to write them all out all the time.
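For instance, the sort of input dataclass I have in mind (field names invented for illustration):

from dataclasses import dataclass

@dataclass
class BaseInput:
    foo: int = 0  # sensible defaults come for free

@dataclass
class ExtraInput:
    bar: float = 1.0

@dataclass
class SomeInput(ExtraInput, BaseInput):
    # Composed via multiple inheritance, with its own field on top
    baz: str = "default"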

This idea is already discussed on #208.

For loops

Ok, so with all this in place we can get to the actual meat of the project which is facilitating clean, powerful, and robust for-loops. @JNmpi, you mentioned on Monday wanting to be able to pass nodes (or at least node classes) to other nodes, and I think it's the class-level-availability of IO signatures that is critical for this. Once we can do this and have SomeNodeClass.preview_input/output_channels() available for macro nodes like they already are for function nodes, we'll be able to pass classes to a constructor that dynamically defines a new class with corresponding IO!
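To sketch the idea (only preview_input_channels() is real here; the helper and its spec format are hypothetical):

def scattered_input_spec(loop_body_class, scattered_labels):
    # Hypothetical: derive the for-node's input spec from the body class's
    # class-level preview, before anything is instantiated
    spec = {}
    for label, (hint, default) in loop_body_class.preview_input_channels().items():
        if label in scattered_labels:
            spec[label.upper()] = (list[hint], None)  # scattered channels take lists
        else:
            spec[label] = (hint, default)  # broadcast channels pass straight through
    return spec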

The spec for a for-loop is then to have a creator like Function (NOT AbstractFunction) that creates a new child of AbstractMacro that...

That's a monstrous wall of text, so let me see if I can end with a for-loop syntax example:

from concurrent.futures import ThreadPoolExecutor
import numpy as np

from pyiron_workflow import Workflow
Workflow.register("some.atomistics.module", "atoms")

@Workflow.wrap_as.macro_node("energy")
def BulkLammps(macro, species, lattice, cubic):
    macro.bulk = Workflow.create.atoms.Bulk(species=species, cubic=cubic)
    macro.engine = Workflow.create.atoms.Lammps()
    macro.static = Workflow.create.atoms.Static(
        structure=macro.bulk, 
        engine=macro.engine
    )
    energy = macro.static.energy_pot  
    # Here I imagine that `Static` is returning an instance of some `StaticAtomisticsOutput`
    # and that it's a single value node, so above we're actually leveraging node injection 
    # to say `energy = Workflow.create.standard.GetAttr(macro.static.outputs.output, "energy_pot")`
    return energy

wf = Workflow("my_for_loop_example")
wf.calculation = Workflow.create.standard.ForNested(
    BulkLammps,
    SPECIES=["Al", "Cu"],  # Scattered
    LATTICE=np.linspace(2, 6, 100),  # Scattered
    cubic=True,  # Broadcast
    child_executor=ThreadPoolExecutor(max_workers=2),
)
# Then exploit node injection to operate on the for loop's dataframe
wf.plot_Al = wf.create.plotting.Plot(
    x=wf.calculation[wf.calculation["species"] == "Al"]["lattice"].values,
    y=wf.calculation[wf.calculation["species"] == "Al"]["energy"].values,
)
wf.plot_Cu = wf.create.plotting.Plot(
    x=wf.calculation[wf.calculation["species"] == "Cu"]["lattice"].values,
    y=wf.calculation[wf.calculation["species"] == "Cu"]["energy"].values,
)

wf()  # Run it

# Or again with more data and more power
wf.calculation.set_child_executor(ThreadPoolExecutor(max_workers=20))
wf(calculation__lattice=np.linspace(2, 6, 10000))

Or if we don't care about the workflow but just want to get a dataframe quickly, we could use the shortcut to say something like:

m = BulkLammps()
df = m.iter_nested(
    SPECIES=["Al", "Cu"],  # Scattered
    LATTICE=np.linspace(2, 6, 100),  # Scattered
    cubic=True,  # Broadcast
    child_executor=ThreadPoolExecutor(max_workers=2),
)  # Makes the new for node, runs it, and returns the dataframe

Linked issue #72

Conclusion

I've already started down this path with #265, #266, and the un-pushed work pruning off the unused macro IO nodes. I like this direction and will just keep hacking away at it until main and #33 are totally compliant with each other. I am very happy for any feedback on these topics!

liamhuber commented 2 months ago

The current (#276) for- and while-loop implementation relies on dynamically creating a new macro class in string form and then executing it. This works pretty well, but is absolutely wrecking me when it comes to type hints and defaults. These are readily available on the body classes (e.g. loop_body_class.preview_input_channels()); the problem is that reliably converting the code objects there into a string that executes back into those same objects is just beastly. I'd like to move away from this execute-a-string paradigm for the loops, but am not at that stage yet.
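A minimal illustration of the hint problem (not the actual loop-creation code):

import numpy as np

hint = np.ndarray | None
src = f"def body(x: {hint}): ..."
print(src)  # def body(x: numpy.ndarray | None): ...

try:
    exec(src)  # the annotation is evaluated when the def executes...
except NameError as e:
    print(e)  # ...and 'numpy' is not bound in this namespace, only 'np' is

# ...and hints like local classes or NOT_DATA defaults have no clean string form at all.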

liamhuber commented 2 months ago

@JNmpi and I chatted today. We came to a few simple conclusions about IO UX:

These are updated in the task list above.

We also discussed loop syntax and found some pseudocode that we both like where we create (and sometimes instantiate) new iterable classes, but also avoid the ALL CAPS non-pythonic formalism of ironflow:

foo.iter(a=bar.zip(), b=[4,5,6])

# Often used by users:
foo.Zip(
    a=[1,2,3], 
    v=[4,5,6]
).Iter(
    c=[1,2]
).run(d="broadcast")

# Under the hood:
MyFooZipIter = Foo.Zip(
    a=[1,2,3],
    v=[4,5,6]
).Iter(
    c=[1,2]
)
looped_foo = MyFooZipIter(a=[1,2,3], v=[4,5,6])
looped_foo.run(d="broadcast")

Thinking about the return values, we concluded that it's optimal to return only the dataframe and trust users to manage their own upstream data sources, rather than also returning the broadcast input and forcing them to always reference specific output channels. This is because 99% of the time they're going to be doing something simple, and we should make that easy:

# TWO RETURNS
wf.some_macro = SomeIter(some_kwarg=4, scattered_inp=[1,2,3,4])
wf.some_node = SomeNode(
    x=wf.some_macro.outputs.broadcast_input["some_kwarg"], 
    y=wf.some_macro.outputs.df["some_col"]
)

# ONE RETURN
wf.my_kwarg = Workflow.create.standard.UserInput(4)
wf.some_macro = SomeIter(some_kwarg=wf.my_kwarg, scattered_inp=[1,2,3,4])
wf.some_node = SomeNode(
    x=wf.my_kwarg,
    y=wf.some_macro["some_col"]
)

# The usual case where it doesn't matter
wf.some_macro = SomeIter(some_kwarg=4, scattered_inp=[1,2,3,4])
wf.some_node = SomeNode(y=wf.some_macro["some_col"])