pyiron / pyiron_workflow

Graph-and-node based workflows

💡 IO and for-loop roadmap #268

Closed liamhuber closed 1 month ago

liamhuber commented 3 months ago

As I work more with @JNmpi's #33 and the iter (and dataclass) functionality therein, I've been collecting some ideas about how to modify the way IO is defined on a node. I'll break these ideas into separate issues (or link existing issues), but I also wanted a sort of high-level summary to help clarify the way the ideas support each other. Concretely, I'd like to make the following changes:

-- BREAK (merge above, then proceed below, maybe after a pause for efficiency/scaling work) --

output_labels modification exclusively on classes

This is fairly simple. Right now, you can modify the output labels on each instance, e.g.

from pyiron_workflow import Workflow

renamed = Workflow.create.standard.UserInput(output_labels="my_label")
default = Workflow.create.standard.UserInput()
print(renamed.outputs.labels, default.outputs.labels)
>>> ['my_label'] ['user_input']

This has only changed the name of the output label for the instance renamed and hasn't changed the expected class IO signature for UserInput at all.

As of #266, for function nodes (children of AbstractFunction), it's no longer possible to do this -- output_labels can only be set when defining the class, and the interface naming scheme is then static for all instances of that class.

That means you can freely set them when using the Workflow.wrap_as.function_node(*output_labels) decorator or Workflow.create.Function(..., output_labels=None) class-creation interfaces, but then they're fixed.
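For example, both routes pin the labels at class-creation time. Here's a sketch -- Double/double are just illustrative names, and I'm assuming Workflow.create.Function takes the function plus output_labels as described above:

from pyiron_workflow import Workflow

# Decorator interface: labels fixed when the class is defined
@Workflow.wrap_as.function_node("doubled")
def Double(x: int) -> int:
    return 2 * x

# Class-creation interface: same idea without the decorator
def double(x: int) -> int:
    return 2 * x

AlsoDouble = Workflow.create.Function(double, output_labels="doubled")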

The advantage to this is that we can already peek at the IO at the class level:

from pyiron_workflow import Workflow

@Workflow.wrap_as.function_node("xplus1", "xminus1")
def PlusMinusBound0(x: int) -> tuple[int, int | None]:
    return x + 1, None if x - 1 < 0 else x - 1

print(PlusMinusBound0.preview_output_channels())
>>> {'xplus1': int, 'xminus1': int | None}

print(PlusMinusBound0.preview_input_channels())
>>> {'x': (<class 'int'>, NOT_DATA)}

This is critical for guided workflow design (à la ironflow), and it also helped simplify some code under the hood.

I would like to make a similar change to AbstractMacro.

No more maps

@samwaseda, when we talked after the pyiron meeting this week, I expressed my sadness at the unavoidability of the inputs_map and outputs_map for allowing power-users to modify existing macros. After giving it more thought, I'm pretty sure that we can get rid of them after all!

Since #265, AbstractMacro.graph_creator is a @classmethod (as is AbstractFunction.node_function). Combined with the idea above of guaranteeing that output_labels are strictly class and not instance features, this means a power-user can modify an existing macro by defining a new macro class that leverages the base class's .graph_creator. Concretely, on #265 I can now do this:

from pyiron_workflow import Workflow

@Workflow.wrap_as.macro_node("original")
def MyWorkflow(macro, x, y):
    macro.n1 = x + y
    macro.n2 = macro.n1 ** 2
    return macro.n2

@Workflow.wrap_as.macro_node("renamed", "new")
def ModifiedWorkflow(macro, x, y, z):
    # First, create the graph you already like
    MyWorkflow.graph_creator(macro, x, y)

    # Then modify it how you want
    macro.n1.disconnect_all()
    macro.remove_child(macro.n1)
    macro.n1 = x - y
    macro.n2.inputs.obj = macro.n1

    macro.n3 = macro.n2 * z

    return macro.n2, macro.n3

m = ModifiedWorkflow(x=1, y=2, z=3)
m()
>>> {'renamed': 1, 'new': 3}

This isn't quite ideal yet, but with a few more changes I am confident I can get it down to

@Workflow.wrap_as.macro_node("renamed", "new")
def ModifiedWorkflow(macro, x, y, z):
    MyWorkflow.graph_creator(macro, x, y)

    macro.replace_child(macro.n1, x - y)
    macro.n3 = macro.n2 * z

    return macro.n2, macro.n3

This doesn't offer identical functionality to being able to set inputs_map and outputs_map, but IMO it offers equivalent functionality in a more transparent and robust way.

Get rid of the maps entirely

At the same time, I'd like to get rid of the maps completely by removing them from Workflow too! This just means that you can't define shortcuts to IO at the workflow level and always need to use the fully-scoped name, e.g. wf(some_child__some_channel=42) instead of first setting wf.inputs_map = {"some_child__some_channel": "kurz"} and then calling wf(kurz=42). This is a price I'm willing to pay to remove the complexity from both the code and the user's head, but I'm not married to this part of the idea.
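In code, the difference for the user is just:

# Today, with maps: define a shortcut, then call with it
wf.inputs_map = {"some_child__some_channel": "kurz"}
wf(kurz=42)

# Proposed, without maps: always use the fully-scoped name
wf(some_child__some_channel=42)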

Don't auto-populate macro IO

Finally, the default right now is that if you don't use the function-like definition or output_labels for your macro, you get IO based on the unconnected children, i.e.

from pyiron_workflow import Workflow

@Workflow.wrap_as.macro_node()
def AutomaticMacro(macro):
    macro.n1 = Workflow.create.standard.UserInput(user_input=0)
    macro.n2 = Workflow.create.standard.UserInput(user_input=macro.n1)

auto = AutomaticMacro()
print(auto.inputs.labels, auto.outputs.labels)
>>> ['n1__user_input'] ['n2__user_input']

is equivalent to

from pyiron_workflow import Workflow

@Workflow.wrap_as.macro_node("n2__user_input")
def ExplicitMacro(macro, n1__user_input=0):
    macro.n1 = Workflow.create.standard.UserInput(user_input=n1__user_input)
    macro.n2 = Workflow.create.standard.UserInput(user_input=macro.n1)
    return macro.n2

explicit = ExplicitMacro()
print(explicit.inputs.labels, explicit.outputs.labels)
>>> ['n1__user_input'] ['n2__user_input']

I'd like to stop auto-populating things and force the macro definition to be explicit.

Cons:

Pros:

An aside on efficiency

Right now, when a macro has input arguments in its signature beyond the first macro: AbstractMacro item, we prepopulate the graph with a UserInput node for each signature item when building it. This works fine, and is necessary when that input gets bifurcated for use in multiple child nodes -- but if we require the function-signature approach to graph definition, there will be times when the input is used in only one place, and it's downright inefficient to stick an intermediate UserInput node in the way! The macro-level input can simply "value link" to the child node's input directly.
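As a sketch of the pruning idea (every attribute access below is a guess for illustration, not the actual implementation on my branch):

from pyiron_workflow import Workflow

def prune_single_use_inputs(macro):
    # Hypothetical pass: drop each UserInput child whose output feeds exactly
    # one downstream channel, and value-link the macro-level input directly
    for child in list(macro.children.values()):
        if isinstance(child, Workflow.create.standard.UserInput):
            connections = child.outputs.user_input.connections
            if len(connections) == 1:
                macro.inputs[child.label].value_receiver = connections[0]
                macro.remove_child(child)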

I already made a branch yesterday that takes care of this and purges such useless nodes at the end of the graph creation, so there's no big concern about efficiency. Unfortunately, while it went 99% smoothly, this feature interacts poorly with the combination of input maps and storage, so just a couple of tests fail where a workflow owns and reloads a macro. I am confident that adding this efficiency change back in will be possible after output_labels are class properties and inputs_map is gone.

Stop supporting self for function nodes

@JNmpi, when we had to stop ourselves from hijacking the pyiron meeting on Monday to talk about pyiron_workflow, you seemed to me to be expressing the idea that function nodes should be really stateless, and if someone wants state they should just write a function node to handle it and put the whole thing in a macro. I am 100% on board with this perspective -- let's really encourage function nodes to be functional!

To do this, I'd like to just get rid of support for self showing up in AbstractFunction.node_function functions entirely. It already breaks in some places that we need to work around, so it will feel good to remove it.

From an ideological and UX perspective, I really like this, because now at this point in the todo list function nodes are stateless and always wrap a function like def foo(x, y, z), and macro nodes are stateful and always wrap a function that basically has self in it like def bar(macro, a, b, c).
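The resulting pair, as a sketch (Add/AddMacro are invented examples):

from pyiron_workflow import Workflow

@Workflow.wrap_as.function_node("total")
def Add(x: int, y: int) -> int:
    # Stateless: a pure function, no self anywhere
    return x + y

@Workflow.wrap_as.macro_node("total")
def AddMacro(macro, x, y):
    # Stateful: `macro` plays the role of self
    macro.sum = x + y
    return macro.sum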

Data nodes

IMO, the one real downside to forcing users to explicitly define their node IO as part of the function signature/output labels is that it might get a bit verbose for nodes with lots of input -- this is especially true for macros.

@JNmpi in #33 has already been working with dataclasses to package together sets of related input. This allows sensible defaults to be provided, and lets developers build up input/output by composition using multiple inheritance. All great stuff. In the context of nodes, I see this making it more succinct to write the IO like this:

@Workflow.wrap_as.macro_node("result")
def MyNewMacro(macro, input_collection: Workflow.create.some_package.SomeDataNode.dataclass):
    macro.n1 = Workflow.create.some_package.SomeNode(input=input_collection)
    macro.n2 = Workflow.create.some_package.SomeOtherNode(
        macro.n1.output_collection.foo  # Leveraging node injection to grab a property off the output class
    )
    macro.n3 = input_collection.bar - macro.n2
    # Again, `input_collection` is actually the input node, and we use node injection to grab its `bar` attribute
    return macro.n3

Then even if the dataclass has lots of fields, we don't need to write them all out all the time.
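For instance, the sort of input dataclass I have in mind (field names invented for illustration):

from dataclasses import dataclass

@dataclass
class BaseInput:
    foo: int = 0  # sensible defaults come for free

@dataclass
class ExtraInput:
    bar: float = 1.0

@dataclass
class SomeInput(ExtraInput, BaseInput):
    # Composed via multiple inheritance, with its own field on top
    baz: str = "default"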

This idea is already discussed on #208.

For loops

Ok, so with all this in place we can get to the actual meat of the project which is facilitating clean, powerful, and robust for-loops. @JNmpi, you mentioned on Monday wanting to be able to pass nodes (or at least node classes) to other nodes, and I think it's the class-level-availability of IO signatures that is critical for this. Once we can do this and have SomeNodeClass.preview_input/output_channels() available for macro nodes like they already are for function nodes, we'll be able to pass classes to a constructor that dynamically defines a new class with corresponding IO!
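To sketch the idea (only preview_input_channels() is real here; the helper and its spec format are hypothetical):

def scattered_input_spec(loop_body_class, scattered_labels):
    # Hypothetical: derive the for-node's input spec from the body class's
    # class-level preview, before anything is instantiated
    spec = {}
    for label, (hint, default) in loop_body_class.preview_input_channels().items():
        if label in scattered_labels:
            spec[label.upper()] = (list[hint], None)  # scattered channels take lists
        else:
            spec[label] = (hint, default)  # broadcast channels pass straight through
    return spec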

The spec for a for-loop is then to have a creator like Function (NOT AbstractFunction) that creates a new child of AbstractMacro that...

That's a monstrous wall of text, so let me see if I can end with a for-loop syntax example:

from concurrent.futures import ThreadPoolExecutor
import numpy as np

from pyiron_workflow import Workflow
Workflow.register("some.atomistics.module", "atoms")

@Workflow.wrap_as.macro_node("energy")
def BulkLammps(macro, species, lattice, cubic):
    macro.bulk = Workflow.create.atoms.Bulk(species=species, cubic=cubic)
    macro.engine = Workflow.create.atoms.Lammps()
    macro.static = Workflow.create.atoms.Static(
        structure=macro.bulk, 
        engine=macro.engine
    )
    energy = macro.static.energy_pot  
    # Here I imagine that `Static` is returning an instance of some `StaticAtomisticsOutput`
    # and that it's a single value node, so above we're actually leveraging node injection 
    # to say `energy = Workflow.create.standard.GetAttr(macro.static.outputs.output, "energy_pot")`
    return energy

wf = Workflow("my_for_loop_example")
wf.calculation = Workflow.create.standard.ForNested(
    BulkLammps,
    SPECIES=["Al", "Cu"],  # Scattered
    LATTICE=np.linspace(2, 6, 100),  # Scattered
    cubic=True,  # Broadcast
    child_executor=ThreadPoolExecutor(max_workers=2),
)
# Then exploit node injection to operate on the for loop's dataframe
wf.plot_Al = wf.create.plotting.Plot(
    x=wf.calculation[wf.calculation["species"] == "Al"]["lattice"].values,
    y=wf.calculation[wf.calculation["species"] == "Al"]["energy"].values,
)
wf.plot_Cu = wf.create.plotting.Plot(
    x=wf.calculation[wf.calculation["species"] == "Cu"]["lattice"].values,
    y=wf.calculation[wf.calculation["species"] == "Cu"]["energy"].values,
)

wf()  # Run it

# Or again with more data and more power
wf.calculation.set_child_executor(ThreadPoolExecutor(max_workers=20))
wf(calculation__lattice=np.linspace(2, 6, 10000))

Or if we don't care about the workflow but just want to get a dataframe quickly, we could use the shortcut to say something like:

m = BulkLammps()
df = m.iter_nested(
    SPECIES=["Al", "Cu"],  # Scattered
    LATTICE=np.linspace(2, 6, 100),  # Scattered
    cubic=True,  # Broadcast
    child_executor=ThreadPoolExecutor(max_workers=2),
)  # Makes the new for node, runs it, and returns the dataframe

Linked issue #72

Conclusion

I've already started down this path with #265, #266, and the un-pushed work pruning off the unused macro IO nodes. I like this direction and will just keep hacking away at it until main and #33 are totally compliant with each other. I am very happy for any feedback on these topics!

liamhuber commented 2 months ago

The current (#276) for- and while-loop implementation relies on dynamically creating a new macro class in string form and then executing it. This works pretty well, but is absolutely wrecking me when it comes to type hints and defaults. These are readily available on the body classes (e.g. loop_body_class.preview_input_channels()); the problem is that reliably converting the code objects there into a string that executes back into those same objects is just beastly. I'd like to move away from this execute-a-string paradigm for the loops, but am not at that stage yet.
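A minimal illustration of the hint problem (not the actual loop-creation code):

import numpy as np

hint = np.ndarray | None
src = f"def body(x: {hint}): ..."
print(src)  # def body(x: numpy.ndarray | None): ...

try:
    exec(src)  # the annotation is evaluated when the def executes...
except NameError as e:
    print(e)  # ...and 'numpy' is not bound in this namespace, only 'np' is

# ...and hints like local classes or NOT_DATA defaults have no clean string form at all.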

liamhuber commented 2 months ago

@JNmpi and I chatted today. We came to a few simple conclusions about IO UX:

These are updated in the task list above.

We also discussed loop syntax and found some pseudocode that we both like where we create (and sometimes instantiate) new iterable classes, but also avoid the ALL CAPS non-pythonic formalism of ironflow:

foo.iter(a=bar.zip(), b=[4,5,6])

# Often used by users:
foo.Zip(
    a=[1,2,3], 
    v=[4,5,6]
).Iter(
    c=[1,2]
).run(d="broadcast")

# Under the hood:
MyFooZipIter = Foo.Zip(
    a=[1,2,3],
    v=[4,5,6]
).Iter(
    c=[1,2]
)
looped_foo = MyFooZipIter(a=[1,2,3], v=[4,5,6])
looped_foo.run(d="broadcast")

Thinking about the return values, we concluded that it's optimal to return only the dataframe and trust users to manage their own upstream data sources, rather than also returning the broadcast input and forcing them to always reference specific output channels. This is because 99% of the time they're going to be doing something simple, and we should make that easy:

# TWO RETURNS
wf.some_macro = SomeIter(some_kwarg=4, scattered_inp=[1,2,3,4])
wf.some_node = SomeNode(
    x=wf.some_macro.outputs.broadcast_input["some_kwarg"], 
    y=wf.some_macro.outputs.df["some_col"]
)

# ONE RETURN
wf.my_kwarg = Workflow.create.standard.UserInput(4)
wf.some_macro = SomeIter(some_kwarg=wf.my_kwarg, scattered_inp=[1,2,3,4])
wf.some_node = SomeNode(
    x=wf.my_kwarg,
    y=wf.some_macro["some_col"]
)

# The usual case where it doesn't matter
wf.some_macro = SomeIter(some_kwarg=4, scattered_inp=[1,2,3,4])
wf.some_node = SomeNode(y=wf.some_macro["some_col"])