Closed liamhuber closed 6 months ago
The current (#276) for- and while-loop implementation relies on dynamically creating a new macro class in string form and then executing it. This works pretty well, but is absolutely wrecking me when it comes to type hints and defaults. These are readily available on the body classes (e.g. `loop_body_class.preview_input_channels()`); the problem is that reliably converting the code objects there into a string that will then execute back into those objects is just beastly. I'd like to move away from this execute-a-string paradigm for the loops, but am not at that stage yet.
@JNmpi and I chatted today. We came to a few simple conclusions about IO UX:

- Scrape output labels from the returns of `graph_creator` like you do with `Function.node_function`, but strip off `self.` to form the labels (or `macro.` or `wf.`, or whatever the first argument is called), as this handles most cases
- Make `self` the canonical first-argument for `graph_creator` in the tests, but leave `wf` around in the notebooks sometimes, as it helps tell the narrative of converting a workflow into a macro
- Keep `Workflow` IO maps -- we can get rid of them later if it turns out people don't need them

These are updated in the task list above.
We also discussed loop syntax and found some pseudocode that we both like, where we create (and sometimes instantiate) new iterable classes, but also avoid the ALL CAPS non-pythonic formalism of ironflow:

```python
foo.iter(a=bar.zip(), b=[4, 5, 6])

# Often used by users:
foo.Zip(
    a=[1, 2, 3],
    v=[4, 5, 6]
).Iter(
    c=[1, 2]
).run(d="broadcast")

# Under the hood:
MyFooZipIter = Foo.Zip(
    a=[1, 2, 3],
    v=[4, 5, 6]
).Iter(
    c=[1, 2]
)
looped_foo = MyFooZipIter(a=[1, 2, 3], v=[3, 4, 5])
looped_foo.run(d="broadcast")
```
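The chained `Zip`/`Iter` classmethods could build these new classes with `type()` rather than by executing a generated string; here is a stdlib-only sketch of that idea (all names are illustrative, not the real pyiron_workflow API):

```python
class Node:
    _zipped: dict = {}
    _nested: dict = {}

    @classmethod
    def Zip(cls, **zipped):
        # Dynamically build a subclass that remembers which channels are zipped
        return type(f"{cls.__name__}Zip", (cls,), {"_zipped": zipped})

    @classmethod
    def Iter(cls, **nested):
        # ...and which channels are iterated (nested) over
        return type(f"{cls.__name__}Iter", (cls,), {"_nested": nested})

    def run(self, **broadcast):
        # Zip the zipped channels together, nest the iterated ones,
        # and broadcast everything else to every row
        rows = []
        for zipped_values in zip(*self._zipped.values()):
            row = dict(zip(self._zipped, zipped_values))
            for label, values in self._nested.items():
                for value in values:
                    rows.append({**row, label: value, **broadcast})
        return rows

class Foo(Node):
    pass

MyFooZipIter = Foo.Zip(a=[1, 2, 3], v=[4, 5, 6]).Iter(c=[1, 2])
rows = MyFooZipIter().run(d="broadcast")
print(len(rows))  # 3 zipped rows x 2 iterated values = 6 dataframe rows
```

Each classmethod returns a brand-new subclass, so the loop spec lives on the class while the actual values stay on the instance.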
Thinking about the return values, we concluded that it's optimal to return only the dataframe and trust users to manage their own upstream data sources, rather than returning the broadcast input and forcing them to always reference specific channels. This is because 99% of the time they're going to be doing something simple, and we should make that easy:
```python
# TWO RETURNS
wf.some_macro = SomeIter(some_kwarg=4, scattered_inp=[1, 2, 3, 4])
wf.some_node = SomeNode(
    x=wf.some_macro.outputs.broadcast_input["some_kwarg"],
    y=wf.some_macro.outputs.df["some_col"]
)

# ONE RETURN
wf.my_kwarg = Workflow.create.standard.UserInput(4)
wf.some_macro = SomeIter(some_kwarg=wf.my_kwarg, scattered_inp=[1, 2, 3, 4])
wf.some_node = SomeNode(
    x=wf.my_kwarg,
    y=wf.some_macro["some_col"]
)

# The usual case where it doesn't matter
wf.some_macro = SomeIter(some_kwarg=4, scattered_inp=[1, 2, 3, 4])
wf.some_node = SomeNode(y=wf.some_macro["some_col"])
```
As I work more with @JNmpi's #33 and the `iter` (and dataclass) functionality therein, I've been collecting some ideas about how to modify the way IO is defined on a node. I'll break these ideas into separate issues (or link existing issues), but I also wanted a sort of high-level summary to help clarify the way the ideas support each other. Concretely, I'd like to make the following changes:

- Make `output_labels` strictly a class option, and not allow changing it on instances (#265, #266, #269)
- No more `inputs_map` or `outputs_map` on macros (#276)
- No `inputs_map` or `outputs_map` on workflows either, i.e. only allow access via the standard "scoped label" (`f"{target_channel.owner.label}__{target_channel.label}"`)
- ~Present the IO maps in the pedagogical material, but as a sort of optional thing and then mostly avoid using them (#283)~
- Stop supporting `self` for function nodes (#279)
- Scrape output labels from the returns of `Macro.graph_creator` too, but strip `f"{first argument}."` from them, e.g. `return macro.n1, macro.n2` just gets output labels `n1` and `n2` (if the function is `def MyMacro(macro, ...)`)
- Refactor `AbstractFunction` and `AbstractMacro` to a common parent class (#282)
- Make `wf` the canonical first-argument for macro `graph_creator` functions in the notebooks, and make it `self` in the tests -- the former lets us tell a story, the latter makes sense because that's what it is! (#283)

-- BREAK (merge above, then proceed below, maybe after a pause for efficiency/scaling work) --

- A `DataNode` (per @JNmpi's suggestion) for dataframes and dataclasses (#306, #308)
- Delay graph construction to the `run` call so instances can freely adapt to input of different lengths (partial progress in #276, #309)

## `output_labels` modification exclusively on classes

This is fairly simple. Right now, you can modify the output labels on each instance.
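For example, with a toy stand-in class (plain Python, not the real node API), the kind of instance-level renaming in question looks like:

```python
class UserInput:
    # Class-level IO signature (a toy stand-in, not the real node class)
    output_labels = ("user_input",)

    def __init__(self, value, output_labels=None):
        self.value = value
        if output_labels is not None:
            # Instance-level override -- exactly the behaviour being removed
            self.output_labels = output_labels

n = UserInput(42, output_labels=("renamed",))
print(n.output_labels)          # ('renamed',)
print(UserInput.output_labels)  # ('user_input',) -- class signature untouched
```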
This has only changed the name of the output label for the instance `renamed`, and hasn't changed the expected class IO signature for `UserInput` at all.

As of #266, for function nodes (children of `AbstractFunction`), it's no longer possible to do this -- `output_labels` is strictly available when defining the class, and the interface naming scheme is then static for all instances of that class. That means you can freely set them when using the `Workflow.wrap_as.function_node(*output_labels)` decorator or the `Workflow.create.Function(..., output_labels=None)` class-creation interface, but then they're fixed.

The advantage to this is that we can already peek at the IO at the class level.
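As a stdlib-only mock-up of that idea (not the actual `preview_input_channels` implementation), the class alone is enough to report its IO:

```python
import inspect

def function_node(*output_labels):
    """Toy decorator: bakes output labels and an input preview into a class."""
    def decorator(func):
        class Node:
            _func = staticmethod(func)
            _output_labels = output_labels

            @classmethod
            def preview_input_channels(cls):
                # Labels and defaults come straight from the wrapped signature
                return {
                    name: (p.default if p.default is not inspect.Parameter.empty else None)
                    for name, p in inspect.signature(func).parameters.items()
                }

            @classmethod
            def preview_output_channels(cls):
                return cls._output_labels
        return Node
    return decorator

@function_node("sum")
def add(x, y=2):
    return x + y

# No instance needed -- the IO is fully known at the class level
print(add.preview_input_channels())   # {'x': None, 'y': 2}
print(add.preview_output_channels())  # ('sum',)
```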
This is critical for guided workflow design (ala `ironflow`), and also helped to simplify some code under the hood. I would like to make a similar change to `AbstractMacro`.

## No more maps
@samwaseda, when we talked after the pyiron meeting this week, I expressed my sadness at the unavoidability of the `inputs_map` and `outputs_map` for allowing power-users to modify existing macros. After giving it more thought, I'm pretty sure that we can get rid of them after all!

Since #265, `AbstractMacro.graph_creator` is a `@classmethod` (as is `AbstractFunction.node_function`). Combined with the idea above of guaranteeing that `output_labels` are strictly class and not instance features, that means a power-user can modify an existing macro by defining a new macro class leveraging the base class's `.graph_creator`. Concretely, on #265 I can already do this; it isn't quite ideal yet, but with a few more changes I am confident I can get it down to something much tighter.
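In a toy stand-in where the "graph" is just a dict (plain Python, not the real pyiron_workflow API), the pattern looks like:

```python
class AbstractMacro:
    # Toy stand-in: the real class builds a node graph, this one fills a dict
    @classmethod
    def graph_creator(cls, macro):
        raise NotImplementedError

class MyMacro(AbstractMacro):
    @classmethod
    def graph_creator(cls, macro):
        macro["n1"] = 1
        macro["n2"] = 2
        return "n1", "n2"

class ModifiedMacro(MyMacro):
    @classmethod
    def graph_creator(cls, macro):
        # Reuse the parent's graph wholesale...
        labels = MyMacro.graph_creator(macro)
        # ...then replace just the piece the power-user cares about
        macro["n2"] = 20
        return labels

graph = {}
labels = ModifiedMacro.graph_creator(graph)
print(labels, graph)  # ('n1', 'n2') {'n1': 1, 'n2': 20}
```

Because `graph_creator` is a classmethod, the subclass needs no instance state to reuse and rewire its parent's graph.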
This doesn't offer identical functionality to being able to set `inputs_map` and `outputs_map`, but IMO it offers equivalent functionality in a more transparent and robust way.

## Get rid of the maps entirely

At the same time, I'd like to get rid of the maps completely by removing them from `Workflow` too! This just means that you can't define shortcuts to IO at the workflow level and always need to use the fully-scoped name, like `wf(some_child__some_channel=42)`, instead of adding a map: `wf.inputs_map = {"some_child__some_channel": "kurz"}; wf(kurz=42)`. This is a price I'm willing to pay to remove the complexity from both the code and the user's head, but I'm not married to this part of the idea.

## Don't auto-populate macro IO
Finally, the default right now is that if you don't use the function-like definition or `output_labels` for your macro, you get IO based on the unconnected children, i.e. the implicit, auto-populated definition is equivalent to writing the IO out explicitly.
I'd like to stop auto-populating things and force the macro definition to be explicit.
Cons:
Pros:
## An aside on efficiency

Right now, when a macro has input arguments in its signature beyond the first `macro: AbstractMacro` item, we prepopulate the graph with a `UserInput` node for each signature item when we build it. This works fine, and is necessary when that input is getting bifurcated to be used in multiple child nodes -- but if we require the function-signature approach to graph definition, there will be times when the input is used in only one place, and it's downright inefficient to stick an intermediate `UserInput` node in the way! The macro-level input can simply "value link" to the child node's input directly.

I already made a branch yesterday that takes care of this and purges such useless nodes at the end of graph creation, so there's no big concern about efficiency. Unfortunately, while it went 99% smoothly, this feature interacts poorly with the combination of input maps and storage, so a couple of tests fail where a workflow owns and reloads a macro. I am confident that adding this efficiency change back in will be possible once `output_labels` are class properties and `inputs_map` is gone.

## Stop supporting `self` for function nodes

@JNmpi, when we had to stop ourselves from hijacking the pyiron meeting on Monday to talk about `pyiron_workflow`, you seemed to me to be expressing the idea that function nodes should be really stateless, and that if someone wants state they should just write a function node to handle it and put the whole thing in a macro. I am 100% on board with this perspective -- let's really encourage function nodes to be functional!

To do this, I'd like to just get rid of support for `self` showing up in `AbstractFunction.node_function` functions entirely. It already breaks in some places that we need to work around, so it will feel good to remove it.

From an ideological and UX perspective, I really like this, because at this point in the to-do list function nodes are stateless and always wrap a function like `def foo(x, y, z)`, while macro nodes are stateful and always wrap a function that basically has `self` in it, like `def bar(macro, a, b, c)`.

## Data nodes
IMO, the one real downside to forcing users to explicitly define their node IO as part of the function signature/output labels is that it might get a bit verbose for nodes with lots of input -- this is especially true for macros.
@JNmpi in #33 has already been working with dataclasses to package together sets of related input. This allows sensible defaults to be provided, and lets developers build up input/output by composition using multiple inheritance. All great stuff. In the context of nodes, I see this making it more succinct to write the IO like this:
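For example, with illustrative class names (a stdlib-only sketch, not the pyiron_workflow API):

```python
from dataclasses import dataclass, field, fields

@dataclass
class RelaxInput:
    # Sensible defaults travel with the class
    engine: str = "lammps"
    max_steps: int = 100
    force_tolerance: float = 1e-4

@dataclass
class ThermalInput(RelaxInput):
    # Composition by inheritance: reuse RelaxInput and extend it
    temperatures: list = field(default_factory=lambda: [300])

def my_macro_body(inp: ThermalInput):
    # One argument instead of many -- the node IO stays succinct
    return {f.name: getattr(inp, f.name) for f in fields(inp)}

result = my_macro_body(ThermalInput(max_steps=500))
print(result["max_steps"])  # 500
print(result["engine"])     # 'lammps', from the inherited default
```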
Then even if the dataclass has lots of fields, we don't need to write them all out all the time.
This idea is already discussed on #208.
## For loops
Ok, so with all this in place we can get to the actual meat of the project, which is facilitating clean, powerful, and robust for-loops. @JNmpi, you mentioned on Monday wanting to be able to pass nodes (or at least node classes) to other nodes, and I think it's the class-level availability of IO signatures that is critical for this. Once we can do this and have `SomeNodeClass.preview_input/output_channels()` available for macro nodes like they already are for function nodes, we'll be able to pass classes to a constructor that dynamically defines a new class with corresponding IO!

The spec for a for-loop is then to have a creator like `Function` (NOT `AbstractFunction`) that creates a new child of `AbstractMacro` that...

- uses the body node's `.__class__` attribute to create the new for-macro class, and then uses the instance's specific IO values and connections to update the IO of the new for-macro instance
- uses the `ironflow` syntax of capitalization to combine this specification with values, like `ForLoop(Workflow.create.atomistics.BulkStructure, LATTICE=[3.9, 4, 4.1], species="Al")`, or simply specifies which channels are going to get scattered and delays actual values to later, like `ForLoop(Workflow.create.atomistics.BulkStructure, scatter_on=("lattice",))`
- like `Function`:`AbstractFunction`, we have a `ForLoop` that is dynamically creating a new child class of some `AbstractForLoop`, where the body node class and IO (including broadcast vs scattered) are all defined at the class level; then we immediately instantiate it and are free to populate this IO (possibly using some body class instance's values, if an instance was passed instead of a class)
- returns `scattered_dataframe: pandas.DataFrame` ala @JNmpi's `iter` method that links the scattered input items to child node output, and `broadcast_input: DotDict` that gives easy access to all the input that is identical across all rows of the dataframe. (In principle we could return just the dataframe, but duplicating the non-scattered input like that seems needlessly memory-inefficient...)
- differs from my current `For` meta-node that has list-like channels for each of the child node's input. That's because my `For` meta-node handles only a single for loop, while @JNmpi's `iter` method handles looping on multiple values in nested for-loops. His way is better
- offers `ForZipped` and `ForNested` interfaces, where the nested version is like the current `iter` on #33 and `Zipped` zips instead of nesting the scattered input
- gets exposed via a `Node.iter` method (well, `iter_zipped` and `iter_nested`, probably)
- the `iter` methods will need to work a little differently on `Workflow`, which is a parent-most object and not able to pass itself in as the relevant class for the `ForLoop` creator class, but this is an edge-case that can be defined in the `Workflow` class itself by overriding the methods
- `max_workers`

That's a monstrous wall of text, so let me see if I can end with a for-loop syntax example.
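Something along these lines -- a rough, runnable sketch of the semantics rather than the eventual syntax, where a plain function and a list of dicts stand in for the body node and the dataframe (all names illustrative):

```python
def for_loop(body, scatter_on, **broadcast):
    """Toy ForLoop creator: scatter declared channels, broadcast the rest."""
    def run(**scattered):
        (label, values), = scattered.items()  # single scattered channel here
        assert label in scatter_on, f"{label} was not declared as scattered"
        # One "dataframe row" (a plain dict) per scattered value
        return [
            {label: v, **broadcast, "energy": body(v, **broadcast)}
            for v in values
        ]
    return run

def bulk_energy(lattice, species):
    # Pretend physics standing in for a structure -> energy node chain
    return -(lattice ** 2)

looped = for_loop(bulk_energy, scatter_on=("lattice",), species="Al")
df = looped(lattice=[3.9, 4.0, 4.1])
print(len(df))  # 3 rows, one per scattered lattice value
```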
Or if we don't care about the workflow but just want to get a dataframe quickly, we could use the shortcut to say something like:
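Mocked up the same way (a toy `iter_node` helper, not the real method), the shortcut just hands back the dataframe directly:

```python
def iter_node(body, **kwargs):
    # Toy shortcut: list-valued kwargs are scattered, everything else broadcast
    scattered = {k: v for k, v in kwargs.items() if isinstance(v, list)}
    broadcast = {k: v for k, v in kwargs.items() if k not in scattered}
    (label, values), = scattered.items()  # one scattered channel, for brevity
    return [
        {label: v, **broadcast, "out": body(v, **broadcast)} for v in values
    ]

def square_plus(x, offset=0):
    return x * x + offset

df = iter_node(square_plus, x=[1, 2, 3], offset=10)
print([row["out"] for row in df])  # [11, 14, 19]
```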
Linked issue #72
## Conclusion
I've already started down this path with #265, #266, and the un-pushed work pruning off the unused macro IO nodes. I like this direction and will just keep hacking away at it until `main` and #33 are totally compliant with each other. I am very happy for any feedback on these topics!