pyiron / pyiron_workflow

Graph-and-node based workflows
BSD 3-Clause "New" or "Revised" License
12 stars 1 forks source link

Alternative interfaces for macro definition #85

Closed liamhuber closed 10 months ago

liamhuber commented 11 months ago

@jan-janssen wrote

Example code for discussion, initially I had:

@single_value_node("instance")
def get_instance(instance):
    return instance

@macro_node()
def internal_macro(wf: Macro) -> None:
    wf.get_instance = get_instance()
    wf.generate_structures = generate_structures(instance=wf.get_instance)
    wf.calc_with_calculator = calc_with_calculator(task_dict=wf.generate_structures)
    wf.fit = analyse_structures(
        instance=wf.get_instance, output_dict=wf.calc_with_calculator
    )
    wf.inputs_map = {
        "get_instance__instance": "instance",
        "calc_with_calculator__calculator": "calculator",
    }
    wf.outputs_map = {"fit__fit_dict": "fit_dict"}

I need the get_instance() function to connect one macro input to two function node inputs. Ideally, I would like to have something like:

@macro_node(output_labels="fit_dict")
def internal_macro(wf: Macro, instance, calculator) -> None:
    wf.generate_structures = generate_structures(instance=instance)
    wf.calc_with_calculator = calc_with_calculator(calculator=calculator, task_dict=wf.generate_structures)
    wf.fit = analyse_structures(
        instance=instance, output_dict=wf.calc_with_calculator
    )
    return wf.fit.fit_dict

With this syntax the user has all the flexibility which nodes to make available on the outside and how to link them internally. Using inspect.get_signature() it should be possible to get the input arguments, assign them to the workflow and then return the variables as links to the workflow nodes again, like:

wf.instance = instance 
instance = wf.instance 

And the output could be handled in analogy to the node definitions.

liamhuber commented 11 months ago

So there's a number of things going on here. For me the first and most important is that when you have a macro where two (or more) children are receiving the same input, there is no convenient way to synchronize these; as shown in your top example, a UserInput node needs to be explicitly added to the macro, which is a pass-through node for the shared input, and allows that input to be branched to an arbitrary number of children with regular connections. I totally agree that it's a pain in the butt to write all this, and I like the idea of using additional args (or kwargs if you want a default value) as a shorthand for this.

How do we implement such a thing? Under the hood, there is actually some formalism for how all this is happening. In this case, the relevant parts are that "connections" are only formed between input and output channel pairs, IO or OI, never II or OO. In addition to this, two channels of the same category (II or OO) can be "linked" so that value updates to one are automatically sent to the other. Currently, channels can have an arbitrary number of connections but send their value updates to at most one other channel. These features allow us to "pull" on a node and automatically construct its upstream dependencies and give us the freedom to choose whether that pulling stays inside the scope of a given macro, or punches through macro walls to get the entire dependency stack. I'm not claiming it's the only set of properties/implementation that can give us these features, just that whatever choices we make need to continue to support that functionality.

So one choice would be to soften the linking restriction so a channel can synchronize its value with multiple same-type partners. Then when we parse something like def internal_macro(wf: Macro, instance), we modify the resulting macro input so it has an extra instance channel. I really dislike this for two reasons:

  1. It bastardizes macros and function nodes; currently macro input (like workflows) is exclusively constructed as a function of its children. IMO that is a clean and beautiful thing.
  2. It complicates the syntax, since wf.generate_structures = generate_structures(instance=instance) is a "connection" creation, but now we need to actually make a "link" with that instead

Another choice would be to interpret the macro creating function in two passes: first, scrape off all args/kwargs beyond the first one and use them to create UserInput nodes with the corresponding name (and type hints/default values, if those are provided), then pass that pre-populated Macro instance to the provided function with the nodes as arguments and parse as normal. To the user this looks the same as the first attack, but

  1. Macros are still just macros, and their IO is still just build from the IO of their children
  2. Since we are now passing a SingleValueNode into the original macro-defining function, lines like wf.generate_structures = generate_structures(instance=instance) work natively

So I think this can be done and I like it as a short-hand so that users don't need to define their own UserInput nodes when they want to fork input to multiple children. The "one choice" is really here for the education of any readers that come by, but I'm being so verbose about "Another choice" also so I've got a reminder on how I want to implement it.

Then the second thing is restricting macro output with the return value, i.e. return wf.fit.fit_dict in your example. TBH, I am really not a fan of this at all. Once #84 is implemented, it doesn't really make things easier because wf.fit.fit_dict would already be getting exposed on macro.outputs.fit_dict anyhow (where macro is just an instance of the macro`), so I guess the only thing this would be doing is implicitly disabling all other output that would otherwise be scraped (i.e. other disconnected outputs of children).

  1. Implicit is better than explicit. Sometimes implicit is worth it for the convenience, e.g. as above with implicitly creating UserInput nodes, but here I'm not at all convinced it's a worthwhile tradeoff.
  2. Parsing input args is super easy. Parsing returned values is a gigantic pain in the ass. I don't want to try to implement this.

With that said, while I am happy with the existence of inputs_map and outputs_map and pleased with their simplicity and the level of customization they provide, I don't like that they are all string-based. I think this is yet again a separate issue to raise, but it would be nice to leverage tab-completable keys like wf.outputs_map = {wf.fit.fit_dict: "my_custom_label_at_the_macro_level"}

Original wish

@macro_node(output_labels="fit_dict")
def internal_macro(wf: Macro, instance, calculator) -> None:
    wf.generate_structures = generate_structures(instance=instance)
    wf.calc_with_calculator = calc_with_calculator(calculator=calculator, task_dict=wf.generate_structures)
    wf.fit = analyse_structures(
        instance=instance, output_dict=wf.calc_with_calculator
    )
    return wf.fit.fit_dict

Pros:

Cons:

Proposed alternative

@macro_node()
def internal_macro(wf: Macro, instance) -> None:
    wf.generate_structures = generate_structures(instance=instance)
    wf.calc_with_calculator = calc_with_calculator(task_dict=wf.generate_structures)
    wf.fit = analyse_structures(
        instance=instance, 
        output_dict=wf.calc_with_calculator
    )
    # And supposing there was some other output we wanted to explicitly silence
    wf.outputs_map = {"generate_structures__imagined": None}
    # Or even cooler, but is a separate issue:
    # wf.outputs_map = {wf.generate_structures.outputs.imagined: None}

Pros:

Cons: