pyiron / pyiron_nodes

Prototype node library for pyiron_workflows
BSD 3-Clause "New" or "Revised" License

Some issues with latest pyiron_nodes and pyiron_workflows libraries #68

Open JNmpi opened 3 weeks ago

JNmpi commented 3 weeks ago

Thanks @liamhuber for the updated version of pyiron_nodes. I used main to create dev_joerg_from_main and made some changes there. Most of the issues when updating to this version were easy to resolve, and most of the tools, such as elastic_constants and phonons, are working nicely.

The following workflow gives an error (AttributeError: 'str' object has no attribute 'working_directory'):

  from pyiron_workflow import Workflow   
  import pyiron_nodes as pn

  wf = Workflow('lammps_macro')
  wf.bulk = pn.atomistic.structure.build.CubicBulkCell('Pb', cell_size=3)
  wf.lammps = pn.atomistic.engine.lammps.Code(structure=wf.bulk)

  wf.bulk.pull()
  # wf.lammps.Collect.pull() # works
  wf.run() # does not work when previous line is commented out

A few other issues:

liamhuber commented 3 weeks ago

Thanks @liamhuber for the updated version of pyiron_nodes. I used main to create dev_joerg_from_main and made some changes there. Most of the issues when updating to this version were easy to resolve, and most of the tools, such as elastic_constants and phonons, are working nicely.

@JNmpi, super! I deleted dev_joerg then so there is only one dev branch kicking around.

The following workflow gives an error (AttributeError: 'str' object has no attribute 'working_directory'):

 from pyiron_workflow import Workflow   
 import pyiron_nodes as pn

 wf = Workflow('lammps_macro')
 wf.bulk = pn.atomistic.structure.build.CubicBulkCell('Pb', cell_size=3)
 wf.lammps = pn.atomistic.engine.lammps.Code(structure=wf.bulk)

 wf.bulk.pull()
 # wf.lammps.Collect.pull() # works
 wf.run() # does not work when previous line is commented out

I checked out your branch and copied this into a notebook, but I'm afraid I can't reproduce the error. First, I had to manually create the directory test2 or I got an assertion error -- I guess you want some "create if missing" argument that defaults to True.
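That "create if missing" behaviour is available in the standard library alone; a minimal sketch of the generic idiom (how pyiron_workflow actually handles working_directory may differ):

```python
import os
import tempfile

# Sketch of "create if missing": instead of asserting that the directory
# exists, create it on demand. exist_ok=True makes the call idempotent.
base = tempfile.mkdtemp()
path = os.path.join(base, "test2")

os.makedirs(path, exist_ok=True)  # creates the directory
os.makedirs(path, exist_ok=True)  # no error if it already exists

print(os.path.isdir(path))
```

A working_directory setter with an optional create-if-missing flag defaulting to True could wrap exactly this call.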

After that I tried un/commenting different combinations, cleaning the test directory and restarting my kernel each time:

# wf.bulk.pull()
# wf.lammps.Collect.pull()
wf.run() 

Works

wf.bulk.pull()
# wf.lammps.Collect.pull()
wf.run() 

Works

wf.bulk.pull()
wf.lammps.Collect.pull()
wf.run() 

Works

#wf.bulk.pull()
wf.lammps.Collect.pull()
wf.run()  # OR commented out 

Doesn't work -- lammps gets a readiness error. This makes sense because you're pulling from inside the macro, which stops at the boundary of the macro by default, so wf.bulk never gets run and lammps is indeed not ready. If you are absolutely determined to pull the child like this, you can get around the stop with wf.lammps.Collect.pull(run_parent_trees_too=True), but in general I think running the outer nodes first is better form.

For readability, I would also suggest in Code (and other macros/workflows) that the child instances be lower-case, e.g. wf.collect = Collect(...

The nodes do not show the list of arguments/docstrings. Did you implement the annotate function in nodes? Having this feature would be great.

Sort of. The nodes capture the docstring of the functions they decorate (try tab-completing a node from the standard library to see). I like the idea of additionally capturing the annotations, although we will need to be careful to do it in a way that users see the function annotation in addition to the node annotation. Since user-facing nodes should anyhow have docstrings, implementing this is low-priority for me.

To implement the 'source' functionality in PyironFlow I used the node.node_function property, which however does not exist for macros. It would be important to have the same property for macros, also to make the functionality and behavior of nodes and macros as similar as possible.

We do: Macro.graph_creator, and analogously DataclassNode.dataclass. They all have different names because the functionality is so different, but the core concept across all three as_*_node-decoratable classes is the same, and there is a way to access each 'source'.

Re. mutable arguments (e.g. calculator=InputCalcStatic(), # TODO: Don't use mutable defaults): dataclasses such as InputCalcStatic() can be nicely used as immutable arguments. I did this in my hash-based approach. Changes to the default arguments are given explicitly (e.g. MyDataClass(a=10)) and can thus be easily hashed via the (non-default) arguments. I found this construct really helpful in implementing hashing.

If I follow, you're suggesting that all changes to the data object be passed by passing an entire new data object? AFAIK this will work, but I don't think it's a reasonable expectation to put on users (or devs). In general we can avoid it by just not using dataclass instances (or any other mutable object) as defaults.

Demo of dataclasses as a default being problematic:

from dataclasses import dataclass, field

@dataclass
class Data:
    immutable_data: int = 42
    mutable_data: list[int] = field(default_factory=lambda: [1, 2, 3])

class Foo:
    def __init__(self, data=Data()):
        self.data = data

f1 = Foo()
print(f1.data)
# Data(immutable_data=42, mutable_data=[1, 2, 3])
f2 = Foo()
f2.data.immutable_data = "Not an int"
f2.data.mutable_data = "Not a list"
print(f1.data)
# Data(immutable_data='Not an int', mutable_data='Not a list')
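The usual fix is a None sentinel, so each instance builds its own fresh default; a minimal sketch reusing the same toy Data class:

```python
from dataclasses import dataclass, field

@dataclass
class Data:
    immutable_data: int = 42
    mutable_data: list[int] = field(default_factory=lambda: [1, 2, 3])

class Foo:
    def __init__(self, data=None):
        # Build a fresh Data per instance instead of sharing one default
        self.data = Data() if data is None else data

f1 = Foo()
f2 = Foo()
f2.data.immutable_data = "Not an int"
print(f1.data)
# Data(immutable_data=42, mutable_data=[1, 2, 3])  -- f1 is unaffected
```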
JNmpi commented 3 weeks ago

Thanks @liamhuber for the quick response.

Thanks @liamhuber for the updated version of pyiron_nodes. I used main to create dev_joerg_from_main and made some changes there. Most of the issues when updating to this version were easy to resolve, and most of the tools, such as elastic_constants and phonons, are working nicely.

@JNmpi, super! I deleted dev_joerg then so there is only one dev branch kicking around. Are the deleted branches archived, e.g., is there a way to have a look at them? Or are they completely deleted?

The following workflow gives an error (AttributeError: 'str' object has no attribute 'working_directory'):

from pyiron_workflow import Workflow   
import pyiron_nodes as pn

wf = Workflow('lammps_macro')
wf.bulk = pn.atomistic.structure.build.CubicBulkCell('Pb', cell_size=3)
wf.lammps = pn.atomistic.engine.lammps.Code(structure=wf.bulk)

wf.bulk.pull()
# wf.lammps.Collect.pull() # works
wf.run() # does not work when previous line is commented out

I checked out your branch and copied this into a notebook, but I'm afraid I can't reproduce the error. First, I had to manually create the directory test2 or I got an assertion error -- I guess you want some "create if missing" argument that defaults to True.

After that I tried un/commenting different combinations, cleaning the test directory and restarting my kernel each time:

# wf.bulk.pull()
# wf.lammps.Collect.pull()
wf.run() 

Works

wf.bulk.pull()
# wf.lammps.Collect.pull()
wf.run() 

Works

wf.bulk.pull()
wf.lammps.Collect.pull()
wf.run() 

Works

#wf.bulk.pull()
wf.lammps.Collect.pull()
wf.run()  # OR commented out 

Doesn't work -- lammps gets a readiness error. This makes sense because you're pulling from inside the macro, which stops at the boundary of the macro by default, so wf.bulk never gets run and lammps is indeed not ready. If you are absolutely determined to pull the child like this, you can get around the stop with wf.lammps.Collect.pull(run_parent_trees_too=True), but in general I think running the outer nodes first is better form.

For readability, I would also suggest in Code (and other macros/workflows) that the child instances be lower-case, e.g. wf.collect = Collect(...

I rebooted my computer and it is now also running for me. I have no idea what went wrong before.

The nodes do not show the list of arguments/docstrings. Did you implement the annotate function in nodes? Having this feature would be great.

Sort of. The nodes capture the docstring of the functions they decorate (try tab-completing a node from the standard library to see). I like the idea of additionally capturing the annotations, although we will need to be careful to do it in a way that users see the function annotation in addition to the node annotation. Since user-facing nodes should anyhow have docstrings, implementing this is low-priority for me.

For working with the nodes, I really miss this handy feature, so it has a high priority for me. Since all I had to do to get the annotations was to add a single line in nodes, adding it should be straightforward. For me, the 'in addition' aspect has low priority because I want to set the arguments in the function, not in the node class. So just having the function arguments would be fine with me.

To implement the 'source' functionality in PyironFlow I used the node.node_function property, which however does not exist for macros. It would be important to have the same property for macros, also to make the functionality and behavior of nodes and macros as similar as possible.

We do: Macro.graph_creator, and analogously DataclassNode.dataclass. They all have different names because the functionality is so different, but the core concept across all three as_*_node-decoratable classes is the same, and there is a way to access each 'source'.

An important design principle in pyiron is a uniform interface, i.e. we put a lot of emphasis on the fact that the same functionality in different classes can be accessed by the same commands. This lowers the entry barrier for users, and also makes the development of a GUI etc. easier to implement and read. I would therefore strongly advocate having a generic keyword for the same functionality, independent of the specific class. Besides the generic keyword we can also keep the class-specific keywords, but I do not see any real advantage in having only those.

Re. mutable arguments (e.g. calculator=InputCalcStatic(), # TODO: Don't use mutable defaults): dataclasses such as InputCalcStatic() can be nicely used as immutable arguments. I did this in my hash-based approach. Changes to the default arguments are given explicitly (e.g. MyDataClass(a=10)) and can thus be easily hashed via the (non-default) arguments. I found this construct really helpful in implementing hashing.

If I follow, you're suggesting that all changes to the data object be passed by passing an entire new data object? AFAIK this will work, but I don't think it's a reasonable expectation to put on users (or devs). In general we can avoid it by just not using dataclass instances (or any other mutable object) as defaults.

Demo of dataclasses as a default being problematic:

from dataclasses import dataclass, field

@dataclass
class Data:
    immutable_data: int = 42
    mutable_data: list[int] = field(default_factory=lambda: [1, 2, 3])

class Foo:
    def __init__(self, data=Data()):
        self.data = data

f1 = Foo()
print(f1.data)
# Data(immutable_data=42, mutable_data=[1, 2, 3])
f2 = Foo()
f2.data.immutable_data = "Not an int"
f2.data.mutable_data = "Not a list"
print(f1.data)
# Data(immutable_data='Not an int', mutable_data='Not a list')

When designing the pyiron_nodes library, I found it extremely helpful and elegant to use a dataclass object as the default for function input. An example is the InputCalcMD dataclass, which contains all the default parameters. To change them, the user just has to provide InputCalcMD(temperature=400), i.e. the changes are directly reflected in a kwargs dictionary. If we sort the keywords alphabetically, we get a canonical, hashable representation. Since we are talking about objects that are linked to user input, there will be no long lists or np.arrays (which would indeed be mutable). The only exception might be something like n_repeat, which is a 3D vector. For such cases we can use tuples, which are immutable.
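The tuple-plus-immutability idea can be pushed further with frozen dataclasses; a minimal sketch (the field names here are illustrative stand-ins, not the actual InputCalcMD definition in pyiron_nodes):

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # instances cannot be mutated after creation
class InputCalcMD:
    # Illustrative fields, not the real pyiron_nodes definition
    temperature: float = 300.0
    n_ionic_steps: int = 1000
    n_repeat: tuple[int, int, int] = (1, 1, 1)  # tuple, not list: immutable

defaults = InputCalcMD()
custom = InputCalcMD(temperature=400)  # only the change is spelled out

print(custom.temperature, defaults.temperature)
try:
    custom.temperature = 500  # any attempted mutation fails loudly
except Exception as err:  # dataclasses.FrozenInstanceError
    print(type(err).__name__)
```

With frozen=True the instances are also hashable by default, which fits the hash-based caching argument above.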

liamhuber commented 3 weeks ago

For working with the nodes, I really miss this handy feature, so it has a high priority for me. Since all I had to do to get the annotations was to add that single line in nodes, adding it should be straightforward. For me, the in addition aspect has a low priority because I want to set the arguments in the function, not in the node class. So just having the function arguments would be fine with me.

Do you mean '__annotate__': node_function.__annotate__ if hasattr(node_function, '__annotate__') else None? I have this line in my notes. I had to switch __annotate__ to __annotations__, which then successfully propagates the function annotations over to the node class, but this still doesn't show up for me in tab completion. Perhaps it additionally requires some Jupyter magic?

I'm anyhow not comfortable with anything that uniformly overrides node documentation/completion with stuff from the underlying function. The node is always a node, and especially at instantiation it is critical that users be able to see the node input as well. Therefore any function documentation/signature captured at node initialization can only be in addition to the existing node info.

I fully agree we need to do more scraping of decorated object signature data and to display it more aggressively in tab-completion and docstrings, but I just don't think this is trivially straightforward (at least not getting it to reliably play nicely with Jupyter) and so I am not going to be able to get it done promptly.
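For what it's worth, one likely reason that copying __annotations__ alone doesn't change tab completion is that Jupyter's signature popup goes through inspect.signature, which consults a __signature__ attribute before anything else. A hypothetical sketch with a stand-in Node class (not the real pyiron_workflow one):

```python
import inspect

def node_function(structure: str, temperature: float = 300.0) -> dict:
    """Run a (pretend) calculation."""
    return {"structure": structure, "temperature": temperature}

class Node:
    """Stand-in node class, not the real pyiron_workflow one."""

# Propagate the wrapped function's metadata onto the node class:
Node.__annotations__ = dict(getattr(node_function, "__annotations__", {}))
# inspect.signature (which IPython/Jupyter help consults) checks
# __signature__ first, so setting it is what changes the displayed call
# signature:
Node.__signature__ = inspect.signature(node_function)

print(inspect.signature(Node))
```

This only swaps the signature rather than merging it with the node-level one, so it illustrates the "override vs. in addition" tension rather than resolving it.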

An important design feature in pyiron is factoring, i.e. we put a lot of emphasis on the fact that the same functionality in different classes can be accessed by the same commands. This design feature lowers the entry barrier for users, but also makes the development of a GUI etc. easier to implement and read. I would therefore strongly advocate having a generic keyword for the same functionality, independent of the specific class. Besides the generic keyword, we can also keep class-specific keywords, but I do not see any real advantage.

I fully agree there is room for an abstraction here. However, it's not immediately clear to me what that abstraction should be or how to implement it. What nodes get a "source"? What nodes don't? How do we signify the availability/requirement of such a source -- by inheritance from some HasSource class? More generally, how should ?? behave on nodes that don't have a source?

Naively, function, macro, and dataclass nodes should have this at a minimum. However, each of these already makes the necessary data available on a case-by-case basis, and the different variable names make it obvious why each of these nodes holds this data for its use case.

Without some underlying reasoning, I'm loathe to introduce "abstractions" that are arbitrary special behaviour, as in the long run I think this makes things more rather than less complicated. Since the desired data is anyways accessible, I'd rather not rush into something.
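For concreteness, one shape such an abstraction could take is a small mixin; everything below (HasSource, FunctionNode) is a hypothetical sketch, not existing pyiron_workflow API:

```python
from abc import ABC, abstractmethod

class HasSource(ABC):
    """Marks node classes that expose the decorated object they wrap."""

    @property
    @abstractmethod
    def source(self):
        """The function/graph-creator/dataclass this node was built from."""

class FunctionNode(HasSource):
    """Hypothetical stand-in for an as_function_node-produced class."""

    def __init__(self, node_function):
        self.node_function = node_function  # keep the class-specific name

    @property
    def source(self):
        # Generic alias onto the class-specific attribute, so a GUI can
        # ask any HasSource node for its 'source' uniformly
        return self.node_function

def add_one(x):
    return x + 1

node = FunctionNode(add_one)
print(isinstance(node, HasSource), node.source(1))
```

The class-specific names (node_function, graph_creator, dataclass) could remain, with source as the uniform access point a GUI like PyironFlow would query.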

When designing the pyiron_nodes library, I found it extremely helpful and elegant to use a dataclass object as the default for function input. An example is the InputCalcMD dataclass, which contains all the default parameters. To change them, the user just has to provide InputCalcMD(temperature=400), i.e. the changes are directly reflected in a kwargs dictionary. If we sort the keywords alphabetically, we get a canonical, hashable representation. Since we are talking about objects that are linked to user input, there will be no long lists or np.arrays (which would indeed be mutable). The only exception might be something like n_repeat, which is a 3D vector. For such cases we can use tuples, which are immutable.

I cannot recommend going down this path. You're right that a user who provides an entirely new InputCalcMD to a node gets a clean instance, but that doesn't stop your mutable default from being a mutable default. E.g. you may not be able to reach this state with regular data flow through a workflow, but there is absolutely nothing stopping a user from taking shortcuts and getting completely confused by the side effects of mutability:

import pyiron_nodes as pn

n1 = pn.atomistic.engine.lammps.CalcMD()
n2 = pn.atomistic.engine.lammps.CalcMD()
print(n2.inputs.calculator_input.value.temperature)
# 300
n1.inputs.calculator_input.value.temperature = 0
print(n2.inputs.calculator_input.value.temperature)
# 0

At the end of the day, I feel the minor convenience of instantiating the default right in the signature, instead of waiting for the function body to say my_data = MyDataclass() if my_data is None else my_data, is just not worth the price we pay for having a mutable default.