pyiron / ironflow

Prototype of a graphical user interface for pyiron (unstable)
https://mybinder.org/v2/gh/pyiron/ironflow/HEAD?labpath=example.ipynb
BSD 3-Clause "New" or "Revised" License

Abstraction between representations and calculations/evolutions #109

Open liamhuber opened 2 years ago

liamhuber commented 2 years ago

From our conversation in #102

> The following point is related to pyiron, but required to get nice solutions for ironflow: it would be good to have an alternative formulation for some pyiron constructs. An example is

>     job.calc_md(temperature=300)  # previous
>     job.calc = calculator.md(temperature=300)   # alternative

> Such a formulation would allow us to easily create generic modules, e.g. for MD, that can be used as input for VASP, LAMMPS, etc. The same applies to job.server etc.

Yep. I will probably hack-and-slash something to this effect after I get the book-keeping (CI, pypi release, conda-forge release) all sorted out. My current vision is that things like the Lammps node can only ever run a static calculation (or no running of these nodes??) and that they otherwise need to be passed as input to MD, minimization, NEB, VC-SGC, etc. calculators.

This is thus a little bit different from what you suggest here, something more like:

md = pr.atomistics.calculator.MD('my_md_run')
md.temperature = 300
md.n_steps = 1000

lammps = pr.atomistics.representation.Lammps('lammps_reference')
lammps.structure = pr.atomistics.structure.bulk('Fe').repeat(4)
md.representation = lammps

md.run()

In a perfect world I imagine that the calculator would then check whether the representation has native support for this type of calculation, e.g. Lammps has calc_md, so we would basically just be using the calculator as a wrapper interface for that -- but if the representation doesn't allow it, the calculator either falls back on a python-coded version of the algorithm or throws an error (probably when the representation is assigned, so we fail as early as possible).
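A minimal sketch of that fallback logic, with all the names here (MD, calc_md, get_forces) purely illustrative rather than actual pyiron API:

class MD:
    """Hypothetical generic MD calculator wrapping a representation."""

    def __init__(self, name):
        self.name = name
        self.temperature = 300.0
        self.n_steps = 1000
        self._representation = None
        self._run = None

    @property
    def representation(self):
        return self._representation

    @representation.setter
    def representation(self, rep):
        # Fail as early as possible: check support when the representation is assigned
        if hasattr(rep, "calc_md"):
            # Native support: the calculator is just a wrapper interface
            self._run = lambda: rep.calc_md(temperature=self.temperature)
        elif hasattr(rep, "get_forces"):
            # Fall back on a python-coded version of the algorithm
            self._run = lambda: self._python_md(rep)
        else:
            raise TypeError(f"{rep!r} supports neither native nor generic MD")
        self._representation = rep

    def run(self):
        if self._run is None:
            raise RuntimeError("Assign a representation before running")
        return self._run()

    def _python_md(self, rep):
        ...  # e.g. a velocity-Verlet loop driven by rep.get_forces()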

This is definitely best implemented right in the heart of pyiron, but I think it will be productive to play around with the idea here in ironflow where I'm just making it look like that's what the underlying model is. That will give us some freedom to experiment with different architectures/interfaces without breaking anything anyone else cares about.

liamhuber commented 2 years ago

I was a bit verbose here, but there's a summary at the end.

> Thanks for your quick and very constructive reply @liamhuber. Just a few thoughts regarding the calculator.MD idea. At first glance it looked to me like your suggested version makes things more complex than what we have now. In your example a single line

>     job.calc_md(temperature=300)

> would be replaced by 4 lines. In a Jupyter notebook (or generally in a Python script) I would not like to have this (or at least I would prefer to keep our present single-line approach as one possible option).

@JNmpi I'm certainly open to leaving wrapper methods on different objects that allow the user to set multiple attributes in one line. Another idea would be to allow users to set attributes directly at instantiation, e.g.

job = pr.atomistics.job.Lammps('my_lammps_job', structure=pr.atomistics.bulk('Al'), potential='my_fav_mendelev_pot')

But honestly, the above example demonstrates why I'm not totally comfortable with how we use calc_md and calc_minimize as they currently exist -- everywhere else we require setting input on a per-property basis. At the very least we should try to make the interfaces more uniform across different applications, whether that's forbidding anything other than direct attribute setting, or adding multi-attribute setter methods where they're missing, e.g.

class GenericDFTJob(AtomisticGenericJob):
    def define_representation(
        self,
        energy_cutoff: float = None,
        kpoint_mesh: list[int] = None,
        ...
    ) -> None:
        ...

> However, for a visual node-based approach your pseudo code would work very well. We reduce the number of inputs to the LAMMPS node and make the MD module generic. The workflow would then look like:

> structure -> LAMMPS -> MD

> Based on this concept we could/should also convert other inputs of the LAMMPS job into nodes. Specifically, we could have the following flow:

>   structure -> LAMMPS -> POTENTIAL -> MD -> SERVER

> The advantage of this flow would be that the potential module has all the critical information (i.e. LAMMPS and structure). This construct is much cleaner than the original one, since we do not have to test that one input (structure) is there to generate another input (potential). The same applies to SERVER, where the info regarding the queue, number of cores etc. is provided. Our present LAMMPS job could be regarded as a macro node of this workflow, collecting the input parameters of the various nodes into the macro object, i.e., structure, potential etc.

From a philosophical perspective I think this is absolutely the right way to do it. But I would call it Representation rather than Potential, and these would be things like LJ, EAM, ACE, GGA, etc. However, I'm not convinced this linear hierarchy actually solves the need to test one input against the other. If I build up a graph of Structure -> Lammps and then put down an EAM node and send Lammps->EAM, the EAM node still needs to internally compare the library of Lammps EAM files to the structure carried by the Lammps object and provide me with some sort of drop-down menu to choose a specific EAM file -- but then we run into exactly the same situation: it's possible that no such valid file exists.

I would also need some sort of protection so that I couldn't pass Vasp->EAM, nor Lammps->GGA. And I can pass Lammps->ACE, but can I pass Gromacs->ACE, even though ACE potentials are classical and Lammps and Gromacs are both classical engines?

Practically, I think this setup will wind up feeling a bit counterintuitive, since developers will need to keep potentials up to date with which engines they support (instead of engines with which potentials are possible), and since users will need to think "I want to run a GGA calculation" (instead of "I want to run a VASP calculation").

As a bit of an aside, I also think it might be a bit awkward putting Server at the end of the chain. In my mental model, the object at the end of the chain is the thing we call run on and the thing whose output we're interested in. We absolutely could have that be the server, but then we'd either need to dynamically create output for the server (e.g. by directing to the calculation output), or explicitly dig deeper like server.calculation.output. Since our server object is quite simple (queue, cores, walltime), I think it's nice to use this as input to the calculation rather than the other way around. Plus, we might often use the same server settings for multiple calculations, but it's quite rare to feed the same calculation to multiple server settings, so re-usability is nicer with the server as input.
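For illustration, a sketch of that reuse argument; the standalone Server class here is hypothetical, not existing pyiron code:

from dataclasses import dataclass

@dataclass
class Server:
    """Hypothetical standalone bundle of computational resources."""
    queue: str = "default"
    cores: int = 1
    walltime: str = "1:00:00"

# One server configuration feeding several calculations:
server = Server(queue="short", cores=4)
# md.server = server
# minimization.server = server
# The reverse -- one calculation fed to several servers -- is rare.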

> These thoughts also show that for supporting the two very different approaches to workflows (the visual node-based one vs. the command-like python-based one), different code constructs are needed.

To the extent that the visual scripting approach will probably want to flatten some inputs out instead of having many smaller nodes (a sort of macro as you suggest), I agree -- on the command line it's nice to have something like job.server.cores, but graphically we probably just want node.cores instead of forcing the user to instantiate a server-type node. Beyond that, I still think that both the graphical and command based interfaces would benefit from the same underlying architecture/code construct.

From my perspective there are a handful of interdependent pieces for an atomistic calculation: the structure (positions and species), the model (e.g. a classical potential or a DFT pseudo-potential), and the engine that evaluates the model on the structure. These are the necessary pieces for converting positions and species into forces and energies. Those can then be used to do any number of physical "calculations".

Finally, for very practical reasons we also need to specify the computational resources for the engine and/or calculation, let's just keep calling these the "server".

Right now, unfortunately, IMO we have the worst possible setup -- namely we have a mixture of two setups! Sometimes, we mangle the engine and calculator right together, EngineCalculator(HasStructure, HasModel), e.g.:

engine_calculator = pr.atomistics.Vasp('mangled')
engine_calculator.kpoint_mesh = [1, 1, 1]  # engine options
engine_calculator.potential = 'PAW-PBE'  # model
engine_calculator.structure = pr.atomistics.structure.bulk('Cu')  # structure
engine_calculator.calc_md(temperature=300, n_steps=10)  # implicit calculator
engine_calculator.server.cores = 4  # server
engine_calculator.run()  # Lets us know this object is the king, the thing that holds the output we want

Other times we break them apart, Calculator(HasEngine); Engine(HasModel, HasStructure), e.g.:

engine = pr.atomistics.Lammps('separated_engine')
engine.structure = pr.atomistics.structure.bulk('Cu')  # structure
engine.potential = engine.list_potentials()[0] # model

calculator = pr.atomistics.PhonopyJob('separated_calculator')
calculator.ref_job = engine  # In this case, the calculator will update the structure for the engine
calculator.some_options = "something, whatever"

I like this second setup better, although obviously if the engine has some sort of built-in and very efficient tool for running the calculation it's desirable to use that under the hood.
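As a sketch of this second setup, Calculator(HasEngine); Engine(HasModel, HasStructure), with all class names illustrative:

class Engine:
    """Owns the structure and the model; turns positions/species into forces/energies."""

    def __init__(self, structure=None, potential=None):
        self.structure = structure  # HasStructure
        self.potential = potential  # HasModel

    def evaluate(self, structure):
        ...  # return forces and energy for the given structure

class Calculator:
    """Owns an engine and drives the physical calculation."""

    def __init__(self, engine):
        self.engine = engine  # HasEngine

    def run(self):
        # If the engine has a built-in, efficient routine for this calculation,
        # use it under the hood; otherwise drive engine.evaluate() generically.
        ...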

Summary

In the abstract, these are the main architectures floating around my head now:

JNmpi commented 2 years ago

@liamhuber, thanks for your thoughts and detailed description. I definitely also see the need to unify the syntax for providing/modifying input and parameters. I would also opt for letting users choose among multiple options, i.e., they should be able to use whichever construction they prefer. To put this into pseudo code:

job.calc_md(temperature=300, n_print=100)

should be identical to something like:

job.set.calc.md(temperature=300, n_print=100)

This approach could be easily generalised to structures, potentials etc.

job.set.structure.bulk('Al')
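One way such a set accessor might hang together -- a sketch only, where ase.build.bulk stands in for whatever structure factory pyiron would actually delegate to:

from ase.build import bulk

class CalcSetter:
    def __init__(self, job):
        self._job = job

    def md(self, **kwargs):
        self._job.calc_md(**kwargs)  # delegate to the existing one-liner

class StructureSetter:
    def __init__(self, job):
        self._job = job

    def bulk(self, element, **kwargs):
        self._job.structure = bulk(element, **kwargs)

class SetNamespace:
    """Hypothetical job.set accessor grouping the setter families."""

    def __init__(self, job):
        self.calc = CalcSetter(job)
        self.structure = StructureSetter(job)

# With job.set = SetNamespace(job), the two spellings coincide:
# job.calc_md(temperature=300, n_print=100)
# job.set.calc.md(temperature=300, n_print=100)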

I also like your suggestion:

job = pr.atomistics.job.Lammps('my_lammps_job', structure=pr.atomistics.bulk('Al'), potential='my_fav_mendelev_pot')

To unify the input we could allow the following constructs simultaneously:

job.input.incar['encut'] = 100

job.input.incar.encut = 100

job.input.incar(encut=100)
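A sketch of a container that supports all three spellings at once (purely illustrative, not existing pyiron code):

class InputGroup:
    """Accepts item-style, attribute-style, and call-style input setting."""

    def __init__(self):
        object.__setattr__(self, "_data", {})

    def __setitem__(self, key, value):  # job.input.incar['encut'] = 100
        self._data[key] = value

    def __setattr__(self, key, value):  # job.input.incar.encut = 100
        self._data[key] = value

    def __call__(self, **kwargs):       # job.input.incar(encut=100)
        self._data.update(kwargs)

    def __getattr__(self, key):
        try:
            return self._data[key]
        except KeyError:
            raise AttributeError(key)

incar = InputGroup()
incar["encut"] = 100
incar.encut = 100
incar(encut=100)  # all three leave incar.encut == 100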

Thanks also for your detailed thoughts and discussion regarding the structure of the engineCalculator. I fully see your point when you define a node like EAM for a specific set of potentials. What I had in mind was to create a generic potential node that receives all the information it needs from the calculator, i.e., something like:

def potential(job):
    return job.list_potentials()

Thus, when replacing one job by another (i.e. Lammps by VASP) or one structure by another, the list gets updated. In the worst case no potential exists, but this could be easily handled. Since the structure is an input into the job, this construct would be a truly functional object, i.e., changing an input within one node will have no impact on another one (which it presently has). The sequence would then be:

  structure -> LAMMPS -> POTENTIAL -> MD (-> SERVER)

I fully agree that we may/should still provide a macro node that groups all the inputs together like we have it now.

To formalise this discussion, a node should be purely functional, i.e., node(i1, i2) should rely on independent inputs i1 and i2, not on something like i1 being a function of i2, i.e., i1 = i1(i2). Having this notion of a node as a pure function, and a workflow as a sequence of these functions, would make it easier to check the validity of certain constructs.

In this vein I also have a slight preference for first constructing the full job (in this case MD) and then sending this object to the server (SERVER). A possible scenario would be that, depending on the job type (i.e. LAMMPS vs. VASP), we could easily assign different resources.

The above discussion is just a quick summary of my thoughts and I am happy to get your insights.

liamhuber commented 2 years ago

Input syntax unification

I'll make a new issue for this over on base or atomistics summarizing and linking this discussion.

Nodes

> To formalise this discussion, a node should be purely functional

This is a really superb point. I wasn't thinking about it clearly in those terms, but I completely agree. I'm not sure I agree on the definition of functional though:

> a node should be purely functional, i.e., node(i1, i2) should rely on independent inputs i1 and i2, not on something like i1 being a function of i2, i.e., i1 = i1(i2)

From my perspective this is simply a matter of certain input combinations being (in)valid, which is totally fine in a functional paradigm. In this view, our current Lammps node then just provides a helpful UI layer on top of this functional node so that the user can avoid these impossible combinations.

For me what is important with functionality is that the node produces the same output given the same input, regardless of how many times it's executed. A corollary to that is that the node is not modifying some sort of global (or at least super-self) state. Here our Lammps node needs some work, since right now I think we can modify the input, hit "run", and the job will just re-load itself! But that has to do with input locking or setting delete_existing_job=True, rather than with the fact that the potential input is conditional on the structure input.

> The sequence would then be:

>   structure -> LAMMPS -> POTENTIAL -> MD (-> SERVER)

> ...

> In this vein I also have a slight preference for first constructing the full job (in this case MD) and then sending this object to the server (SERVER). A possible scenario would be that, depending on the job type (i.e. LAMMPS vs. VASP), we could easily assign different resources.

It's definitely possible for this setup to align with my definition for functional nodes, but I'm not sure it's the most efficient. For instance, POTENTIAL does need to see an Atoms object; I'm not totally comfortable with POTENTIAL peeking at LAMMPS.input.structure, but avoiding this would require that the structure get copied (including when we serialize the graph state) in LAMMPS.output.structure.

I have an even deeper concern with the idea of "sending [the MD job] to the server". From my perspective it would be un-functional if SERVER.execute() updated MD.output. Again, this is avoidable if SERVER doesn't touch the MD object itself, but internally makes a copy of anything it needs to copy (like using a ref job) and writes all of the job's output to its own output field. In principle this is possible, and I'm not 100% closed to putting SERVER at the very end, I just think that we would need to then do a bunch of work to make sure that the SERVER.output formats nicely and is usable, regardless of what type of calculation we pass to SERVER.input.calculation.

Ultimately, from a "functional perspective" I'm unconcerned if changing i1 in node1(i1, i2) results in an invalid input combination (or gets automatically patched over to change i2, as long as i2 is free-floating and not wired up to any other node's output); rather, I don't want node1.update() to have an impact on anything other than node1.output, i.e. not node7.output, and certainly not node7._internal_state.
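That locality rule is easy to state in code; a sketch with illustrative names:

class Node:
    """update() may fail on invalid input, but writes only to this node's output."""

    def __init__(self, **inputs):
        self.inputs = inputs
        self.output = None

    def compute(self, **inputs):
        raise NotImplementedError  # a pure function of the inputs

    def update(self):
        try:
            self.output = self.compute(**self.inputs)
        except Exception:
            self.output = None  # never touch node7.output or anyone's _internal_state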

JNmpi commented 2 years ago

@liamhuber, thanks for formalising the discussion. Below are a few thoughts on the various issues:

> a node should be purely functional, i.e., node(i1, i2) should rely on independent inputs i1 and i2, not on something like i1 being a function of i2, i.e., i1 = i1(i2)

> From my perspective this is simply a matter of certain input combinations being (in)valid, which is totally fine in a functional paradigm. In this view, our current Lammps node then just provides a helpful UI layer on top of this functional node so that the user can avoid these impossible combinations.

I realize that my notation was insufficient. The main point I wanted to make is that i2 depends not only on the input i1 but also on the node n1 itself, i.e.

i2 = i2(i1, n1)

To be specific, the potential depends not only on the structure, but also on the calculator, i.e., on whether Lammps, VASP etc. is used. Preventing the input from depending on the node in which it is used was the starting point for my thoughts. Whether an input i1 depends on the node itself or not depends on the abstraction level. If we just want to use code-specific input we don't need it. However, for generic constructs like MD we need to know the node (code) type to translate the generic input into a code-specific one. Whenever we have such a situation (which imo is the case for any generic input) we need to resolve the implicit node dependence into an explicit one, i.e.:

n1(i1) -> n2(i2)

In our case n1 would be the calculator (LAMMPS, VASP etc.) and n2 the potential.
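A sketch of that resolution, where n1's output explicitly carries what n2 needs (make_job is a hypothetical factory; list_potentials follows the earlier pseudo code):

def n1_calculator(structure, code="Lammps"):
    """n1: builds the code-specific job and exposes its valid choices."""
    job = make_job(code, structure)    # hypothetical factory
    return job, job.list_potentials()  # output feeds n2 explicitly

def n2_potential(choices, index=0):
    """n2: a pure function of input it received explicitly from n1."""
    return choices[index]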

These considerations also address your second point:

> It's definitely possible for this setup to align with my definition for functional nodes, but I'm not sure it's the most efficient. For instance, POTENTIAL does need to see an Atoms object; I'm not totally comfortable with POTENTIAL peeking at LAMMPS.input.structure, but avoiding this would require that the structure get copied (including when we serialize the graph state) in LAMMPS.output.structure.

In practice, the potential node does not need to peek into the structure; rather, node n1 provides a function to create the input for node n2. Again, for the case of the potential this would simply be the list_potentials() function. n2 would be a high-level container that allows the user to select only potentials that are consistent with the code described by node n1.

Thinking about this, I don't see a difference from the MD node. If we used the LAMMPS-specific notation/input file, the same argument would apply as for the specific potential, i.e., it would be an input and not a separate MD module.

liamhuber commented 1 year ago

Thanks for clarifying, @JNmpi! The explicit example with a Potential node as basically just a container was super helpful. Actually I am slowly becoming convinced that there is no fundamental implicit dependence going on either way, although I agree that such implicit dependence would be bad and we need to be careful not to add it where it doesn't need to be. So I think both the "implicit" case where Lammps takes a potential as an input and the "explicit" case where the Potential node exists by itself are fundamentally equivalent.

Suppose, in the case that the Lammps node directly takes structure and potential as inputs, we enter a Cu structure and an Al potential; this is simply an invalid combination of input, and the node won't generate a valid output object. If the potential input is explicitly provided on the graph, e.g. from the output of a String node or something, then it's game over and the Lammps node won't generate output. On the other hand, if that input port is open, then we could be (and currently are) user-friendly and update it to the 0th choice of available potentials for the current structure. I have to confess, this is really getting close to the line for a "functional node". In principle the node should not carry any internal state, and my argument here is that whether or not it has a particular port connection is external graph state. Even this level of hand waving is, in principle, avoidable by leaving the node input in the "last state" and just letting the output fail, then forcing the user to select a new, valid potential. Automatically updating the (not connected) input effectively just skips that last human-involved step, so I'm comfortable with it.

The catch is, I don't see a big difference with the "explicit" version where it's broken down to Structure -> Engine -> Potential. In practice, the Potential node really has two inputs: available_choices and choice_index. So if we first feed it an engine that has many choices, select one with a high index, and then feed it a smaller set of choices from a different engine, then we're in exactly the same place we were before -- technically the node output is now broken (e.g. forced to be None), and in practice it's very user-friendly to just update choice_index to 0.

So in both cases it's always possible to recover a sort of "pure functionality" by simply letting the nodes break when one input updates to be incommensurate with another one, but in both cases there is a user-friendly "hand-wavingly functional" approach where (iff the node input in question is not explicitly wired to another node's output) we automatically update the input to something useful.
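A sketch of that "hand-wavingly functional" behaviour for a container node like Potential, with available_choices and choice_index as the two real inputs (everything here is illustrative):

class ChoiceNode:
    def __init__(self):
        self.available_choices = []
        self.choice_index = 0
        self.index_is_connected = False  # wired to another node's output?
        self.output = None

    def set_choices(self, choices):
        self.available_choices = choices
        if self.choice_index >= len(choices) and not self.index_is_connected:
            self.choice_index = 0  # user-friendly: fall back to the 0th choice
        self.update()              # otherwise strictly functional: just break

    def update(self):
        try:
            self.output = self.available_choices[self.choice_index]
        except IndexError:
            self.output = None  # incommensurate input -> no valid output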

The good news is that if both the "implicit" and "explicit" topologies really are equivalent in this way, we can just be super pragmatic in choosing how we implement it. In this case, I think the connection between a Lammps classical potential and a Vasp pseudo-potential is actually quite weak, and I'd just keep them split up as inputs to their respective engines. (Otherwise what would we do with, e.g., Wien2k? It feels like one of these codes, but wouldn't make sense as input to a Potential node.)

Ryven itself is not fundamentally functional at all; for instance when we naively set up the BulkStructure node, if you had a cubic Al cell, and tried to update the species to Mg, then the input does change to Mg but the output fails and stays as an Al cell. Since the initial cell could have been any (cubic) structure, you get (semi)arbitrary output with the same input element=Mg, cubic=True. I fixed this by re-setting the output to None when the structure generation raises an error. I think this is a pattern we should adopt more widely.

In the latest branch (#131) I am now explicitly doing type checking on ports, and have added an attribute all_input_is_valid to nodes, which makes it easier to write nodes in this way. E.g., just adding if not self.all_input_is_valid: SET_ALL_OUTPUT_TO_NONE at the end of update already covers a lot of non-functional cases.
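Paraphrased as code -- an illustrative shape of that pattern, not verbatim ironflow internals:

class SketchNode:
    """Illustrative only; real ironflow node classes differ in detail."""

    def set_all_output_to_none(self):
        for port in self.outputs:
            port.val = None

    def update(self):
        ...  # the node's normal computation, writing to its output ports
        # Appended guard: never leave stale output behind invalid input
        if not self.all_input_is_valid:
            self.set_all_output_to_none()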

Right now this type checking is still very primitive, based on classes and values, but soon I'll start on the NumFocus "ontologicalization" work and we'll see how much power we can milk out of it.

JNmpi commented 1 year ago

Thanks @liamhuber. I fully agree that the two cases are fundamentally equivalent. Since both have their pros and cons we should provide both options to the user. This approach would follow the Python philosophy -- provide all the modern concepts of computer languages rather than forcing users into a certain paradigm (e.g. functional only). In this way users can come up with best-practice examples and decide what works best for them.

It may be helpful to remember how we got into this issue: it started with your idea to put MD after the Lammps job. At first I didn't like it, since it makes the code in the jupyter notebook longer and less compact. However, the more I thought about it the more I realised that this is the way to go when attempting to replace specific input by generic (input) modules. While these generic modules are great for our workflow concepts, it is probably a good idea to also provide the more compact input option to please the more application-oriented users.

I also thought a bit about use cases where one or the other formulation would be more useful. A particularly important use case for workflows is parametric studies, where loops over various input values have to be performed. For the potential case, a specific scenario would be to go over all available potentials for a given atomic structure. In the formulation

structure -> AtomicCalculator -> Potential

one could add an optional input all. If this parameter is set to True, the Potential node would create and run all corresponding (e.g. Lammps) jobs. In this respect the Potential node is not only a container for list_potentials but is derived from something like a ParallelMaster, i.e., it can create and run these jobs.

Directly related to the above thoughts, it may be helpful to implement a new data type Batch or BatchData. If a job-creating node receives such a variable as input, it would create a bunch of parallel jobs. An example would be a batch of structures (e.g. our structure container) or the above-discussed set of potentials. The advantage of this approach would be that we don't need any explicit loops or indexing, i.e., we get by with easy and intuitive graphs. The possibility to avoid explicit indexing is similar to what numpy does and is the basis for very compact and well-structured code.

To summarise the above thoughts, the following workflow example would be possible:

Batch(StructureContainer) -> Batch(AtomicCalculators) -> Batch(Potentials)

would create and run a three-dimensional set of jobs going over all structures, atomic calculators, and available potentials for each calculator, without having to set up a single for loop.
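A sketch of how such implicit, numpy-style broadcasting over Batch inputs could look; Batch and run_node are illustrative names, not existing pyiron objects:

from itertools import product

class Batch(list):
    """Marker type: a list-like input that requests broadcasting."""

def run_node(node_function, **inputs):
    """Call node_function once per combination of Batch inputs, with no
    explicit loops or indexing appearing in the workflow graph itself."""
    batched = {k: v for k, v in inputs.items() if isinstance(v, Batch)}
    if not batched:
        return node_function(**inputs)
    keys = list(batched)
    results = Batch()
    for combo in product(*(batched[k] for k in keys)):
        results.append(node_function(**{**inputs, **dict(zip(keys, combo))}))
    return results

# e.g. run_node(make_job, structure=Batch(structures), potential=Batch(potentials))
# spans the full structure x potential grid without a single hand-written for loop.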

liamhuber commented 1 year ago

@JNmpi super! I think we are basically in agreement. It was helpful for me to go through this in detail though -- I was not even thinking in terms of "functional" nodes, but even if we present users with a fancy UI that hides this aspect, it will still be helpful to design (most if not all) nodes from a functional perspective.

I think we're still on slightly different pages with Potential, but I feel this is more of a difference of opinion in the underlying ontology than in ironflow; your point about being able to parametrically scan over potentials is very well taken. I think we can defer most of that to a future discussion on atomistics ontology.

I really like the Batch idea. It's quite similar to something that I worked on over in pyiron_contrib.protocol, where we had List vertices to wrap other vertices -- the input parameters would then either be singular values which were sent to the wrapped node directly, or list-like parameters with the same length as the number of node children that got broadcast to each child separately. Here we want users to be able to build batched nodes directly from the GUI instead of in code, which differentiates it from the old Protocol work, and the way you describe it is better for this context.