pyiron / pyiron_nodes

Prototype node library for pyiron_workflows
BSD 3-Clause "New" or "Revised" License

init vasp node commit #40

Open ligerzero-ai opened 1 month ago

ligerzero-ai commented 1 month ago

initial draft of vasp nodes

currently uses pymatgen Structure as the base structure object, but this can easily be substituted with Atoms.

Will do in a later pull
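
As a rough illustration of that substitution (assuming the target is ASE's Atoms; none of this is in the PR), pymatgen already ships an adaptor in both directions, so the swap should mostly be a conversion at the node boundaries. The FCC Cu cell below is just a placeholder:

```python
# Minimal sketch, not part of this PR: converting between the pymatgen Structure
# currently used as the base object and ASE Atoms via pymatgen's own adaptor.
from pymatgen.core import Lattice, Structure
from pymatgen.io.ase import AseAtomsAdaptor

structure = Structure(
    Lattice.cubic(3.6),
    ["Cu"] * 4,
    [[0, 0, 0], [0.5, 0.5, 0], [0.5, 0, 0.5], [0, 0.5, 0.5]],
)
atoms = AseAtomsAdaptor.get_atoms(structure)      # pymatgen Structure -> ase.Atoms
roundtrip = AseAtomsAdaptor.get_structure(atoms)  # ase.Atoms -> pymatgen Structure
```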

ligerzero-ai commented 1 month ago

Okay, this should be ready. If anyone wants to take a look, could they do so by the end of the week? I'd like to get this merged soon.

codacy-production[bot] commented 1 month ago

Coverage summary from Codacy

See diff coverage on Codacy

| Coverage variation | Diff coverage |
| ------------- | ------------- |
| ✅ +9.05% (target: -1.00%) | ∅ (not applicable) |

**Coverage variation details**

| | Coverable lines | Covered lines | Coverage |
| ------------- | ------------- | ------------- | ------------- |
| Common ancestor commit (07e2fcee62883539239aea0f8cb68e70aedb60dd) | 1246 | 2 | 0.16% |
| Head commit (2189bbd6b0c9375bc83527898dc4c5e7b25f5cb1) | 1401 (+155) | 129 (+127) | 9.21% (**+9.05%**) |

**Coverage variation** is the difference between the coverage for the head and common ancestor commits of the pull request branch: `9.21% - 0.16% = +9.05%`.

**Diff coverage details**

| | Coverable lines | Covered lines | Diff coverage |
| ------------- | ------------- | ------------- | ------------- |
| Pull request (#40) | 0 | 0 | **∅ (not applicable)** |

**Diff coverage** is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: `covered lines / coverable lines * 100%`.


coveralls commented 1 month ago

Pull Request Test Coverage Report for Build 10638453415

| Totals | Coverage Status |
| ------------- | ------------- |
| Change from base Build 10638359757: | 4.4% |
| Covered Lines: | 646 |
| Relevant Lines: | 1709 |

💛 - Coveralls
liamhuber commented 1 month ago

Nice, @ligerzero-ai! 🚀

The resulting VASP node looks nice, and I appreciate that there were unit tests for me to run against when I was merging in main. It still needs to be run through a linter, and please take a look at my various "best practices" ideas, but overall I'm quite happy with the VASP aspects.

One thing that I think is critical, but which we might get away with postponing to a future PR, is to compare and contrast the more generic pieces (making a working directory, running a shell command, and your Storage class) with existing tools for this -- my scipy2024 stuff, but even more some nodes Joerg has directly in this repo. At a minimum, I think we need to restructure your more generic tools out of the vasp sub-sub-module and into a higher-level place, but ideally I would like to merge your ideas here with existing tools from me and Joerg. It needs to happen eventually -- soon, even -- but I can be a bit flexible about whether it happens in this PR or a subsequent one. I need to go review Joerg's nodes to refresh my memory before making any more explicit comments on what/how to merge things.

My only other high-level concern is: do we need to worry about any copyright issues with hosting a bunch of pseudopotentials here? I have a vague memory of certain PPs being part of the VASP license, and I don't want to redistribute stuff we have no right to. Otherwise the inclusion and usage in the tests is good.

ligerzero-ai commented 1 month ago

Thanks Liam,

The shell nodes and some of the more generic ones are taken from Joerg's LAMMPS stuff. I have modified them slightly where appropriate, but they're mostly copy-paste. I did it this way because I don't want any dependencies at this very early stage of the modules. When things begin to solidify, we can think about a nice architecture for the node library with respect to the generic nodes.

Agreed about leaving the move of the generic nodes to a new home for a (soon to be made) PR.

The potentials are mockups - we don't actually upload any of the real pseudopotentials. If you managed to spot a real one, let me know where?

review-notebook-app[bot] commented 1 month ago

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

ligerzero-ai commented 1 month ago

Ah, I see one that slipped through - I've played around with rebase ... hopefully I haven't broken anything.

samwaseda commented 1 month ago

Can this already be submitted to the cluster?

ligerzero-ai commented 1 month ago

> Can this already be submitted to the cluster?

No, but you can insert this as a python program in a submission script and it will work. I assume that is not what you are looking for, though.

The problem is still the serialisation of nodes and the communication between job allocations and the queueing system. I have no idea how that will be achieved, particularly for individual parts of a workflow, which need to communicate with other nodes.

It is simple enough to patch a simple queue submission into the run_job functionality here to submit a node to the queue, but this does not play nicely with the rest of the workflow functionality, which requires communication of node dependencies.

I think an appropriate solution would simply be to hash the individual nodes' inputs and check whether the corresponding workdir names have already been generated, check the convergence of the job in each node every time the workflow is called, and have a monitoring process (that could live inside another 1-core allocation) that sleeps and wakes periodically to progress the workflow, instead of having two-way communication living inside the nodes. So a "pull only" kind of workflow.

This is a WIP, and not implemented yet.
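
To sketch what that pull-only pattern could look like (purely illustrative; none of the helpers below are real pyiron_workflow API, and the node handles, convergence check, and queue submission are placeholders):

```python
# Rough sketch of the "pull only" idea above, not an implementation.
import hashlib
import json
import time
from pathlib import Path


def node_workdir(name: str, inputs: dict, root: Path = Path("calcs")) -> Path:
    """Derive a stable working directory from a hash of the node's inputs."""
    digest = hashlib.sha256(
        json.dumps({"node": name, "inputs": inputs}, sort_keys=True, default=str).encode()
    ).hexdigest()[:12]
    return root / f"{name}_{digest}"


def is_converged(workdir: Path) -> bool:
    """Placeholder convergence check (e.g. parse outputs); here just a marker file."""
    return (workdir / "DONE").exists()


def submit_to_queue(node, workdir: Path) -> None:
    """Placeholder for an sbatch/qsub call writing into workdir."""
    (workdir / "SUBMITTED").touch()


def pull_step(nodes) -> bool:
    """One pass: launch ready, unconverged nodes; return True when all are done."""
    all_done = True
    for node in nodes:  # hypothetical node handles with .name, .inputs, .ready
        workdir = node_workdir(node.name, node.inputs)
        if is_converged(workdir):
            continue
        all_done = False
        if node.ready and not workdir.exists():  # dependencies satisfied, not yet launched
            workdir.mkdir(parents=True)
            submit_to_queue(node, workdir)
    return all_done


def monitor(nodes, interval_s: int = 300) -> None:
    """The 1-core monitoring process: sleep, wake, progress the workflow."""
    while not pull_step(nodes):
        time.sleep(interval_s)
```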

liamhuber commented 1 month ago

> but you can insert this as a python program in a submission script and it will work.

I guess using the NodeJob class and setting this as the node will also work as a shortcut for that, but I haven't tested it on the cluster.
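
Very loosely, something like the following -- the NodeJob import path and the way the node gets attached are from memory and may well be wrong; only the Project/create_job/run pattern is standard pyiron_base usage:

```python
# Untested sketch: wrapping a workflow node in a pyiron_base job for queue submission.
from pyiron_base import Project
from pyiron_workflow.job import NodeJob  # assumed import location

pr = Project("vasp_node_test")
job = pr.create_job(NodeJob, "vasp_node_job")
job.node = my_vasp_node          # assumption: however NodeJob expects the node attached
job.server.queue = "some_queue"  # whatever queue is configured for the cluster
job.run()
```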

> The problem is still the serialisation of nodes and the communication between job allocations and the queueing system. I have no idea how that will be achieved, particularly for individual parts of a workflow, which need to communicate with other nodes.

Nodes pickle (or cloudpickle, if their IO data is not pickle-able) just fine, so serialization is not a problem. The actual flow of a multi-node object is handled by Composite, a common ancestor to both Workflow and Macro.
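
For illustration, the fallback pattern is no more than this (with a throwaway object standing in for a node; cloudpickle is a third-party package):

```python
# Pickle-with-cloudpickle-fallback, as described above.
import pickle

import cloudpickle  # pip install cloudpickle


def dumps_node(node) -> bytes:
    """Serialize with pickle, falling back to cloudpickle for un-picklable IO data."""
    try:
        return pickle.dumps(node)
    except (pickle.PicklingError, AttributeError, TypeError):
        return cloudpickle.dumps(node)


def loads_node(raw: bytes):
    # cloudpickle output is a standard pickle stream, so plain pickle.loads works
    return pickle.loads(raw)
```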

> It is simple enough to patch a simple queue submission into the run_job functionality here to submit a node to the queue, but this does not play nicely with the rest of the workflow functionality, which requires communication of node dependencies.

Inside run there is already a separation between the full workflow (including data and execution dependencies) and the operation of the single node (whether it's a function node or a macro node). This makes the individual nodes parallelizable using any executor following the concurrent.futures.Executor format, in particular the submit method returning a concurrent.futures.Future object. The payload passed off to the executor winds up being independent of the rest of the workflow, and the results are processed with a callback method.
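
In plain stdlib terms, the contract being relied on looks like this (`run_vasp_node` is a made-up stand-in for whatever self-contained payload a node hands off):

```python
# Anything with a submit() method returning a concurrent.futures.Future fits the
# executor contract described above; results come back via a callback.
from concurrent.futures import Future, ProcessPoolExecutor


def run_vasp_node(inputs: dict) -> dict:
    """Hypothetical self-contained node payload (no references back to the workflow)."""
    return {"energy": -1.23, "inputs": inputs}


def process_result(future: Future) -> None:
    """Callback the workflow would use to push the finished result back onto the node."""
    print("node finished:", future.result())


if __name__ == "__main__":
    with ProcessPoolExecutor() as executor:
        future = executor.submit(run_vasp_node, {"encut": 520})
        future.add_done_callback(process_result)
```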

My ideal solution would thus be to create a concurrent.futures.Executor-compatible interface to the HPC queue. I have not yet dug into pysqa or executorlib at all, so I don't know how close or far our existing tools are from this solution. In this paradigm you would run any parent workflow locally (i.e. on the head node) or as a single-process job, and it would stay alive to manage the workflow (including cyclic graphs), while the heavy work for individual nodes is done on HPC.

The big con here is needing to keep that single-core job alive for a long time; it's not computationally expensive and could mostly be sleeping, but I can see it being a pain if walltimes for a parent workflow bump into queue waiting times for a child. Such a solution also requires being able to restart from a partially-executed workflow, e.g. in case the parent job crashes and you need to catch up to the queue-submitted job again. That does not worry me, since we already have some support for this (save-after-run checkpointing, input caching, readiness checks to stop re-running a currently-running node, ...), and the "HPCExecutor" could internally manage not re-submitting itself but rather finding an existing job in the queue.

There is no problem serializing the data on a given node for such an operation; for this short-term storage (cloud)pickle is totally sufficient. In fact, concurrent.futures.ProcessPoolExecutor already leverages pickle to do exactly this, it just serializes in memory instead of to the file system; I believe any "HPCExecutor" would need to adapt the idea to more persistent filesystem serialization but could otherwise still rely on (cloud)pickle.
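
To make the shape of such an "HPCExecutor" concrete, here is only a sketch, not working code for any real queue: the paths, the sbatch usage, and the file-polling strategy are all assumptions.

```python
# Sketch of an Executor-like object that serializes the payload to the filesystem
# with cloudpickle, hands it to SLURM, and resolves a Future when the result appears.
import subprocess
import threading
import time
from concurrent.futures import Future
from pathlib import Path

import cloudpickle  # pip install cloudpickle

RUNNER = """\
import cloudpickle, sys
with open(sys.argv[1], "rb") as f:
    fn, args, kwargs = cloudpickle.load(f)
with open(sys.argv[2], "wb") as f:
    cloudpickle.dump(fn(*args, **kwargs), f)
"""


class SlurmExecutorSketch:
    def __init__(self, workdir: str = "hpc_jobs"):
        self.workdir = Path(workdir)
        self.workdir.mkdir(exist_ok=True)
        (self.workdir / "runner.py").write_text(RUNNER)

    def submit(self, fn, *args, **kwargs) -> Future:
        tag = f"job_{time.time_ns()}"
        payload = self.workdir / f"{tag}.pkl"
        result = self.workdir / f"{tag}.out.pkl"
        with open(payload, "wb") as f:
            cloudpickle.dump((fn, args, kwargs), f)
        script = self.workdir / f"{tag}.sh"
        script.write_text(
            "#!/bin/bash\n"
            f"python {self.workdir / 'runner.py'} {payload} {result}\n"
        )
        subprocess.run(["sbatch", str(script)], check=True)  # assumes SLURM is available
        future: Future = Future()
        threading.Thread(target=self._poll, args=(result, future), daemon=True).start()
        return future

    def _poll(self, result: Path, future: Future, interval_s: int = 30) -> None:
        """Resolve the Future once the result file shows up on the filesystem."""
        while not result.exists():
            time.sleep(interval_s)
        with open(result, "rb") as f:
            future.set_result(cloudpickle.load(f))
```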

My last understanding of the direction executorlib is going is that the entire workflow would be analyzed at submission time for dependencies, the execution dependencies mapped directly onto the HPC jobs themselves, and the entire workflow submitted to the queue at once with one HPC job per node, letting SLURM (or whatever) manage the execution order. I think it would take some extra work to get the jobs passing data to and from each other with per-node filesystem serialization, but this is not a fundamental obstacle. The advantage is that you can ship the entire (DAG) workflow off to the queue and walk away with no processes running.

The two disadvantages I see are:

(a) The limitation to DAGs. pyiron_workflow is already a workflow manager, so passing off the entire workflow management to some other tool in the special case of submitting to HPC is a bit uncomfortable for me, and you immediately lose things like our ability to handle cyclic graphs, which was specially built into the core of the workflow code.

(b) Having (a potentially very large number of) single-core jobs for all the little fast nodes, each needing to sit in the queue, rather than a single master process running all the time that can manage these quick steps by itself without passing them to any executor at all. It's possible there is a solution to this, but it is not clear to me, and I would be surprised if any such solution looked substantially different from the first approach, where we just keep the parent workflow running on a single node.
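
For contrast, that executor-free direction would look roughly like the following sketch. This is not executorlib's actual API; the graph format and the per-node submission scripts are placeholders, and only the sbatch flags are real SLURM.

```python
# Walk a DAG once at submission time and translate each edge into a SLURM
# --dependency=afterok relation, so the scheduler manages the execution order.
import subprocess

# node name -> (submission script, upstream node names); placeholder workflow
dag = {
    "relax":  ("relax.sh",  []),
    "static": ("static.sh", ["relax"]),
    "dos":    ("dos.sh",    ["static"]),
}


def submit_dag(dag: dict[str, tuple[str, list[str]]]) -> dict[str, str]:
    """Submit every node as its own SLURM job, wiring dependencies via afterok."""
    job_ids: dict[str, str] = {}
    for name, (script, upstream) in dag.items():  # assumes dict order respects the DAG
        cmd = ["sbatch", "--parsable"]  # --parsable prints just the job id
        if upstream:
            cmd.append("--dependency=afterok:" + ":".join(job_ids[u] for u in upstream))
        cmd.append(script)
        job_ids[name] = subprocess.run(
            cmd, check=True, capture_output=True, text=True
        ).stdout.strip()
    return job_ids
```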

codacy-production[bot] commented 1 month ago

Coverage summary from Codacy

See diff coverage on Codacy

| Coverage variation | Diff coverage |
| ------------- | ------------- |
| ✅ +4.40% (target: -1.00%) | ∅ (not applicable) |

**Coverage variation details**

| | Coverable lines | Covered lines | Coverage |
| ------------- | ------------- | ------------- | ------------- |
| Common ancestor commit (99c56c352bcc93d0d9122a12ebce98fd2d70dee2) | 1554 | 519 | 33.40% |
| Head commit (87af48429185bf634d4bbc875e6869b7843f60fc) | 1709 (+155) | 646 (+127) | 37.80% (**+4.40%**) |

**Coverage variation** is the difference between the coverage for the head and common ancestor commits of the pull request branch: `37.80% - 33.40% = +4.40%`.

**Diff coverage details**

| | Coverable lines | Covered lines | Diff coverage |
| ------------- | ------------- | ------------- | ------------- |
| Pull request (#40) | 0 | 0 | **∅ (not applicable)** |

**Diff coverage** is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: `covered lines / coverable lines * 100%`.


liamhuber commented 3 weeks ago

@ligerzero-ai I took the liberty of merging main into this branch again in #67 (it was a simple fast-forward merge), so remember to sync your local branch with the remote here before doing further work!