jan-janssen / LangSim

Application of Large Language Models (LLM) for computational materials science - visit jan-janssen.com/LangSim
http://jan-janssen.com/LangSim/
BSD 3-Clause "New" or "Revised" License

tool that interfaces with scheduler for long-running tasks #33

Open ltalirz opened 6 months ago

ltalirz commented 6 months ago

Motivation

The current implementation of tools works for fast toy calculations, but scientifically relevant calculations in chemistry and materials science often make tradeoffs between compute cost and accuracy that result in calculations running for several hours or days, even on powerful hardware.

In the current implementation, the notebook is blocked for the duration of the calculation, and the calculation is killed once the IPython kernel is stopped.

We would therefore like langsim to be able to submit computationally intensive tasks to remote scheduling systems, check the status of these calculations, and retrieve the result once they have completed.
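To make the three operations concrete (submit, check status, retrieve result), here is a minimal sketch. It is not langsim code and does not talk to a real scheduler: `LocalJobRunner` and its method names are hypothetical, and a detached local Python process stands in for the remote job. A real implementation would replace the `subprocess` calls with calls to the scheduler.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


class LocalJobRunner:
    """Hypothetical stand-in for a scheduler interface.

    Each job is a separate local Python process whose output is written
    to a file, so the caller is not blocked while the job runs.
    """

    def __init__(self):
        self._workdir = Path(tempfile.mkdtemp(prefix="jobs_"))
        self._jobs = {}  # job id -> (process handle, output file)

    def submit(self, python_code: str) -> int:
        """Start the job in the background and return its id immediately."""
        job_id = len(self._jobs) + 1
        out_file = self._workdir / f"job_{job_id}.out"
        with open(out_file, "w") as fh:
            proc = subprocess.Popen(
                [sys.executable, "-c", python_code],
                stdout=fh,
                stderr=subprocess.STDOUT,
            )
        self._jobs[job_id] = (proc, out_file)
        return job_id

    def status(self, job_id: int) -> str:
        """Report the job state without waiting for the job."""
        proc, _ = self._jobs[job_id]
        return_code = proc.poll()
        if return_code is None:
            return "RUNNING"
        return "COMPLETED" if return_code == 0 else "FAILED"

    def result(self, job_id: int) -> str:
        """Wait for the job to finish and return what it printed."""
        proc, out_file = self._jobs[job_id]
        proc.wait()
        return out_file.read_text()


runner = LocalJobRunner()
job = runner.submit("print(6 * 7)")
print(runner.status(job))   # RUNNING or COMPLETED, depending on timing
print(runner.result(job))   # output of the job once it has finished
```

The point of the split is that `submit` returns right away and the notebook stays usable; only `result` waits.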

Thoughts

I think this is a tough one to make user friendly, particularly if you think about the original target audience: an experimentalist wanting to run calculations. Do we ask them to install Slurm on their local workstation (they may be running Windows)? Do they need to apply for compute time on an HPC resource (and then figure out how to run the simulation code they need there)? I think with such asks we already lose a large fraction of the target audience.

The only feasible way I see for letting someone without HPC expertise run on HPC is either

That said, adding the basic functionality for interacting with schedulers is certainly feasible, if the user can provide all necessary information (credentials for connecting, scheduler type, partitions you have access to, where codes are located, etc.).
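As a rough sketch of what "all necessary information" could look like in practice, the snippet below collects it in a dataclass and renders a Slurm batch script from it. `ClusterConfig`, `render_slurm_script`, and all field names and example values are hypothetical. Only the `#SBATCH` directives, `module load`, and `srun` are standard Slurm/environment-module usage. Other schedulers would need their own renderer.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterConfig:
    """Hypothetical container for what the user has to tell us."""

    hostname: str                 # login node to connect to
    username: str
    ssh_key_path: str             # credentials: path to a key, no password stored
    scheduler: str                # e.g. "slurm"
    partitions: list[str]         # partitions the user has access to
    code_modules: dict[str, str] = field(default_factory=dict)  # code name -> module name


def render_slurm_script(
    config: ClusterConfig,
    code: str,
    command: str,
    partition: str,
    walltime: str = "01:00:00",
    ntasks: int = 1,
) -> str:
    """Build a Slurm batch script for one calculation."""
    if config.scheduler != "slurm":
        raise ValueError(f"unsupported scheduler: {config.scheduler}")
    if partition not in config.partitions:
        raise ValueError(f"no access to partition: {partition}")
    if code not in config.code_modules:
        raise ValueError(f"unknown simulation code: {code}")

    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={code}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --ntasks={ntasks}",
        f"#SBATCH --time={walltime}",
        "",
        f"module load {config.code_modules[code]}",
        f"srun {command}",
    ]
    return "\n".join(lines) + "\n"


# Example values are made up for illustration.
config = ClusterConfig(
    hostname="login.example.org",
    username="alice",
    ssh_key_path="~/.ssh/id_ed25519",
    scheduler="slurm",
    partitions=["cpu", "gpu"],
    code_modules={"lammps": "lammps/2023"},
)
print(render_slurm_script(config, "lammps", "lmp -in input.lmp", "cpu", ntasks=4))
```

Even this small example shows how much site-specific knowledge (partition names, module names) the user has to supply by hand.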

There is some light at the end of the tunnel, as academic HPC centers are also moving from giving users SSH access to offering REST APIs (example), but this process is still underway and to my knowledge no clear standard has emerged.

Also, none of the APIs I've seen so far offer a mechanism for discovering which simulation codes are installed and how to module load them... perhaps we could draft a specification for what we would like such an API to look like and then approach HPC centers with this idea.

[1] Or, if that is not available, some HPC cluster template with pre-installed software in standard locations (e.g. there are interesting efforts like the CernVMFS build cache from Compute Canada or also the spack build caches), but that already adds a lot of complexity.

chiang-yuan commented 6 months ago

I recommend prefect.io as a pythonic way to submit and monitor custom Python jobs! It also supports different job management systems and can be orchestrated both locally and on HPC.

jan-janssen commented 6 months ago

Over the last two years I worked on a library as part of the exascale project to address this challenge. We are currently in the process of merging the different parts together. Basically, it follows the concurrent futures executor design from the Python standard library and extends it with the option to assign HPC resources like GPUs, MPI-parallel codes and thread-parallel codes, as well as to use the future object of one function as an input of the next function to realise dependencies: https://pympipool.readthedocs.io/en/latest/examples.html#coupled-functions

This currently works inside the allocation of a given queuing system using the flux-framework scheduler, and we are extending it to run outside the queuing system: https://github.com/pyiron-dev/remote-executor/blob/main/example.ipynb In that case the queuing system handles all the dependencies of the individual tasks, so no daemon process is required.
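For readers unfamiliar with the pattern, here is a minimal sketch of "future of one function as input of the next" using only `concurrent.futures` from the standard library. This is not the pympipool API. The standard executor does not resolve future arguments on its own, so the hypothetical helper `submit_with_deps` does it explicitly, and it has no notion of GPUs, MPI, or a queuing system.

```python
from concurrent.futures import Future, ThreadPoolExecutor


def _resolve(value):
    """Replace a Future by its result; pass other values through unchanged."""
    return value.result() if isinstance(value, Future) else value


def submit_with_deps(executor, fn, *args, **kwargs):
    """Submit fn, waiting inside the worker for any Future arguments."""

    def task():
        resolved_args = [_resolve(a) for a in args]
        resolved_kwargs = {k: _resolve(v) for k, v in kwargs.items()}
        return fn(*resolved_args, **resolved_kwargs)

    return executor.submit(task)


def add(a, b):
    return a + b


def multiply(a, b):
    return a * b


with ThreadPoolExecutor(max_workers=2) as pool:
    first = submit_with_deps(pool, add, 1, 2)              # no dependencies
    second = submit_with_deps(pool, multiply, first, 10)   # depends on `first`
    print(second.result())  # 30
```

One limitation of this naive version is that a dependent task occupies a worker while it waits for its inputs. Handing the dependency tracking to the queuing system, as described above, avoids that.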