pnnl / PTMPSI

Add checkpointing capability to GROMACS for jobs that exceed a system's max walltime. #22

Closed: neffrw closed this 3 weeks ago

neffrw commented 4 weeks ago

Scott reported an issue where Frontier's max walltime is too short for the GROMACS job. We need to add functionality in PTM-Psi to allow for job restarting. GROMACS supports job checkpointing, but waiting until the max walltime to see whether a job finished would mean the re-queued job has to wait in the queue all over again.
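For context, the relevant `gmx mdrun` flags are `-cpt` (checkpoint write interval in minutes), `-cpi` (checkpoint file to resume from), and `-maxh` (exit cleanly shortly before the requested walltime). Below is a minimal sketch of a checkpoint-aware mdrun line; the `srun gmx_mpi` launcher and the file names are placeholders, not what PTM-Psi actually emits:

```python
def mdrun_command(deffnm: str, maxh: float) -> str:
    """Build a checkpoint-aware `gmx mdrun` line for a batch script.

    -cpt 15 : write a checkpoint every 15 minutes
    -maxh   : stop cleanly at ~99% of the requested walltime
    -cpi    : resume from <deffnm>.cpt when it exists; on the first run
              the file is absent and mdrun starts from step 0
    """
    # "srun gmx_mpi" is a placeholder launcher/binary for illustration only
    return f"srun gmx_mpi mdrun -deffnm {deffnm} -cpt 15 -maxh {maxh} -cpi {deffnm}.cpt"

# e.g. a production MD step inside a 24 h allocation
print(mdrun_command("md", maxh=24.0))
```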

Challenges

Objectives

Solution (via @dmejiar)

I am thinking that in the general case we could queue two jobs: 

the first one will run everything up to the line with -deffnm fnpt, will read all fnpt.log files (in case multiple runs are bundled together), and will write an estimate of how many restarts are needed.

the second one will be dependent on the first job, will read the file with the restart-count estimate, will queue as many restart jobs as needed (each one with the appropriate dependency), and will start the first MD run.

In this way, the restart jobs will age in the queue for at least 24 h. 

Everything up to -deffnm fnpt should take no more than an hour, and we will get a good idea of the performance there (in terms of ns/day). We can use this information to estimate the execution time of the remaining jobs and queue the appropriate "backup jobs" as needed.
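A minimal sketch of that estimate, assuming the "Performance:" line in each fnpt.log is representative of the production run; the target length, the per-job walltime, and taking the slowest bundled run as the bottleneck are all assumptions made here for illustration:

```python
import math

def ns_per_day(logfile: str) -> float:
    """Read the ns/day figure from the 'Performance:' line of a GROMACS log."""
    perf = None
    with open(logfile) as fh:
        for line in fh:
            if line.startswith("Performance:"):
                perf = float(line.split()[1])  # first column is ns/day
    if perf is None:
        raise RuntimeError(f"no Performance line found in {logfile}")
    return perf

def restarts_needed(fnpt_logs, target_ns: float, walltime_h: float = 24.0) -> int:
    """Estimate how many restart jobs are needed beyond the first MD job.

    The slowest bundled run sets the pace, since bundled runs share the
    same allocation.
    """
    slowest = min(ns_per_day(log) for log in fnpt_logs)
    hours_needed = target_ns / slowest * 24.0
    return max(0, math.ceil(hours_needed / walltime_h) - 1)

# hypothetical usage for two bundled runs aiming at 100 ns each:
# restarts_needed(["run1/fnpt.log", "run2/fnpt.log"], target_ns=100.0)
```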

neffrw commented 4 weeks ago

Plan:

  1. We queue the first job, which does everything up until fnpt.
  2. We also queue one md job and set that dependency. We store this job id in a file called "md.jobid" or something.
  3. When the first job finishes, it will use fnpt.log to estimate how many hours the full MD run needs, minus the hours already covered by the first md job.
  4. Then, it will read md.jobid and queue however many more jobs are needed with the appropriate dependency chains, starting from md.jobid (see the sketch after this list).
  5. Finally, it will either call the lambda jobs based on the final md job id, or not call them (depending on what the user prefers, this can be a boolean argument for gen_ptm_files or gromacs_options).

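Here is a minimal sketch of steps 2-4, assuming Slurm. sbatch's `--parsable` and `--dependency` flags are standard, but the restart script name, the use of `afterany`, and the md.jobid handling are placeholders for whatever PTM-Psi ends up generating:

```python
import subprocess
from typing import Optional

def submit(script: str, depends_on: Optional[str] = None) -> str:
    """Submit a batch script and return its job id (via sbatch --parsable)."""
    cmd = ["sbatch", "--parsable"]
    if depends_on is not None:
        # afterany: the restart must fire even if the previous job is
        # killed at the walltime rather than exiting cleanly via -maxh
        cmd.append(f"--dependency=afterany:{depends_on}")
    cmd.append(script)
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return out.strip().split(";")[0]  # --parsable may append ";cluster"

def queue_restarts(n_restarts: int, jobid_file: str = "md.jobid",
                   script: str = "md_restart.sbatch") -> str:
    """Chain n_restarts copies of the restart script behind the job id
    stored in md.jobid, then overwrite md.jobid with the final job id so
    downstream steps (lambda jobs, update_topology.py) can hang off it."""
    with open(jobid_file) as fh:
        last = fh.read().strip()
    for _ in range(n_restarts):
        last = submit(script, depends_on=last)
    with open(jobid_file, "w") as fh:
        fh.write(last + "\n")
    return last
```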
The only issue I see here is with update_topology.py. If, for whatever reason, the md jobs don't end up reaching the final step, then update_topology.py won't run and the lambda jobs will fail.

We could have it check Slurm to see how many "md-uuid" jobs are still pending, instead of running grep on md.log for the step count. If no more are pending, then it knows it's the last one and should run update_topology.py.

neffrw commented 4 weeks ago

Instead of checking Slurm, I will update the md.jobid file with the final job id to avoid a dependency on the type of job scheduler. If this equals its own job id, then it knows it's the last job and can run update_topology.py.
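A minimal sketch of that last-job check. Note that the SLURM_JOB_ID environment variable is Slurm-specific, so a scheduler-agnostic version would need to swap out that lookup; the update_topology.py call is only indicative:

```python
import os
import subprocess

def is_final_md_job(jobid_file: str = "md.jobid") -> bool:
    """True if this job's id matches the final id recorded in md.jobid."""
    with open(jobid_file) as fh:
        final_id = fh.read().strip()
    # SLURM_JOB_ID is set by Slurm inside the allocation; replace this
    # lookup per scheduler to keep the check scheduler-agnostic
    return os.environ.get("SLURM_JOB_ID") == final_id

if __name__ == "__main__" and is_final_md_job():
    # only the last job in the chain updates the topology, so the lambda
    # jobs always see a completed MD run
    subprocess.run(["python", "update_topology.py"], check=True)
```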