Plan:
The only issue I see here is with update_topology.py. If, for whatever reason, the md jobs don't reach the final step, then update_topology.py won't run and the lambda jobs will fail.
We could have it check Slurm for how many "md-uuid" jobs are still pending instead of grepping md.log for the step count. If none are pending, it knows it's the last one and should run update_topology.py.
Instead of checking Slurm, I will write the final job id to the md.jobid file to avoid a dependency on the type of job scheduler. If a job's own id equals the id in that file, it knows it is the last job and can run update_topology.py. A minimal sketch of this check is below.
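Something like the following, assuming the final md job's id is the last token written to md.jobid. The helper name `maybe_run_update_topology` is illustrative, not existing PTM-Psi code, and the `SLURM_JOB_ID` lookup would need an equivalent on other schedulers (the file comparison itself stays scheduler-agnostic):

```python
import os
import subprocess

def maybe_run_update_topology(jobid_file="md.jobid"):
    # Slurm sets SLURM_JOB_ID inside a running job; another scheduler
    # would supply its own id here, but the file-based comparison below
    # does not depend on the scheduler.
    my_jobid = os.environ.get("SLURM_JOB_ID")
    with open(jobid_file) as f:
        # Assumption: the last id written to md.jobid is the final md job.
        final_jobid = f.read().split()[-1]
    if my_jobid == final_jobid:
        # This is the last md job, so the topology can be rebuilt
        # before the lambda jobs start.
        subprocess.run(["python", "update_topology.py"], check=True)
```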
Scott reported an issue where Frontier's max walltime is too short for the GROMACS job. We need to add functionality in PTM-Psi to allow for job restarting. GROMACS supports job checkpointing, but waiting until the max walltime is hit to find out whether a job finished would mean the re-queued job has to wait in the queue all over again.
Challenges
Objectives
Solution (via @dmejiar)
Everything up to `-deffnm fpt` should take no more than 1 hour, and it gives us a good idea of the performance (in ns/day). We can use that measurement to estimate the execution time of the remaining jobs and queue the appropriate "backup jobs" as needed, as sketched below.
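A rough illustration of the estimate (not PTM-Psi code), assuming the rate is read from the `Performance:` line GROMACS writes at the end of its log, and that the remaining simulation length in ns is known. Function names and the chunking policy are assumptions:

```python
import math

def parse_ns_per_day(log_path):
    # GROMACS ends its log with a performance table whose "Performance:"
    # row lists ns/day first, then hour/ns (format assumed here).
    with open(log_path) as f:
        for line in f:
            parts = line.split()
            if parts and parts[0] == "Performance:":
                return float(parts[1])
    raise ValueError(f"no Performance line found in {log_path}")

def backup_jobs_needed(remaining_ns, ns_per_day, max_walltime_hours):
    # Hours required at the measured rate, rounded up to whole
    # walltime-sized chunks; each chunk beyond the job already running
    # is one checkpoint-restart "backup job" to queue.
    hours_needed = remaining_ns / ns_per_day * 24.0
    chunks = math.ceil(hours_needed / max_walltime_hours)
    return max(chunks - 1, 0)
```

For example, 50 ns remaining at 100 ns/day is 12 hours of work; with a 2-hour walltime cap that is 6 chunks, so 5 backup jobs would be queued behind the running one.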