Standardized approach for job checkpoint/restart/continuation

jmmshn commented 2 years ago

Hi @mkhorton @davidwaroquiers @mjwen. Agreed that we need a proper discussion about restarting. This is something @jmmshn has opinions about too.

I have some ideas for how we could do it but each with their own tradeoffs. I'd be interested to hear if there were any good ideas from your workshop @mkhorton.

I will create a separate issue to discuss this further.

Originally posted by @utf in https://github.com/materialsproject/atomate2/issues/134#issuecomment-1184215723

jmmshn commented 2 years ago

I also posted this on the forum but I don't think that shows up here: https://matsci.org/t/atomate2-restart-for-long-running-hse-jobs/42345

mkhorton commented 2 years ago

Ok, I wanted to recap two possible approaches. Caveat that some of this is a carry over from FireWorks discussions, so may not be appropriate for jobflow.

(Numbering the following to make it more easy to reference in replies.)

Support for an individual job to enter/exit a checkpointed state
- Advantage maybe that this more clearly communicates the state of the job, and avoids a situation whereby a jobflow could become excessively long with multiple continuations.
Standardized interface for dynamic addition of continuation jobs
- Advantage here is that it is conceptually simpler, and perhaps fits better with the jobflow model.

For both approaches, there needs to be a standardized way to initiate a checkpoint (e.g., the approach we trialed previously was listening for a SIGUSR1 which would warn of an approaching walltime, since this is supported by several HPC systems), a way to then verify that the request to checkpoint has completed successfully, and to continue from the checkpoint.

Prior art:

FireWorks checkpoint PR https://github.com/materialsproject/fireworks/pull/423 (not merged, but functionally complete up until entering the checkpoint state)
A Custodian approach which is code-specific, which allows a job to automatically continue if it detects the presence of checkpoint files in the launch directory, but was not integrated into FireWorks

Some subtleties to think about:

Do we need to distinguish between a checkpointed state where it is required that the continuation happens on the same HPC system, e.g. due to large files present that cannot or might be cumbersome to include in the jobstore? (This might favor approach 1.) And, conversely, those continuations (e.g. storing a CONTCAR) that could be portable between HPC systems?
Do we need to make any special considerations for tools like MANA in the design, which may be available in the future?
Do we need to make any special consideration for HPC systems which have flex queues, and have mechanisms to auto re-submit a job if it does not complete?

Questions 6 + 7 I think are likely not relevant here, but I mention for completeness.

My own view here is that standardizing on a pattern and documenting that is more important than the specific approach taken, and that it is very important we get this right. Workshop takeaways were varied, but essentially we're not the only people having this issue, and it's a priority.

jmmshn commented 2 years ago

@mkhorton so I think there are two different problems to solve:

copying contcar to poscar when restarting relaxation runs.
checkpointing of the actual vasp calculation

The majority of compute time that I have been wasting for the last couple of years has been on long-running relaxation jobs where restarting a failed relaxation always starts from the initial structure so I think all relaxations jobs that did not finish can be considered in a "checkpointed" state even though they don't engage in with any formal checkpointing system other than the CONTCAR file. Solving 1. would fix the problem will lengthy relaxations but would not really help MD runs (unless they can be stitched together?) But that would basically solve all of my problems now.

mkhorton commented 2 years ago

I hear you, if we just want to concern ourselves with contcars/relaxations, it's a much easier problem to solve (questions of wasted compute due to badly-progressing optimizations aside), and dynamic addition of additional jobs seems the way to go. But I think the question remains of how to formalize this, what pattern to adopt, etc?

jmmshn commented 2 years ago

Yeah, I think we should have a quick chat about this as a group next week?

mjwen commented 2 years ago

I feel a big part is how to formalize it such that multiple jobflow jobs can be regarded as one, and later builders do not need to worry about how to stitch them together.

jmmshn commented 2 years ago

I feel a big part is how to formalize it such that multiple jobflow jobs can be regarded as one, and later builders do not need to worry about how to stitch them together.

This also gets complicated by the fact that (I assume) most "restarts" will be initialized from lpad rerun_wfs so maybe we just need a restart flag like ISTART in VASP that dictates how the calculation should behave given different available data in the directory. This can be something added to all the VASP makers.

EX.

For relaxation, you will just copy over CONTCAR to POSCAR if it is available (some additional consideration for what we call INCAR.orig might be needed ... but maybe not since we have already put so much effort to make all the builders provenance agnostic)
For long-running MD simulations, you might want to store the previous history in the task somewhere and just start from the most recent structure.

Just some suggestions but this is clearly a tough problem, but from experience, this is the thing that turns some defects people off of using Atomate (since they are all running HSE supercell calculations) so super interested in fixing this.

mkhorton commented 2 years ago

I feel a big part is how to formalize it such that multiple jobflow jobs can be regarded as one, and later builders do not need to worry about how to stitch them together.

I think this is an important point too -- e.g., if there's a relaxation that does not converge, does this still get entered into a database? There are both pros/cons to having unconverged (or even failed) entries in a database but it makes the scheme and builds more complicated. I know we have partial support for this already. The alternative with an explicit checkpointed state is that these jobs do not get parsed.

mkhorton commented 2 years ago

This also gets complicated by the fact that (I assume) most "restarts" will be initialized from lpad rerun_wfs so maybe we just need a restart flag like ISTART in VASP that dictates how the calculation should behave given different available data in the directory.

We're looking for a jobflow-based solution first and foremost, but for the discussion of how this integrates with FireWorks, I would suggest we would want a different command than just rerun_fws, e.g. we would want continuations to happen automatically without explicit user intervention.

mkhorton commented 2 years ago

this is the thing that turns some defects people off of using Atomate (since they are all running HSE supercell calculations)

In this case, this would be a continuation that could not use the contcar, correct? e.g. it would specifically need to checkpoint via stopcar mid-electronic relaxation. Therefore we encounter issues described in my point 5. above?

jmmshn commented 2 years ago

In this case, this would be a continuation that could not use the contcar, correct?

So in my experience calculations rarely timeout without any ionic step. Defect calculations are really bad because they:

Need HSE -> which sometimes requires many many electronic steps to get the first converged wavefunction
Needs to take many ionic steps because the ionic relaxation tends to be pretty extreme

On most reasonable clusters you'll actually get quite a few ionic steps in before walltime.

davidwaroquiers commented 2 years ago

Yeah, I think we should have a quick chat about this as a group next week?

Hi @mkhorton @utf @jmmshn @mjwen I'd be happy to participate to the discussion about this. If that's possible, do not hesitate to contact me by mail to setup a meeting.

David

utf commented 1 month ago

I wanted to bring this up again. Is this something you have thought about with jobflow-remote @gpetretto @davidwaroquiers?

With JFR, we know the amount of time left on the job, so we could implement some of @mkhorton ideas regarding wall time. One option would be to add another option to the @job decorator like @job(checkpoint=vasp_checkpoint_function), that could be responsible for parsing the partially complete outputs, and then resubmitting a replace job which continues the relaxation and stiches together the outputs at the end.

This would be a JFR specific feature but I don't think that should be a blocker.

gpetretto commented 1 month ago

Thanks @utf for bringing up the topic again. After discussing with @davidwaroquiers, here is what we came up with.

Given the typical use cases for jobflow's workflows, I would mainly consider 3 scenarios. I will take the example of DFT calculations, but I suppose this could also be considered more general.

1) The calculation finishes on its own within the time limit, but it is not completed. (In DFT for example SCF or relaxation was not converged) 2) The calculation will not finish on its own within the allocated time and is stopped by the Job. We can distinguish two subcases here, but the main point is that since the Job is still running it has the time to return a Response after the main task has been interrupted. a) The calculation may be stopped nicely, but it is not completed (e.g. create STOPCAR) b) The calculation cannot be stopped nicely and should be killed (whether because the time limit is too close or because the execution does not support). 3) The Job does not terminate as it hits the walltime and is thus killed by the queue manager.

Let us know if you think that more use cases should be considered.

Our view is that point 1 and 2 would not really need an addition of a checkpoint mechanism, nor it should require to be limited to jobflow-remote. The logic can be implemented in the Job itself after the calculation is stopped. The replace option in the Respose is the most suitable to deal with this. We can work on finding a way of standardizing the definition of the procedure, but the logic to generate a new step will still be dependent on the content of the Job and the kind of calculation it performs. A way of standardizing this could be to add an optional restart/replace method to the Maker that generated the Job. The Maker is available at runtime inside the Job, so it could be used to tailor the restart of the Jobs itself. Possibly using the outputs that are available in the running folder.

We are already handling cases 1 and 2/a for the Abinit workflows in atomate2 in a very similar way: https://github.com/materialsproject/atomate2/blob/fb9a6e80e0b0bfbdcdffd35a62063b49cfb9ee19/src/atomate2/abinit/jobs/base.py#L203. This seems the easiest way of handling most of the cases of unconverged calculations.

One point that you raised, and that would be relevant for case 2, is how to signal the job that the walltime is approching. You proposed to rely on jobflow-remote to check the state and send signals. While this should be technically feasible, I believe that such a solution could introduce complications and points of failures that are not strictly needed. Sending a signal from jobflow-remote's Runner to the job may not always be trivial. The connection between the runner and the worker could be down or the Runner could even be stopped, thus missing the time to send the signal. From our experience getting the remaining time from inside the job itself is relatively easy and would avoid all the complications and pitfalls of communications between the Runner and the Job. In the past we used the function below to get the end time of the job that is being executed to handle the approaching walltime.

def get_end_time_slurm(job_id):
    """
    Extracts the end time as seconds from the epoch in SLURM.

    Args:
        job_id (str): the job id.

    Returns:
        float: the seconds from the epoch corresponding to the end time of the job.
    """
    command = shlex.split("scontrol show job {}".format(job_id))
    p = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, encoding='utf-8')
    out, err = p.communicate()

    match = re.search(r"EndTime=(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})", out)
    if match:
        end = datetime.datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
        return end.timestamp()
    else:
        return None

with job_id = os.getenv("SLURM_JOB_ID"). Knowing the end time of the job is preferable than the walltime, since in a multi launch execution a jobflow Job may not be started at the beginning of the SLURM/PBS job. An additional advantage would be that each job could decide how much buffer leave before trying to stop the job, instead of relying on a standard value from an external entity that sends the signal. I have a similar function for PBS and I suppose that equivalent functions could be written for other queue managers as well. Getting the remaining time at the beginning of the Job allows it to plan for its execution (e.g. for VASP, setting the value for the WalltimeHandler). If you think this would not be reliable, an alternative would be to let jobflow-remote pass the walltime to the Job based on the selected resources. This will of course limit the usage of such a feature to jobflow-remote. In any case, the procedure could be standardized and we could try to always offer the "remaining time" as an information inside the current Job. In general I see that working with signals will allow a wider variety of actions, but it would seem that most of the issues could be solved by knowing the walltime in advance, and I am not sure if that would be worth the additional effort.

Point 3 of the initial list cannot of course be handled by jobflow itself. In that case an option would be to implement the checkpoint mechanism that you propose through a new option of the job decorator (or alternatively again through a method of the Maker). Jobflow-remote would be able to recognize that the job was interrupted, and, if the checkpoint option is defined, can enter a CHECKPOINT state to perform a "post-mortem" analysis based on the function. (This may require downloading some of the output files). From there it could generate a replace for the job, similarly to what could be done for the previous points. From a point of view of the analysis of the outputs, this would be similar to what needs to be performed for the case 2/b.

I don't know if all this will fit your ideas, but I hope this will help move forward the discussion.

davidwaroquiers commented 2 weeks ago

Thanks @gpetretto for gathering these use cases and options. Any thoughts about this @utf @mkhorton @utf @jmmshn @mjwen @janosh ?

materialsproject / atomate2

Standardized approach for job checkpoint/restart/continuation #156