libAtoms / workflow

python workflow toolkit
GNU General Public License v2.0

Recover after error #257

Open Felixrccs opened 1 year ago

Felixrccs commented 1 year ago

When I run large numbers of DFT calculations through wfl on the cluster, I run into the problem that I hit the wall-time and the job fails. In that case I have to go through the calculation directories and recover the calculations manually.

Would it be possible to add a feature where I can give wfl an internal wall-time, so that it stops after e.g. 20 h and returns one xyz file with the successful calculations and another with the failed/unfinished ones?

bernstei commented 1 year ago

I'm willing to think about how to deal with this, but my first instinct is that this will be surprisingly complex to implement.

Two xyz files are basically out of the question - I think the best we'll be able to do is a single output with something like an Atoms.info label for the successfully completed configurations, but presumably that's enough for you to split the file into two yourself.
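
(As a rough illustration of that kind of post-hoc split - the `Atoms.info` key name below is hypothetical, not an existing wfl convention:)

```python
# Sketch: split a combined output xyz by a per-config success flag.
# "calculation_succeeded" is a made-up info key, purely for illustration.
from ase.io import read, write

configs = read("output.xyz", ":")
ok = [at for at in configs if at.info.get("calculation_succeeded", False)]
failed = [at for at in configs if not at.info.get("calculation_succeeded", False)]

if ok:
    write("output_success.xyz", ok)
if failed:
    write("output_failed.xyz", failed)
```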

Another issue is that it'd be somewhat complex to set up a system that will stop in the middle of a calculation (you'd have to run the calculation in a separate process). If we don't, you'd need to know at calculation N that doing the N+1st will take too long, and give up before starting it. If you can make that kind of estimate, you can probably also just set up your jobs to not exceed your walltime anyway, I suspect. Again, doable, but not trivial.

Can you break up your job into many jobs (num_inputs_per_queued_job)? If you do that, then just rerunning the script with a larger walltime should just work: it will skip any jobs that completed successfully and only ask for more time when it resubmits the timed-out ones.
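
(For illustration, a rough sketch of that kind of job splitting via wfl's remote-job support; the import paths, `RemoteInfo` field names, and queue settings below are recalled from the wfl/ExPyRe docs and should be checked against the installed version - treat them as assumptions, not a working recipe:)

```python
# Rough sketch, not verified against a specific wfl version: run the
# autoparallelized evaluation as many small queued jobs so that rerunning
# the script only resubmits the ones that did not finish.
from wfl.autoparallelize import AutoparaInfo
from wfl.autoparallelize.remoteinfo import RemoteInfo  # import path may differ

remote_info = RemoteInfo(
    sys_name="my_cluster",            # ExPyRe system name from its config file
    job_name="vasp_reeval",
    resources={"max_time": "24h", "num_nodes": 1, "partitions": ["standard"]},
    num_inputs_per_queued_job=1,      # one queued job per configuration
)

# then passed to the evaluation op, e.g. (exact call depends on the wfl version):
# generic.calculate(inputs, outputs, calculator,
#                   autopara_info=AutoparaInfo(remote_info=remote_info))
```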

Maybe you should describe your use case more fully (number of configs, sizes of cells, heterogeneity of sizes, range of times per calculation, etc) to see if there's a way of dealing with it without adding this complexity.

bernstei commented 1 year ago

With some limitations, it may actually be more straightforward than I first thought. The actual timeout mechanism for the calculation is, I think (need to test the details), pretty straightforward with multiprocessing.pool.imap, which autoparallelize uses. My current challenge is that there's currently no access to the input configurations when looping over the outputs in autoparallelize, so no easy way to return an annotated input config when the timeout happens.
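
(A minimal stand-alone sketch, not wfl code, of the imap-based timeout mechanism described above:)

```python
# multiprocessing.pool.imap yields results lazily, and its iterator's next()
# accepts a timeout, raising multiprocessing.TimeoutError if the result is
# not ready in time - which is the hook for a per-config wall-time.
import multiprocessing
import time

def fake_calculation(x):
    """Stand-in for a per-config DFT evaluation."""
    time.sleep(x)
    return x * x

if __name__ == "__main__":
    inputs = [1, 2, 30, 1]        # the third "calculation" is far too slow
    per_item_timeout = 5.0        # seconds

    with multiprocessing.Pool(2) as pool:
        results = pool.imap(fake_calculation, inputs)
        for i in range(len(inputs)):
            try:
                print("done:", results.next(timeout=per_item_timeout))
            except multiprocessing.TimeoutError:
                # here one would want to hand back the *input* config,
                # annotated as unfinished - the access that autoparallelize
                # currently lacks
                print("timed out on input index", i)
                break
```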

Felixrccs commented 1 year ago

Maybe you should describe your use case more fully (number of configs, sizes of cells, heterogeneity of sizes, range of times per calculation, etc) to see if there's a way of dealing with it without adding this complexity.

So I have to recalculate my training set with tighter DFT parameters (about 1000 configs). I'm looking at metal-oxide surfaces that have between 80 and 200 atoms. The DFT single points normally take between 1 and 3 hours each on one node. I run 10 subprocesses on ten nodes; 50 structures take about 15 hours of wall time in total. However, from time to time there are a few amorphous structures that need up to 10 hours for a single point. Our cluster has a 24-hour submission limit, and in one case every DFT calculation was finished except for one complex config.

I solved it for my case by parallelizing the single points within VASP over multiple nodes.

But I think this would still be a good feature in terms of user-friendliness.

My current challenge is that there's currently no access to the input configurations when looping over the outputs in autoparallelize, so no easy way to return an annotated input config when the timeout happens.

I will think about this myself. If you have something to test, I can try it out.

bernstei commented 1 year ago

So by far the cleanest solution to this is to use the queuing system for what it's designed for, i.e. to run multiple jobs side by side or in sequence as resources become available, and the easiest way to do that is one queued job per configuration.

Is that an option (perhaps we can make the time and/or number of nodes settable on a per-config basis), or do you have a problem such as a maliciously-configured queuing system with a small limit on the number of submitted jobs (some UK HPC - archer, I think)?

Felixrccs commented 1 year ago

No, I can split it up into one job per DFT calculation (the MPG HPC queuing system is performing fine).

bernstei commented 1 year ago

My current challenge is that there's currently no access to the input configurations when looping over the outputs in autoparallelize, so no easy way to return an annotated input config when the timeout happens.

I think this may actually be pretty straightforward. I'll test my idea once I finish testing the kspacing thing.

bernstei commented 1 year ago

No, I can split it up into one job per DFT calculation (the MPG HPC queuing system is performing fine).

We should maybe think more carefully in general about how we handle failed jobs, etc., because as it stands now, even if you did this, you'd still have to rerun the script with a larger walltime limit if you wanted nice output - there'd be no clean way to just give up on the very slow job.

bernstei commented 1 year ago

I think this may actually be pretty straightforward. I'll test my idea once I finish testing the kspacing thing.

OK - on further thought, depending on what kind of timeouts you want to deal with, this could be quite messy. I think this needs a deep discussion first, and going through the possible use cases very carefully. I'm happy to do it, but I won't delay my other PR (the DFT calculator kspacing and non-periodic cell update) for it.

Also, I have a question:

I run 10 subprocesses on ten nodes

What exactly do you mean by this statement? Are you trying to do pool-based parallelism with 10 subprocesses in a job that uses 10 nodes? Can you describe more precisely how you are setting up the wfl script that's doing the database re-evaluation to try to achieve this? I ask because the normal multiprocessing.pool based autoparallelism doesn't work on multiple nodes.

Felixrccs commented 1 year ago

I'm running wfl.calculators.generic._run_autopara_wrappable with num_python_subprocesses=10 and num_inputs_per_python_subprocess=1. I think it works because in the end ASE just calls VASP with a command like srun -N 1 --ntasks=72 bin/vasp_std. So all the ASE VASP parsers and waiting processes probably run on one node (which should not matter since they are not expensive), and slurm in combination with VASP does the rest of the parallelization (and enables me to use multiple nodes).
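
(For context, a sketch of how each pool worker can end up launching its own srun step; the VASP settings, paths, and directory name below are placeholders:)

```python
# ASE's Vasp calculator builds its launch command from the environment
# (ASE_VASP_COMMAND / VASP_COMMAND), so every single point becomes one
# "srun -N 1" job step inside the multi-node allocation, and slurm
# spreads the steps over the allocated nodes.
import os

os.environ["ASE_VASP_COMMAND"] = "srun -N 1 --ntasks=72 /path/to/bin/vasp_std"

from ase.calculators.vasp import Vasp

# placeholder input settings; this calculator is what the wfl generic
# evaluation (the _run_autopara_wrappable call above) would then drive
# from each of the 10 python subprocesses
calc = Vasp(encut=520, kspacing=0.25, directory="run_000")
```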

So in my case the multiprocessing pool is only sending VASP run commands, waiting, and retrieving the results in an efficient way.

bernstei commented 1 year ago

So in my case the multiprocessing pool is only sending VASP run commands, waiting, and retrieving the results in an efficient way.

Well, it'll try to run 10 at the same time, so the question is: if you have a 10-node allocation (which was my understanding) and you run 10 srun -N 1 side by side on the allocation's head node, will they all end up running on a single node and waste the other 9, or will each end up on a different node? The latter is possible, but you may want to check to make sure you're not wasting 9 nodes.

[added: I'm fairly certain that if you used plain mpirun and not slurm, they would all try to run on the allocation's head node.]
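
(A quick way to check the placement, sketched as a hypothetical standalone script run inside the allocation; whether the steps spread out can also depend on how slurm is configured and on flags like srun --exclusive:)

```python
# Launch several "srun -N 1 -n 1 hostname" job steps concurrently and
# print which node each one landed on.
import subprocess
from concurrent.futures import ThreadPoolExecutor

def step_hostname(i):
    out = subprocess.run(
        ["srun", "-N", "1", "-n", "1", "hostname"],
        capture_output=True, text=True, check=True,
    )
    return i, out.stdout.strip()

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=10) as ex:
        for i, host in ex.map(step_hostname, range(10)):
            print(f"step {i} ran on {host}")
```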

Felixrccs commented 1 year ago

The latter is possible, but you may want to check to make sure you're not wasting 9 nodes.

No, I checked that: slurm reports 96-99% CPU utilization, and I also ran a benchmark to check that I get the expected time reduction. It also works if I run 2 subprocesses with srun -N 4 on 8 nodes. If I give the job fewer nodes than I have wfl subprocesses, slurm complains that it cannot allocate the computational resources.

So I'm confident that everything works as intended.

bernstei commented 1 year ago

Weird - it definitely doesn't seem to work on our slurm system. If I request an 8-node allocation and run 8 srun -N 1 --ntasks... they all run on the same node. Maybe it's something about how slurm is configured?