Batch timeseries analysis in slurm

ledm commented 7 months ago

Closes #118

It's looking like this script is working now.

This PR adds a slurm queue based batch parallel processing of single job timeseries tool.

It's got the following features:

Loads from existing input_yaml files.
Run with a single command
Fully parallelises the slowest part of the analysis_comparison tool: the single job analysis.
Won't submit the same jobID if it's already existing*
However, if two analyses call the same jobID with different suites (ie one has bgc and one has physics), it will only run the first one.

Need to do:

Documentation in README.md
Is it @DrYool proof?

valeriupredoi commented 7 months ago

oh and also - this things really do deserve a test, not thru and thru with SLURM submission, but everything up to that. I can write the test when it's about ready :+1:

ledm commented 7 months ago

let's make sure the piping is done correctly; also, have you tried this in practice? We don't need any special env to pass to sbatch do we? Like any special environment variables

I've been using this for a few days and it works on jasmin. Your amendment to the subprocess also works too.

If batch_timeseries fails you get normal python errors. If it fails inside inside the sbatch script, then you get error messages in the places that we tell it to fail.

valeriupredoi commented 7 months ago

yeh that's how we want it to behave, so stdout can be piped to eg a file. Looks good, bud! Let me write a test for it!

ledm commented 7 months ago

I'm not ready to merge. Still need to add documentation & maybe get @DrYool to try it.

ledm commented 7 months ago

The next question I have:

Do we want to make this the default behaviour? This would mean that we run this script from inside analysis_compare when the -s flag is absent instead of calling the analysis_timeseries command?

ledm commented 7 months ago

Basically, the process for adding a new job, input.yml:

analysis_compare -s -y input.yml: This generates the job download commands, which will run overnight. It also creates an html report, but it breaks if there's no data downloaded yet.
Wait overnight for data to download on mass.
batch_timeseries -y input.yml: This submits the job timeseries onto the processing nodes.
analysis_compare -s -y input.yml: This job creates the html report.
./rsync_to_esmeval.sh: this copies to html to the web visible location on disk.

I suspect that these can be merged into fewer commands!

valeriupredoi commented 7 months ago

@ledm I added some test gubbins, make sure to pull or merge so there are no conflictseses

ledm commented 7 months ago

The logo in the README points towards a file on the main branch, but of course it's not available yet until this PR is merged.

ledm commented 7 months ago

Okay @valeriupredoi, I'm happy with this now.

valeriupredoi commented 7 months ago

all good by me too, bud! Go ahead and merge when you good 🍺

valeriupredoi / bgcval2

Batch timeseries analysis in slurm #120