Open sungeunbae opened 3 months ago
Taking a quick look, I think we already have something for this: https://github.com/ucgmsim/slurm_gm_workflow/blob/master/workflow/automation/metadata/collect_metadata.py. I don't think it collects all of the data that's collected here, but it might be good to merge the two rather than have two different metadata collection scripts, which could cause confusion. The other metadata collection script also gathers extra metadata that could be used to improve the estimation beyond what we have here.
As a reminder, the 30 minutes is also the minimum allowed wall-clock time that can be set.
Good point. There were a few core differences that prompted me to write a new script instead of fixing the existing solution (e.g. it breaks when I have no sim_params.yaml for all realisations), but I can see your rationale.
A few key differences include:
I agree there are some commonalities between the two, and it may seem logical to merge them or let them share some code. I'm happy to withdraw this PR, but let's keep this branch for future reference. I don't want to lose some SLURM techniques I learned while writing this code :)
I want to add this script to the estimation toolkit in the workflow. The current wall-clock estimation is not very accurate, which makes it difficult to estimate the total core hours needed to run a Cybershake-style large set of simulations. We also don't collect the actual CPU time spent, so we lose crucial information about resource usage as well as the opportunity to improve the estimation quality.
This script interacts with SLURM via the `sacct` command (`-n` removes the header). The command returns more rows than we need (the `.bat+`, `.ext+` and `.0` job-step rows); we only need the two lines containing the `TimeLimit` field.
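That row filtering could be sketched roughly as follows. This is a hypothetical helper, not the actual script; the field layout assumes something like `sacct -n --format=JobID,TimeLimit,TotalCPU,NCPUS`:

```python
def keep_parent_rows(sacct_output: str) -> list[list[str]]:
    """Drop the .bat+/.ext+/.0 job-step rows and keep only the parent-job
    rows, which are the ones that carry a TimeLimit value."""
    rows = []
    for line in sacct_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        # step rows have ids like 1234567.bat+, 1234567.ext+, 1234567.0
        if "." in fields[0]:
            continue
        rows.append(fields)
    return rows


# made-up sample output in the assumed column order
sample = """\
1234567       08:17:00   02:10:42   160
1234567.bat+             02:10:40   160
1234567.ext+             00:00:00   160
1234567.0                02:10:38   160
7654321       00:30:00   00:03:38    80
7654321.bat+             00:03:38    80
"""
print(keep_parent_rows(sample))  # two parent rows survive
```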
Here, 02:10:42 and 00:03:38 correspond to the actual CPU time used. The requested wall-clock times were 08:17:00 and 00:30:00 respectively, and 160 and 80 are the numbers of CPUs.
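To do arithmetic on those fields, SLURM's `[D-]HH:MM:SS` duration strings need converting to seconds. A minimal sketch (the function name is mine, not from the script, and it ignores fractional seconds that `sacct` can append):

```python
def slurm_time_to_seconds(t: str) -> int:
    """Convert a SLURM duration like '02:10:42' or '1-02:10:42' to seconds."""
    days = 0
    if "-" in t:
        day_part, t = t.split("-")
        days = int(day_part)
    h, m, s = (int(x) for x in t.split(":"))
    return ((days * 24 + h) * 60 + m) * 60 + s


# the 02:10:42 of CPU time vs. the 08:17:00 wall-clock request on 160 CPUs
cpu_seconds = slurm_time_to_seconds("02:10:42")      # 7842
reserved = slurm_time_to_seconds("08:17:00") * 160   # total CPU-seconds reserved
```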
This script consults the SQLite3 DB to collect all "completed" job IDs, then uses the `sacct` command to gather the estimated/actual resource usage, and finally writes a CSV file.
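A rough sketch of that flow, with hypothetical table and column names (the actual workflow DB schema may differ):

```python
import sqlite3
import subprocess


def completed_job_ids(db_path: str) -> list[int]:
    """Collect the job ids of all 'completed' runs from the workflow DB.
    The table/column names here are illustrative only."""
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT job_id FROM state WHERE status = ?", ("completed",)
        ).fetchall()
    return [r[0] for r in rows]


def sacct_usage(job_id: int) -> str:
    """Ask sacct for one job's requested/actual usage; -n drops the header."""
    return subprocess.run(
        ["sacct", "-n", "-j", str(job_id),
         "--format=JobID,TimeLimit,TotalCPU,NCPUS"],
        capture_output=True, text=True, check=True,
    ).stdout
```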
Example run:
The generated CSV file looks like the following. Note that it includes the `cpu_seconds_need_for_all_rels` column at the end: the product of `cpu_seconds_used` and `num_rels`, the number of realisations needed for the fault/event (obtained from list.txt).
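Deriving that last column could look roughly like this. The column names match the CSV description above, but the helper itself and the input shapes are a hypothetical sketch:

```python
import csv


def write_usage_csv(rows: list[dict], num_rels: dict, out_path: str) -> None:
    """Append cpu_seconds_need_for_all_rels = cpu_seconds_used * num_rels
    (num_rels per fault/event taken from list.txt) and write the CSV."""
    fieldnames = ["fault", "num_rels", "cpu_seconds_used",
                  "cpu_seconds_need_for_all_rels"]
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for row in rows:
            n = num_rels[row["fault"]]
            writer.writerow({
                "fault": row["fault"],
                "num_rels": n,
                "cpu_seconds_used": row["cpu_seconds_used"],
                "cpu_seconds_need_for_all_rels": row["cpu_seconds_used"] * n,
            })
```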