ucgmsim / slurm_gm_workflow

Porting the GM workflow to run on new NeSI HPC (Maintainer: Jonney)

Collect estimated time and time actually spent for better estimation #521

Open sungeunbae opened 3 months ago

sungeunbae commented 3 months ago

I want to add this script to the estimation toolkit in the workflow. The current wall-clock estimation is not very accurate, which makes it difficult to estimate the total core hours needed to run a Cybershake-style large set of simulations. We also don't collect the actual CPU time spent, so we are losing crucial resource-usage information as well as the opportunity to improve the estimation quality.

This script will interact with SLURM using the sacct command (-n removes the header):

$ sacct -j 4088784,4088640 --format="JobID,Elapsed,TimeLimit,AllocCPUS" -n
4088640        02:10:42   08:17:00        160
4088640.bat+   02:10:42                    80
4088640.ext+   02:10:42                   160
4088640.0      02:10:06                   160
4088784        00:03:38   00:30:00         80
4088784.bat+   00:03:38                    80
4088784.ext+   00:03:38                    80
4088784.0      00:03:19                    80

This command gives more output than we need (the .bat+, .ext+ and .0 sub-steps). We only need the two lines containing the TimeLimit field:

4088640        02:10:42   08:17:00        160
4088784        00:03:38   00:30:00         80

Here, 02:10:42 and 00:03:38 are the elapsed (wall-clock) times actually used, 08:17:00 and 00:30:00 are the wall-clock times requested, and 160 and 80 are the numbers of allocated CPUs.
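
For reference, querying and filtering that sacct output could look roughly like the sketch below. This is a minimal illustration, not the actual script: query_sacct is a hypothetical helper name, and it assumes sacct's -P (parsable2) flag to get pipe-separated fields instead of the fixed-width columns shown above.

import subprocess

def query_sacct(job_ids):
    """Return {job_id: (elapsed, time_limit, alloc_cpus)} for the given SLURM job IDs."""
    cmd = [
        "sacct",
        "-j", ",".join(job_ids),
        "--format=JobID,Elapsed,TimeLimit,AllocCPUS",
        "-n",  # no header
        "-P",  # pipe-separated fields, easier to split than fixed-width columns
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    results = {}
    for line in out.splitlines():
        job_id, elapsed, time_limit, alloc_cpus = line.split("|")
        if time_limit:  # the .bat+/.ext+/.0 sub-steps have an empty TimeLimit field
            results[job_id] = (elapsed, time_limit, int(alloc_cpus))
    return results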

This script will consult the SQLite3 DB and collect all "completed" job IDs, then use the sacct command to collect the estimated/actual resource usage, and finally create a CSV file.
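
A minimal sketch of the DB-to-CSV flow is below. The state table and column names are assumptions for illustration only; the actual slurm_mgmt.db schema in the workflow may differ.

import csv
import sqlite3

def collect_completed_jobs(db_file):
    """Collect (run_name, proc_type, job_id) for all jobs marked completed in the management DB."""
    with sqlite3.connect(db_file) as conn:
        return conn.execute(
            "SELECT run_name, proc_type, job_id FROM state WHERE status = ?",
            ("completed",),
        ).fetchall()

def write_csv(rows, out_file="est_vs_used_cpu_time.csv"):
    """Write one record per job, using the header of the CSV shown further below."""
    header = [
        "run_name", "proc_type", "machine", "job_id", "num_cpus",
        "time_requested", "time_used", "cpu_seconds_used",
        "num_rels", "cpu_seconds_need_for_all_rels",
    ]
    with open(out_file, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)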

Example run:

(python3_maui) baes@maui02: /nesi/nobackup/nesi00213/RunFolder/Cybershake/v24p6$ python $gmsim/workflow/workflow/automation/estimation/get_est_vs_used_cputime.py . list_crustal.txt 
Record found in DB : 4302 entries
CSV file est_vs_used_cpu_time.csv created successfully!
Total CPU hours needed for all realisations on maui : 3386491.20 hours
Total CPU hours needed for all realisations on mahuika : 5386.97 hours

The generated CSV file looks like the following. Note that it includes the cpu_seconds_need_for_all_rels column at the end: the product of cpu_seconds_used and the num_rels needed for the fault/event (obtained from the fault list file, list_crustal.txt in the example above). A worked example of this calculation follows the CSV.

run_name,proc_type,machine,job_id,num_cpus,time_requested,time_used,cpu_seconds_used,num_rels,cpu_seconds_need_for_all_rels
AhuririR,BB,maui,4088784,80,00:30:00,00:03:38,17440.0,30,523200.0
AhuririR,EMOD3D,maui,4088640,160,08:17:00,02:10:42,1254720.0,30,37641600.0
AhuririR,HF,maui,4088641,80,00:30:00,00:03:33,17040.0,30,511200.0
AhuririR,IM_calculation,maui,4088789,80,01:15:00,00:20:42,99360.0,30,2980800.0
AhuririR,INSTALL_FAULT,mahuika,47954154,2,00:15:00,00:00:29,58.0,0,0.0
AhuririR,INSTALL_REALISATION,mahuika,47954072,2,00:15:00,00:00:28,56.0,30,1680.0
AhuririR,SRF_GEN,mahuika,47954054,32,01:00:00,00:00:59,1888.0,30,56640.0
AhuririR,VM_GEN,mahuika,47954077,32,01:00:00,00:04:31,8672.0,0,0.0
AhuririR,VM_PARAMS,mahuika,47954053,2,00:15:00,00:01:04,128.0,0,0.0
...
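
To make the last two columns concrete, this is how they can be reproduced for the BB row above (hms_to_seconds is an illustrative helper, not part of the script):

def hms_to_seconds(hms):
    """Convert a SLURM HH:MM:SS string to seconds (day prefixes such as 1-02:00:00 are not handled here)."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s

# AhuririR BB row: time_used 00:03:38 on 80 CPUs, 30 realisations for the fault
cpu_seconds_used = hms_to_seconds("00:03:38") * 80     # 218 s * 80 = 17440 (17440.0 in the CSV)
cpu_seconds_need_for_all_rels = cpu_seconds_used * 30  # 17440 * 30 = 523200 (523200.0 in the CSV)
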
sungeunbae commented 3 months ago

Just taking a quick look, I think we already have something for this: https://github.com/ucgmsim/slurm_gm_workflow/blob/master/workflow/automation/metadata/collect_metadata.py. I don't think it has all of the data that's collected here, but it might be good to merge the two rather than having two different metadata collection scripts, which could cause some confusion. The other metadata collection script does gather extra metadata that can be used to improve the estimation beyond what we have here.

As a reminder, 30 minutes is also the minimum wall-clock time that can be requested.

Good point. There were a few core differences that prompted me to write a new script instead of fixing the existing solution (e.g. the existing one breaks when sim_params.yaml is missing for some realisations), but I can see your rationale.

A few key differences include:

  1. My script allows running only the median event and extrapolating the estimation to all realisations.
  2. It interacts directly with SLURM instead of relying on slurm_mgmt.db. If some jobs were managed outside run_cybershake due to issues and needed manual intervention, the DB may not necessarily have the most up-to-date data.

I agree there are some commonalities between the two, and it may seem logical to merge them or let them share some code. I'm happy to withdraw this PR, but let's keep this branch for future reference. I don't want to lose some of the SLURM techniques I learned while writing this code :)