nesi / APSIM-HPC

Deploy APSIM (Agricultural Production Systems sIMulator - https://www.apsim.info/) on high performance computing clusters.
MIT License

Reliable and clean "restart" method for jobs that "timed out" while creating .apsimx (and .db placeholder) files #48

Closed DininduSenanayake closed 1 week ago

DininduSenanayake commented 1 week ago

There can be instances where the for loop that generates the .apsimx and .db placeholder files (https://github.com/DininduSenanayake/APSIM-eri-mahuika/blob/main/5-create-apsimx-files/create_apsimx_skip_failed.sl) exceeds the time limit, i.e. the actual run time can differ from the predicted time because of overhead, I/O slowdowns, etc.

Therefore, it would be ideal to have a clean and reliable method to restart these timed-out jobs and process only the remaining Config.txt files without re-running everything. Models --apply itself doesn't appear to have a built-in function to do this.
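To illustrate the idea, here is a minimal sketch of how the remaining config files could be worked out from the timed-out job's standard output. It is not the method used in the repository. It assumes (hypothetically) that the .out log mentions each processed config file by name and that lines for failed files contain the word "failed"; the names `processed_from_log` and `remaining_configs` are made up for this example.

```python
from pathlib import Path

# Assumption: lines about failed files contain this word (case-insensitive).
FAILED_MARKER = "failed"


def processed_from_log(log_text, failed_marker=FAILED_MARKER):
    """Return the set of .txt file names mentioned on non-failed log lines."""
    done = set()
    for line in log_text.splitlines():
        if failed_marker in line.lower():
            continue  # skip entries that relate to failed files
        for token in line.split():
            token = token.strip("'\",:;()[]")
            if token.endswith(".txt"):
                done.add(Path(token).name)
    return done


def remaining_configs(all_configs, log_text):
    """Return config files (sorted) that do not appear as processed in the log."""
    done = processed_from_log(log_text)
    return sorted(name for name in all_configs if Path(name).name not in done)


if __name__ == "__main__":
    # Tiny made-up log, only to show the behaviour.
    sample_log = "Processing a.txt\nProcessing b.txt FAILED\nProcessing c.txt\n"
    print(remaining_configs(["a.txt", "b.txt", "c.txt", "d.txt"], sample_log))
```

A restart job would then loop over only the returned files. Note that a job killed mid-file may have logged the last file before finishing it, so that entry might need to be treated as unprocessed.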

DininduSenanayake commented 1 week ago

We have a working solution for this: https://github.com/DininduSenanayake/APSIM-eri-mahuika/pull/49

989855         apsim_models    dinindu   Sep 19 11:44   00:57:13                         01:09:35     4            8G            COMPLETED  compute-3                      
989855.batch   batch                     Sep 19 11:44   00:57:13   00:23:39   00:23:39   01:09:35     4     1           3196592K COMPLETED  compute-3 

Everything checks out based on the numbers ✅

Total number of .txt files in `set-1`                       = 2612

Total number of "legal" entries in timed-out standard .out  = 2450

Total number of "legal" entries in restart script           =  162

*"legal" = entries that remain after removing the lines related to failed files
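The check behind these numbers is that the entries handled by the timed-out job plus the entries handled by the restart should add up to the total number of config files. A small sketch of that arithmetic, with a hypothetical helper name:

```python
def restart_counts_match(total_configs, done_before_timeout, done_in_restart):
    """True when the two runs together account for every config file."""
    return done_before_timeout + done_in_restart == total_configs


# Numbers reported above for set-1.
print(restart_counts_match(2612, 2450, 162))  # True
```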