nesi / APSIM-HPC

Deploy APSIM (Agricultural Production Systems sIMulator - https://www.apsim.info/) on high performance computing clusters.
https://nesi.github.io/APSIM-HPC/
MIT License

Introduce better error handling for failed jobs as Slurm state was marked as "COMPLETED" for failed jobs #13

Closed DininduSenanayake closed 2 months ago

DininduSenanayake commented 2 months ago

An example of a failed job that Slurm reports as "COMPLETED":

$ sacct -j 897893_8
JobID          JobName         User             Start    Elapsed     AveCPU     MinCPU   TotalCPU Alloc NTask  ReqMem     MaxRSS State      NodeList                       
-------------- --------------- --------- ------------ ---------- ---------- ---------- ---------- ----- ----- ------- ---------- ---------- ------------------------------ 
897893_8       apsim_models    dinindu   Aug 26 15:19   00:00:33                        00:26.028     4            8G            COMPLETED  compute-3                      
897893_8.batch batch                     Aug 26 15:19   00:00:33   00:00:24   00:00:24  00:26.028     4     1           2647072K COMPLETED  compute-3 

Unfortunately, this state is incorrect: the job's output file shows that APSIM failed with an exception.

$ cat 897893_8.out
System.Exception: An error occured trying to save a simulation to /agr/persist/projects/2024_apsim_improvements/apsim-simulations/ConfigFiles/13223_Airfield_5a1.apsimx. System.IO.FileNotFoundException: Could not find file '/agr/persist/projects/2024_apsim_improvements/apsim-simulations/ConfigFiles/LargerExampletemp.apsimx.temp'.
File name: '/agr/persist/projects/2024_apsim_improvements/apsim-simulations/ConfigFiles/LargerExampletemp.apsimx.temp'
   at System.IO.File.Move(String sourceFileName, String destFileName, Boolean overwrite)
   at Models.Core.Simulations.Write(String currentFileName, String savePath) in /tmp/ApsimX/Models/Core/Simulations.cs:line 187
   at Models.Core.Simulations.Write(String currentFileName, String savePath) in /tmp/ApsimX/Models/Core/Simulations.cs:line 193
   at Models.Program.ExecuteCommands(Options options, String configFileDirectory, List`1 commandsList, ApplyRunManager& applyRunManager, DataRow row) in /tmp/ApsimX/Models/Main.cs:line 399
   at Models.Program.DoCommands(Options options, String[] files, String configFileDirectory, List`1 commandsList) in /tmp/ApsimX/Models/Main.cs:line 312
   at Models.Program.Run(Options options) in /tmp/ApsimX/Models/Main.cs:line 174

DininduSenanayake commented 2 months ago

Fixed in https://github.com/DininduSenanayake/APSIM-eri-mahuika/commit/1baeaf453cbb1f8216f732533fc3da6ba8c4c053
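
For readers without access to the linked commit: Slurm derives a job's state from the batch script's exit code, so a job is reported as "COMPLETED" whenever the script exits with 0, even if a command inside it logged an exception. The following is a minimal sketch of one way to detect this, not the fix from the commit. It assumes the failure can be recognised by a .NET exception name in the job's output file, as in the log above. The function name `check_apsim_log` and the variables in the usage comment are hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical helper: inspect a job output file for a .NET exception.
# Returns 0 if none is found, 1 if one is found, 2 if the file is unreadable.
check_apsim_log() {
    local log_file="$1"

    # A missing or unreadable log is treated as a failure, not as success.
    if [ ! -r "$log_file" ]; then
        echo "Cannot read log file: ${log_file}" >&2
        return 2
    fi

    # Assumption: failures show up as "System.Exception" or a more specific
    # type such as "System.IO.FileNotFoundException", as in the log above.
    if grep -q -E 'System\.[A-Za-z.]*Exception' "$log_file"; then
        echo "Exception found in ${log_file}" >&2
        return 1
    fi

    return 0
}

# Possible use at the end of a batch script (placeholder names):
#   run_apsim_command > "${log_file}" 2>&1
#   status=$?
#   check_apsim_log "${log_file}" || status=1
#   exit "${status}"
```

Exiting with a non-zero status makes Slurm record the job as "FAILED", so `sacct` and any job dependencies see the real outcome. Matching on log text is a fallback; if the command already returns a non-zero exit code on failure, propagating that code directly is more robust.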