nesi / APSIM-HPC

Deploy APSIM (Agricultural Production Systems sIMulator - https://www.apsim.info/) on high performance computing clusters.
MIT License

4-create-apsimx-files: Allow the for loop to tolerate up to 10 (an arbitrary number) consecutive input failures while moving the failed .txt files to a separate directory, AND implement a reliable restart method #35

Closed: DininduSenanayake closed 3 days ago

DininduSenanayake commented 2 weeks ago

There can be instances where the for loop in 4-create-apsimx-files fails after processing only some of the .txt files. Below are examples of such failures, from directories with a few thousand files each, where the failure occurred at roughly the 50% mark of each folder.

set-1


❯ cat 986863.out 
System.Exception: An error occurred while running config file commands.
 ---> System.Exception: Object reference not set to an instance of an object. : Add [Block] [Fereday_4b6]
   at Models.Core.ConfigFile.ConfigFile.RunInstructionOnApsimxFile(IModel simulations, Instruction instruction, String pathOfSimWithNode, String configFileDirectory) in /tmp/ApsimX/Models/Core/ConfigFile/ConfigFile.cs:line 311
   at Models.Core.ConfigFile.ConfigFile.RunConfigCommands(Simulations tempSim, String command, String configFileDirectory) in /tmp/ApsimX/Models/Core/ConfigFile/ConfigFile.cs:line 201
   --- End of inner exception stack trace ---
   at Models.Core.ConfigFile.ConfigFile.RunConfigCommands(Simulations tempSim, String command, String configFileDirectory) in /tmp/ApsimX/Models/Core/ConfigFile/ConfigFile.cs:line 220
   at Models.Program.ExecuteCommands(Options options, String configFileDirectory, List`1 commandsList, ApplyRunManager& applyRunManager, DataRow row) in /tmp/ApsimX/Models/Main.cs:line 425
   at Models.Program.DoCommands(Options options, String[] files, String configFileDirectory, List`1 commandsList) in /tmp/ApsimX/Models/Main.cs:line 313
   at Models.Program.Run(Options options) in /tmp/ApsimX/Models/Main.cs:line 175

set-3

❯ cat 986891.out
System.Exception: An error occurred while running config file commands.
 ---> System.Exception: Object reference not set to an instance of an object. : Add [Block] [Oturehua_14a2]
   at Models.Core.ConfigFile.ConfigFile.RunInstructionOnApsimxFile(IModel simulations, Instruction instruction, String pathOfSimWithNode, String configFileDirectory) in /tmp/ApsimX/Models/Core/ConfigFile/ConfigFile.cs:line 311
   at Models.Core.ConfigFile.ConfigFile.RunConfigCommands(Simulations tempSim, String command, String configFileDirectory) in /tmp/ApsimX/Models/Core/ConfigFile/ConfigFile.cs:line 201
   --- End of inner exception stack trace ---
   at Models.Core.ConfigFile.ConfigFile.RunConfigCommands(Simulations tempSim, String command, String configFileDirectory) in /tmp/ApsimX/Models/Core/ConfigFile/ConfigFile.cs:line 220
   at Models.Program.ExecuteCommands(Options options, String configFileDirectory, List`1 commandsList, ApplyRunManager& applyRunManager, DataRow row) in /tmp/ApsimX/Models/Main.cs:line 425
   at Models.Program.DoCommands(Options options, String[] files, String configFileDirectory, List`1 commandsList) in /tmp/ApsimX/Models/Main.cs:line 313
   at Models.Program.Run(Options options) in /tmp/ApsimX/Models/Main.cs:line 175

set-4

❯ cat 986892.out 
System.Exception: An error occurred while running config file commands.
 ---> System.Exception: Object reference not set to an instance of an object. : Add [Block] [TeRangiita_9a1]
   at Models.Core.ConfigFile.ConfigFile.RunInstructionOnApsimxFile(IModel simulations, Instruction instruction, String pathOfSimWithNode, String configFileDirectory) in /tmp/ApsimX/Models/Core/ConfigFile/ConfigFile.cs:line 311
   at Models.Core.ConfigFile.ConfigFile.RunConfigCommands(Simulations tempSim, String command, String configFileDirectory) in /tmp/ApsimX/Models/Core/ConfigFile/ConfigFile.cs:line 201
   --- End of inner exception stack trace ---
   at Models.Core.ConfigFile.ConfigFile.RunConfigCommands(Simulations tempSim, String command, String configFileDirectory) in /tmp/ApsimX/Models/Core/ConfigFile/ConfigFile.cs:line 220
   at Models.Program.ExecuteCommands(Options options, String configFileDirectory, List`1 commandsList, ApplyRunManager& applyRunManager, DataRow row) in /tmp/ApsimX/Models/Main.cs:line 425
   at Models.Program.DoCommands(Options options, String[] files, String configFileDirectory, List`1 commandsList) in /tmp/ApsimX/Models/Main.cs:line 313
   at Models.Program.Run(Options options) in /tmp/ApsimX/Models/Main.cs:line 175

I do not think re-running the for loop will skip the files that were already processed (there is no native checkpointing for Models --apply).

Therefore,

  1. First, we should let the for loop carry on instead of killing the whole job after a single failure. Perhaps we can let it continue while the number of consecutive failures is < 10
  2. Also, we should move the failed files to a separate directory; let's name it FAILED
  3. Then we need a reliable way to restart the loop

Ollehar commented 2 weeks ago

The job failed because the requested soil was not present in the soil library.

We need either a way to extract soil names from the soil library, or a function that checks that the soil exists in the library,

or:

some kind of failure handling that still continues the loop over the config files that work. E.g., on failure, move the config file to FAILED, advance the loop index by 1 and continue, plus a fail-safe: after 10 failures in a row, break the loop and write to the log.
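
The "check that the soil exists" option could be sketched as a pre-flight test run before each Models call. This is a minimal sketch under two assumptions: that the soil name can be recovered from the config file name by dropping the final `_<id>ConfigFile.txt` part (inferred from file names in this thread, such as Oturehua_14a2_19487ConfigFile.txt), and that a plain-text list of library soil names, one per line, is available. The list file and both function names are hypothetical, and how to produce the list from the soil library is not covered here.

```shell
#!/bin/bash

# Derive the soil name from a config file name, e.g.
# Oturehua_14a2_19487ConfigFile.txt -> Oturehua_14a2
# (naming convention inferred from the listings in this thread).
soil_name_from_config() {
    local base
    base="$(basename "$1")"
    # Strip the shortest trailing "_..." part, i.e. "_<id>ConfigFile.txt".
    printf '%s\n' "${base%_*}"
}

# Return 0 if the soil for this config file appears in the soil list.
# $2 is a hypothetical text file with one soil name per line.
soil_in_library() {
    local config="$1" soil_list="$2"
    grep -Fxq -- "$(soil_name_from_config "$config")" "$soil_list"
}
```

In a loop over config files, a guard such as `soil_in_library "$file" soil_names.txt || { mv "$file" FAILED/; continue; }` would set aside files with unknown soils without starting the container at all.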

DininduSenanayake commented 1 week ago

@Ollehar Looks like the "skip and hop" script I have put together is working as intended.

❯ ls FAILED/
Oturehua_14a2_19487ConfigFile.txt  Oturehua_14a2_20993ConfigFile.txt  Oturehua_14a2_28426ConfigFile.txt  Oturehua_14a2_29813ConfigFile.txt
Oturehua_14a2_20661ConfigFile.txt  Oturehua_14a2_28276ConfigFile.txt  Oturehua_14a2_28720ConfigFile.txt  Oturehua_14a2_9938ConfigFile.txt

nfigFile.cs:line 311
   at Models.Core.ConfigFile.ConfigFile.RunConfigCommands(Simulations tempSim, String command, String configFileDirectory) in /tmp/ApsimX/Models/Core/ConfigFile/ConfigFile.cs:line 201
   --- End of inner exception stack trace ---
   at Models.Core.ConfigFile.ConfigFile.RunConfigCommands(Simulations tempSim, String command, String configFileDirectory) in /tmp/ApsimX/Models/Core/ConfigFile/ConfigFile.cs:line 220
   at Models.Program.ExecuteCommands(Options options, String configFileDirectory, List`1 commandsList, ApplyRunManager& applyRunManager, DataRow row) in /tmp/ApsimX/Models/Main.cs:line 425
   at Models.Program.DoCommands(Options options, String[] files, String configFileDirectory, List`1 commandsList) in /tmp/ApsimX/Models/Main.cs:line 313
   at Models.Program.Run(Options options) in /tmp/ApsimX/Models/Main.cs:line 175
Failed to process Oturehua_14a2_9938ConfigFile.txt
Successfully processed Oturehua_15a1_13223ConfigFile.txt
Successfully processed Oturehua_15a1_19487ConfigFile.txt
Successfully processed Oturehua_15a1_20661ConfigFile.txt
Successfully processed Oturehua_15a1_20993ConfigFile.txt
Successfully processed Oturehua_15a1_28276ConfigFile.txt
Successfully processed Oturehua_15a1_28426ConfigFile.txt
Successfully processed Oturehua_15a1_28720ConfigFile.txt
Successfully processed Oturehua_15a1_29813ConfigFile.txt
Successfully processed Oturehua_15a1_9938

Script

#!/bin/bash

#SBATCH --job-name=apsim_models
#SBATCH --output=slurmlogs/%j.out
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=16:00:00

module load Apptainer
export APPTAINER_BIND="/agr/scratch,/agr/persist"
export APPTAINER_CMD="apptainer exec /agr/persist/projects/2024_apsim_improvements/apsim-simulations/container/apsim-2024.09.7579.0.aimg"

# Create FAILED directory if it doesn't exist
mkdir -p FAILED

consecutive_failures=0
max_consecutive_failures=10

# Function to process a file
process_file() {
    local file="$1"
    # APPTAINER_CMD is left unquoted on purpose so it splits into the command and its arguments
    if ${APPTAINER_CMD} Models --cpu-count "${SLURM_CPUS_PER_TASK}" --apply "$file"; then
        echo "Successfully processed $file"
        return 0
    else
        echo "Failed to process $file"
        return 1
    fi
}

# Run command for all .txt files, excluding ExampleConfig.txt
for file in *.txt; do
    if [ -f "$file" ] && [ "$file" != "ExampleConfig.txt" ]; then
        if process_file "$file"; then
            consecutive_failures=0
        else
            mv "$file" FAILED/
            consecutive_failures=$((consecutive_failures + 1))

            if [ "$consecutive_failures" -ge "$max_consecutive_failures" ]; then
                echo "Error: $max_consecutive_failures consecutive failures reached. Terminating job." >&2
                exit 1
            fi
        fi
    fi
done

echo "Processing completed."

Risks

I had to take away -e. This adds the risk of the shell not exiting immediately if any command exits with a non-zero status, but we can consider this low risk, as it is a serial for loop and the progress is recorded in standard output.
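
For reference, a small sketch of how -e interacts with this kind of loop, in case re-enabling it is considered later. It is an illustration only, not the project's script, and the variable name is hypothetical. Two bash behaviours matter: a failing command used as an `if` condition does not trigger -e, whereas a post-increment such as `((count++))` returns a non-zero status when the old value is 0 and therefore would.

```shell
#!/bin/bash
set -e   # exit on any unguarded non-zero status

count=0

# A failing command tested by `if` is exempt from -e, so the script continues.
if false; then
    echo "not reached"
else
    echo "failure handled, script continues"
fi

# ((count++)) evaluates to the old value (0) and so returns status 1, which
# would terminate the script at this point under -e. An assignment using
# arithmetic expansion returns status 0, so it is safe.
count=$((count + 1))
count=$((count + 1))

echo "count=$count"
```

With the counter written this way, the failure handling in the loop would not by itself require dropping -e.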

Confirming the outcome

❯ sacct
JobID          JobName         User             Start    Elapsed     AveCPU     MinCPU   TotalCPU Alloc NTask  ReqMem     MaxRSS State      NodeList                       
-------------- --------------- --------- ------------ ---------- ---------- ---------- ---------- ----- ----- ------- ---------- ---------- ------------------------------ 
987828         apsim_models    dinindu   Sep 13 16:22   10:50:15                         13:22:39     4            8G            COMPLETED  compute-0                      
987828.batch   batch                     Sep 13 16:22   10:50:15   04:21:09   04:21:09   13:22:39     4     1           3333440K COMPLETED  compute-0  

and the standard output confirms it

❯ tail slurmlogs/987828.out 
Successfully processed Selwyn_52a1_29813ConfigFile.txt
Successfully processed Selwyn_52a1_9938ConfigFile.txt
Successfully processed Selwyn_52a2_13223ConfigFile.txt
Successfully processed Selwyn_52a2_19487ConfigFile.txt
Successfully processed Selwyn_52a2_20661ConfigFile.txt
Successfully processed Selwyn_52a2_20993ConfigFile.txt
Successfully processed Selwyn_52a2_28276ConfigFile.txt
Successfully processed Selwyn_52a2_28426ConfigFile.txt
Successfully processed Selwyn_52a2_28720ConfigFile.txt
Processing completed.
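
The tail only shows the last few files. A sketch for tallying the whole log, assuming it contains the "Successfully processed" and "Failed to process" messages printed by process_file in the script above; the function name is hypothetical.

```shell
#!/bin/bash

# Print how many files succeeded and failed according to a job log.
# $1 is the path to the log, e.g. a file under slurmlogs/.
summarise_log() {
    local log="$1" ok failed
    # grep -c prints 0 but exits non-zero when nothing matches; "|| true"
    # keeps that from being treated as an error.
    ok="$(grep -c '^Successfully processed ' "$log" || true)"
    failed="$(grep -c '^Failed to process ' "$log" || true)"
    echo "succeeded=$ok failed=$failed"
}
```

For example, `summarise_log slurmlogs/987828.out` prints one line with both counts, and the failed count can be compared against the number of files in FAILED/.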

DininduSenanayake commented 1 week ago

Fix was merged to main in https://github.com/DininduSenanayake/APSIM-eri-mahuika/pull/36

DininduSenanayake commented 3 days ago

Restart protocol was introduced in https://github.com/DininduSenanayake/APSIM-eri-mahuika/pull/49
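
The contents of that pull request are not reproduced in this thread. Purely as an illustration of one possible restart approach, and not necessarily what the PR implements, the loop could append each successfully processed file name to a checkpoint file and skip those names on a re-run. The checkpoint file name (processed.log) and both function names below are hypothetical, and the consecutive-failure limit is left out for brevity.

```shell
#!/bin/bash

# Hypothetical checkpoint file: one successfully processed file name per line.
DONE_LOG="${DONE_LOG:-processed.log}"

# Return 0 if the file is already recorded as processed.
already_done() {
    [ -f "$DONE_LOG" ] && grep -Fxq -- "$1" "$DONE_LOG"
}

# Run the given command on every .txt file in the current directory that
# has not been processed yet. "$@" is the per-file command, for example
# the process_file function from the script above.
process_remaining() {
    local file
    for file in *.txt; do
        [ -f "$file" ] || continue
        [ "$file" = "ExampleConfig.txt" ] && continue
        if already_done "$file"; then
            echo "Skipping $file (already processed)"
            continue
        fi
        if "$@" "$file"; then
            # Record success only after the command has finished.
            echo "$file" >> "$DONE_LOG"
        else
            mkdir -p FAILED
            mv "$file" FAILED/
        fi
    done
}
```

Calling `process_remaining process_file` a second time in the same directory would then skip everything already listed in processed.log, and files moved to FAILED/ are not retried because they no longer match the glob.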