Closed DininduSenanayake closed 3 days ago
Job failed as the requested soil was not present in the soil library.
We need either a way to extract soil names from the soil library, or a function that checks that the soil exists in the library,
or:
some kind of failure handling that still continues the loop over the config files that work, e.g. if a run fails, move the config file to FAILED, increment the loop index and continue, plus a fail-safe that breaks the loop and writes to the log after 10 consecutive failures.
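For the first option, a pre-flight check could look roughly like the sketch below. It assumes the soil names have already been exported to a plain-text file with one name per line; that file, the function name soil_exists and the example soil names are all hypothetical, and how to export the names from the soil library is not covered here.

```shell
#!/bin/bash
# Hypothetical pre-flight check. The list file is an ASSUMED plain-text export
# of the soil library (one soil name per line), not something APSIM writes itself.

# Return 0 if the exact soil name is listed in the given file, 1 otherwise.
soil_exists() {
    local soil="$1" list_file="$2"
    # -F: fixed string, -x: match the whole line, -q: no output
    grep -Fxq -- "$soil" "$list_file"
}

# Example usage with a throwaway list of made-up soil names.
list_file="$(mktemp)"
printf '%s\n' "ExampleSoil_A" "ExampleSoil_B" > "$list_file"

for soil in "ExampleSoil_A" "ExampleSoil_Z"; do
    if soil_exists "$soil" "$list_file"; then
        echo "found: $soil"
    else
        echo "missing: $soil"
    fi
done
```

Matching whole lines (-x) avoids a partial name being accepted as a hit.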
@Ollehar Looks like the "skip and hop" script I have compiled is working as intended. Config files that fail are moved to the FAILED directory (see the following output):
❯ ls FAILED/
Oturehua_14a2_19487ConfigFile.txt Oturehua_14a2_20993ConfigFile.txt Oturehua_14a2_28426ConfigFile.txt Oturehua_14a2_29813ConfigFile.txt
Oturehua_14a2_20661ConfigFile.txt Oturehua_14a2_28276ConfigFile.txt Oturehua_14a2_28720ConfigFile.txt Oturehua_14a2_9938ConfigFile.txt
nfigFile.cs:line 311
at Models.Core.ConfigFile.ConfigFile.RunConfigCommands(Simulations tempSim, String command, String configFileDirectory) in /tmp/ApsimX/Models/Core/ConfigFile/ConfigFile.cs:line 201
--- End of inner exception stack trace ---
at Models.Core.ConfigFile.ConfigFile.RunConfigCommands(Simulations tempSim, String command, String configFileDirectory) in /tmp/ApsimX/Models/Core/ConfigFile/ConfigFile.cs:line 220
at Models.Program.ExecuteCommands(Options options, String configFileDirectory, List`1 commandsList, ApplyRunManager& applyRunManager, DataRow row) in /tmp/ApsimX/Models/Main.cs:line 425
at Models.Program.DoCommands(Options options, String[] files, String configFileDirectory, List`1 commandsList) in /tmp/ApsimX/Models/Main.cs:line 313
at Models.Program.Run(Options options) in /tmp/ApsimX/Models/Main.cs:line 175
Failed to process Oturehua_14a2_9938ConfigFile.txt
Successfully processed Oturehua_15a1_13223ConfigFile.txt
Successfully processed Oturehua_15a1_19487ConfigFile.txt
Successfully processed Oturehua_15a1_20661ConfigFile.txt
Successfully processed Oturehua_15a1_20993ConfigFile.txt
Successfully processed Oturehua_15a1_28276ConfigFile.txt
Successfully processed Oturehua_15a1_28426ConfigFile.txt
Successfully processed Oturehua_15a1_28720ConfigFile.txt
Successfully processed Oturehua_15a1_29813ConfigFile.txt
Successfully processed Oturehua_15a1_9938
#!/bin/bash
#SBATCH --job-name=apsim_models
#SBATCH --output=slurmlogs/%j.out
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=16:00:00

module load Apptainer
export APPTAINER_BIND="/agr/scratch,/agr/persist"
export APPTAINER_CMD="apptainer exec /agr/persist/projects/2024_apsim_improvements/apsim-simulations/container/apsim-2024.09.7579.0.aimg"

# Create FAILED directory if it doesn't exist
mkdir -p FAILED

consecutive_failures=0
max_consecutive_failures=10

# Function to process a file; returns 0 on success, 1 on failure
process_file() {
    local file="$1"
    # APPTAINER_CMD is deliberately unquoted so it splits into command and arguments
    if ${APPTAINER_CMD} Models --cpu-count "${SLURM_CPUS_PER_TASK}" --apply "$file"; then
        echo "Successfully processed $file"
        return 0
    else
        echo "Failed to process $file"
        return 1
    fi
}

# Run command for all .txt files, excluding ExampleConfig.txt
for file in *.txt; do
    if [ -f "$file" ] && [ "$file" != "ExampleConfig.txt" ]; then
        if process_file "$file"; then
            consecutive_failures=0
        else
            # Move the failed config file out of the way and keep going
            mv "$file" FAILED/
            # Plain assignment: ((consecutive_failures++)) returns a non-zero status when the old value is 0
            consecutive_failures=$((consecutive_failures + 1))
            if [ "$consecutive_failures" -ge "$max_consecutive_failures" ]; then
                echo "Error: $max_consecutive_failures consecutive failures reached. Terminating job." >&2
                exit 1
            fi
        fi
    fi
done

echo "Processing completed."
I had to take away -e. This adds the risk of the shell not exiting immediately if any command exits with a non-zero status, but we can consider this a low-risk scenario as it is a serial for loop and the progress is recorded in standard output.
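For reference, a small sketch of how -e interacts with the two constructs used in the loop. A command tested by if does not trigger an exit when it fails, whereas ((count++)) returns a non-zero status when the old value is 0 and would stop the script under -e. The function name errexit_demo is made up for this illustration.

```shell
#!/bin/bash
# The function body runs in a subshell (note the parentheses),
# so `set -e` stays local to the demo.
errexit_demo() (
    set -e

    # A failing command used as an `if` condition does not trigger errexit.
    if false; then
        echo "not reached"
    else
        echo "failure handled, still running"
    fi

    count=0
    # ((count++)) would return status 1 here (old value is 0) and end the
    # subshell under -e; an arithmetic assignment always returns 0.
    count=$((count + 1))
    echo "count=$count"
)

errexit_demo
```

So -e could in principle stay switched on as long as failures are only tested inside if and the counter is updated with a plain assignment.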
❯ sacct
JobID         JobName       User     Start         Elapsed   AveCPU    MinCPU    TotalCPU  Alloc  NTask  ReqMem  MaxRSS    State      NodeList
------------  ------------  -------  ------------  --------  --------  --------  --------  -----  -----  ------  --------  ---------  ---------
987828        apsim_models  dinindu  Sep 13 16:22  10:50:15                      13:22:39  4             8G                COMPLETED  compute-0
987828.batch  batch                  Sep 13 16:22  10:50:15  04:21:09  04:21:09  13:22:39  4      1              3333440K  COMPLETED  compute-0
❯ ls -1 *.txt | tail
Selwyn_52a1_28720ConfigFile.txt
Selwyn_52a1_29813ConfigFile.txt
Selwyn_52a1_9938ConfigFile.txt
Selwyn_52a2_13223ConfigFile.txt
Selwyn_52a2_19487ConfigFile.txt
Selwyn_52a2_20661ConfigFile.txt
Selwyn_52a2_20993ConfigFile.txt
Selwyn_52a2_28276ConfigFile.txt
Selwyn_52a2_28426ConfigFile.txt
Selwyn_52a2_28720ConfigFile.txt
and the standard output confirms it
❯ tail slurmlogs/987828.out
Successfully processed Selwyn_52a1_29813ConfigFile.txt
Successfully processed Selwyn_52a1_9938ConfigFile.txt
Successfully processed Selwyn_52a2_13223ConfigFile.txt
Successfully processed Selwyn_52a2_19487ConfigFile.txt
Successfully processed Selwyn_52a2_20661ConfigFile.txt
Successfully processed Selwyn_52a2_20993ConfigFile.txt
Successfully processed Selwyn_52a2_28276ConfigFile.txt
Successfully processed Selwyn_52a2_28426ConfigFile.txt
Successfully processed Selwyn_52a2_28720ConfigFile.txt
Processing completed.
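A quick way to get totals instead of eyeballing the tail of the log is to count the two status lines the loop prints. This is only a sketch; the function name summarise_log is made up, and it relies on the exact "Successfully processed" / "Failed to process" wording shown above.

```shell
#!/bin/bash
# Count the status lines written by the processing loop in a Slurm log.
summarise_log() {
    local log="$1" ok failed
    # grep -c prints 0 but exits non-zero when nothing matches, hence `|| true`.
    ok=$(grep -c '^Successfully processed ' "$log" || true)
    failed=$(grep -c '^Failed to process ' "$log" || true)
    echo "succeeded=$ok failed=$failed"
}

# Only run when a log file is given as the first argument.
if [ -n "${1:-}" ]; then
    summarise_log "$1"
fi
```

For the job above this would be called with slurmlogs/987828.out as the argument.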
Fix was merged to main in https://github.com/DininduSenanayake/APSIM-eri-mahuika/pull/36
Restart protocol was introduced in https://github.com/DininduSenanayake/APSIM-eri-mahuika/pull/49
There can be instances where the for loop in 4-create-apsimx-files fails after processing only a few .txt files. Below is an example of such failures for two directories with a few thousand files each, where the failures occurred after the ~50% mark per folder:
set-1
set-3
set-4
I do not think re-running the for loop will skip the already processed files (there is no native checkpointing for Models --apply). Therefore, we need the for loop to carry on without killing the whole job after "one" single failure. Perhaps we can let it continue if the number of consecutive failures is < 10, moving the failed config files to FAILED.
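Since Models --apply has no checkpointing of its own, one possible way to make a re-run skip finished files is to keep a list of the ones that succeeded. The sketch below is an illustration only and is not the restart protocol from the pull request mentioned above; the file name processed.list and the helper names are hypothetical, and run_one is a placeholder for the real Models --apply call.

```shell
#!/bin/bash
# Hypothetical checkpoint file: one successfully processed config file name per line.
done_list="${DONE_LIST:-processed.list}"

# Return 0 if the file name is already recorded in the checkpoint file.
already_done() {
    [ -f "$done_list" ] && grep -Fxq -- "$1" "$done_list"
}

# Record a file name after a successful run.
mark_done() {
    echo "$1" >> "$done_list"
}

# Placeholder for the real command; replace the body with the actual call.
run_one() {
    echo "Processing $1"
}

# Skip files that are already recorded, otherwise run and record on success.
run_with_checkpoint() {
    local file="$1"
    if already_done "$file"; then
        echo "Skipping $file (already processed)"
        return 0
    fi
    if run_one "$file"; then
        mark_done "$file"
    else
        return 1
    fi
}
```

Inside the existing loop, process_file "$file" would be swapped for run_with_checkpoint "$file", so a resubmitted job only spends time on files that have not succeeded yet.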