Closed SorooshMani-NOAA closed 6 months ago
@yichengt900 this is the ticket where we're going to discuss running 600 member ensemble
@SorooshMani-NOAA a handful of runs failed with this MPI error:
+ pushd /lustre/hurricanes/florence_2018_91e01ecc-c18d-4537-979b-223d37519634/setup/ensemble.dir/runs/vortex_4_variable_korobov_499
/lustre/hurricanes/florence_2018_91e01ecc-c18d-4537-979b-223d37519634/setup/ensemble.dir/runs/vortex_4_variable_korobov_499 ~/ondemand-storm-workflow/singularity/scripts
+ mkdir -p outputs
+ mpirun -np 36 singularity exec --bind /lustre /lustre/singularity_images/solve.sif pschism_PAHM_TVD-VL 4
[sorooshmani-nhccolab2-00012-1-0023:23888] Impossible to open the file /lustre/.tmp/ompi.sorooshmani-nhccolab2-00012-1-0023.21145/pid.23888/contact.txt in write mode
[sorooshmani-nhccolab2-00012-1-0023:23888] [[65232,0],0] ORTE_ERROR_LOG: File open failure in file util/hnp_contact.c at line 91
[sorooshmani-nhccolab2-00012-1-0023:23888] PMIX ERROR: ERROR in file dstore_segment.c at line 207
[sorooshmani-nhccolab2-00012-1-0023:23888] PMIX ERROR: OUT-OF-RESOURCE in file dstore_base.c at line 541
[sorooshmani-nhccolab2-00012-1-0023:23888] PMIX ERROR: ERROR in file dstore_base.c at line 2436
[sorooshmani-nhccolab2-00012-1-0023:23888] PMIX ERROR: ERROR in file dstore_segment.c at line 207
[sorooshmani-nhccolab2-00012-1-0023:23888] PMIX ERROR: OUT-OF-RESOURCE in file dstore_base.c at line 541
[sorooshmani-nhccolab2-00012-1-0023:23888] PMIX ERROR: ERROR in file dstore_base.c at line 2436
[sorooshmani-nhccolab2-00012-1-0023:24334] PMIX ERROR: ERROR in file ../../../../../../src/mca/gds/ds12/gds_ds12_lock_pthread.c at line 169
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
getting local rank failed
--> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
orte_ess_init failed
--> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
ompi_mpi_init: ompi_rte_init failed
--> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[sorooshmani-nhccolab2-00012-1-0023:24334] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
[sorooshmani-nhccolab2-00012-1-0023:23888] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[sorooshmani-nhccolab2-00012-1-0023:23888] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[sorooshmani-nhccolab2-00012-1-0023:24331] PMIX ERROR: ERROR in file ../../../../../../src/mca/gds/ds12/gds_ds12_lock_pthread.c at line 169
[sorooshmani-nhccolab2-00012-1-0023:24361] PMIX ERROR: ERROR in file ../../../../../../src/mca/gds/ds12/gds_ds12_lock_pthread.c at line 169
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[65232,1],2]
Exit code: 1
--------------------------------------------------------------------------
[sorooshmani-nhccolab2-00012-1-0023:23888] 2 more processes have sent help message help-orte-runtime.txt / orte_init:startup:internal-failure
[sorooshmani-nhccolab2-00012-1-0023:23888] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Should I rerun them in small chunks, to avoid these errors?
Should I rerun them in small chunks, to avoid these errors?
I think so
@SorooshMani-NOAA ~50 ensemble runs from the first try are still running (more than 14 hours)!! And based on check_completion
, their % has not changed in the last few hours! Does that make sense? Should I keep them running or cancel them?
Only 32 runs (out of 601) completed in the first round, and the majority (500+) failed with the error mentioned above.
Looks like the ~50 runs mentioned in the previous comment are frozen, with no progress since yesterday, so I cancelled them to free ~88 nodes. Two notes from the Thursday tagup:
We can definitely transfer and run them on Hera, but do we have Hera computation now?
I'm not sure about the compiler issue; it is more likely a Singularity issue. I was told at some point that running many containers at the same time might result in some getting stuck. You could write a script to run them in batches of 10 or so.
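For example, a minimal Python sketch of the batching idea (the commands themselves are placeholders; the real ones would be the mpirun/singularity lines from the run scripts):

```python
import subprocess
from itertools import islice

def run_in_batches(commands, batch_size=10):
    """Run command lists in sequential batches, so at most batch_size
    containers are alive at any one time."""
    it = iter(commands)
    exit_codes = []
    while batch := list(islice(it, batch_size)):
        procs = [subprocess.Popen(cmd) for cmd in batch]  # launch one batch
        exit_codes += [p.wait() for p in procs]           # block until all finish
    return exit_codes

# hypothetical usage, one command per ensemble member:
# cmds = [["bash", f"runs/vortex_4_variable_korobov_{i}/run.sh"] for i in range(600)]
# run_in_batches(cmds, batch_size=10)
```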
We can definitely transfer and run them on Hera, but do we have Hera computation now?
@saeed-moghimi-noaa mentioned earlier today that it should not be an issue. Please correct me if I'm wrong.
Paths on Hera:
/scratch2/STI/coastal/Shared/hurricanes/florence_2018_OFCL_korobov_600/
/scratch2/STI/coastal/Shared/singularity_images/
/scratch2/STI/coastal/Fariborz.Daneshvar/my_scripts/submit_mutli_schism.sh
Required settings for running on Hera:
- Add account and wall-time options to the sbatch command (i.e., --account=coastal and --time=06:00:00)
- Add gnu to MODULES and update the openmpi version to what is available on Hera (i.e., MODULES="gnu openmpi/3.1.4" instead of the MODULES=openmpi/4.1.2 used on PW)
- Load the gnu and openmpi modules before executing the bash script
- Change --bind /lustre to --bind /scratch2
Here is an example bash script (submit_mutli_schism.sh
) with the updates mentioned above, to run two members:
#!/bin/bash
set -e
run_dir=$1
IMG=/scratch2/STI/coastal/Shared/singularity_images/solve.sif
SBATCH=/scratch2/STI/coastal/Fariborz.Daneshvar/ondemand-storm-workflow/singularity/scripts/schism.sbatch
SCHISM_EXEC='pschism_PAHM_TVD-VL'
# Environment
export SINGULARITY_BINDFLAGS="--bind /scratch2"
echo $run_dir
echo "Launching runs"
SCHISM_SHARED_ENV=""
SCHISM_SHARED_ENV+="ALL"
SCHISM_SHARED_ENV+=",IMG=$IMG"
SCHISM_SHARED_ENV+=",MODULES=gnu openmpi/3.1.4"
joblist=""
for i in 3 4; do
jobid=$(
sbatch --parsable --account=coastal --time=06:00:00 \
--export=$SCHISM_SHARED_ENV,SCHISM_DIR="$run_dir/setup/ensemble.dir/runs/vortex_4_variable_korobov_$i",SCHISM_EXEC=$SCHISM_EXEC \
$SBATCH
)
joblist+=":$jobid"
done
echo "Submitted ${joblist}"
Update on runtime: I first set --time=4:00:00
, but the runs did not complete in 4 hours (only ~85% done), so I increased it to 6 hours.
@SorooshMani-NOAA, @yichengt900 600 runs completed on Hera.
The combine_results
command failed with MemoryError
on a compute node with 40 cores (srun -N 1 -A coastal -n 40 -t 8:00:00 --pty bash
).
Here is the full message:
[2023-10-18 22:15:49,072] parsing.schism INFO : found 601 run directories with all the specified output patterns
Traceback (most recent call last):
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/bin/combine_results", line 8, in <module> sys.exit(main())
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/ensembleperturbation/client/combine_results.py", line 106, in main
combine_results(**parse_combine_results())
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/ensembleperturbation/client/combine_results.py", line 92, in combine_results parsed_data = combine_func(
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/ensembleperturbation/parsing/schism.py", line 1308, in convert_schism_output_files_to_adcirc_like results = combine_outputs(
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/ensembleperturbation/parsing/schism.py", line 1060, in combine_outputs
parsed_files = parse_schism_outputs(
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/ensembleperturbation/parsing/schism.py", line 977, in parse_schism_outputs
dataset = output_class.read_directory(
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/ensembleperturbation/parsing/schism.py", line 744, in read_directory
dataset = super().read_directory(directory, variables, parallel)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/ensembleperturbation/parsing/schism.py", line 663, in read_directory
ds = cls._calc_extermum(full_ds)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/ensembleperturbation/parsing/schism.py", line 683, in _calc_extermum arg_extrm_var = getattr(to_extrm_ary, cls.extermum_func)(dim='time').compute()
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/xarray/core/dataarray.py", line 1137, in compute
return new.load(**kwargs)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/xarray/core/dataarray.py", line 1111, in load
ds = self._to_temp_dataset().load(**kwargs)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/xarray/core/dataset.py", line 833, in load
evaluated_data = chunkmanager.compute(*lazy_data.values(), **kwargs)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/xarray/core/daskmanager.py", line 70, in compute
return compute(*data, **kwargs)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/dask/base.py", line 621, in compute
dsk = collections_to_dsk(collections, optimize_graph, **kwargs)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/dask/base.py", line 394, in collections_to_dsk
dsk = opt(dsk, keys, **kwargs)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/dask/array/optimization.py", line 51, in optimize
dsk = dsk.cull(set(keys))
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/dask/highlevelgraph.py", line 763, in cull
ret_key_deps.update(culled_deps)
MemoryError
I cannot request more than 40 cores; asking for 41 gives this error message: Unable to allocate resources: Requested node configuration is not available
As @yichengt900 suggested, I also tested salloc
on the compute node to run combine_results
manually as follows:
salloc --ntasks 120 --exclusive --qos=debug --time=00:30:00 --account=coastal
conda run python ondemand-storm-workflow/singularity/prep/files/combine_ensemble.py -d /scratch2/STI/coastal/Shared/hurricanes/florence_2018_OFCL_korobov_600/setup/ensemble.dir -t /scratch2/STI/coastal/Shared/hurricanes/florence_2018_OFCL_korobov_600/setup/ensemble.dir/track_files
But it failed with MemoryError
(see below):
[2023-10-19 17:18:06,096] parsing.schism INFO : found 601 run directories with all the specified output patterns
Traceback (most recent call last):
File "/scratch2/STI/coastal/Fariborz.Daneshvar/ondemand-storm-workflow/singularity/prep/files/combine_ensemble.py", line 31, in <module>
main(parser.parse_args())
File "/scratch2/STI/coastal/Fariborz.Daneshvar/ondemand-storm-workflow/singularity/prep/files/combine_ensemble.py", line 16, in main
output = combine_results(
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/ensembleperturbation/client/combine_results.py", line 92, in combine_results
parsed_data = combine_func(
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/ensembleperturbation/parsing/schism.py", line 1308, in convert_schism_output_files_to_adcirc_like
results = combine_outputs(
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/ensembleperturbation/parsing/schism.py", line 1060, in combine_outputs
parsed_files = parse_schism_outputs(
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/ensembleperturbation/parsing/schism.py", line 977, in parse_schism_outputs
dataset = output_class.read_directory(
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/ensembleperturbation/parsing/schism.py", line 744, in read_directory
dataset = super().read_directory(directory, variables, parallel)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/ensembleperturbation/parsing/schism.py", line 663, in read_directory
ds = cls._calc_extermum(full_ds)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/ensembleperturbation/parsing/schism.py", line 683, in _calc_extermum
arg_extrm_var = getattr(to_extrm_ary, cls.extermum_func)(dim='time').compute()
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/xarray/core/dataarray.py", line 1137, in compute
return new.load(**kwargs)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/xarray/core/dataarray.py", line 1111, in load
ds = self._to_temp_dataset().load(**kwargs)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/xarray/core/dataset.py", line 833, in load
evaluated_data = chunkmanager.compute(*lazy_data.values(), **kwargs)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/xarray/core/daskmanager.py", line 70, in compute
return compute(*data, **kwargs)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/dask/base.py", line 621, in compute
dsk = collections_to_dsk(collections, optimize_graph, **kwargs)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/dask/base.py", line 394, in collections_to_dsk
dsk = opt(dsk, keys, **kwargs)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/dask/array/optimization.py", line 51, in optimize
dsk = dsk.cull(set(keys))
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/dask/highlevelgraph.py", line 763, in cull
ret_key_deps.update(culled_deps)
MemoryError
@yichengt900 @SorooshMani-NOAA, I used 8 hours for salloc
(see the full command below):
salloc --ntasks 20 --mem=300GB --qos=batch --time=08:00:00 --account=coastal
The combine command failed with this segmentation fault in less than 2 hours:
[2023-10-23 20:07:51,760] parsing.schism INFO : found 601 run directories with all the specified output patterns
/tmp/tmp0ccf9puv: line 3: 116325 Segmentation fault (core dumped) python ondemand-storm-workflow/singularity/prep/files/combine_ensemble.py -d /scratch2/STI/coastal/Shared/hurricanes/florence_2018_OFCL_korobov_600/setup/ensemble.dir -t /scratch2/STI/coastal/Shared/hurricanes/florence_2018_OFCL_korobov_600/setup/ensemble.dir/track_files
ERROR conda.cli.main_run:execute(49): `conda run python ondemand-storm-workflow/singularity/prep/files/combine_ensemble.py -d /scratch2/STI/coastal/Shared/hurricanes/florence_2018_OFCL_korobov_600/setup/ensemble.dir -t /scratch2/STI/coastal/Shared/hurricanes/florence_2018_OFCL_korobov_600/setup/ensemble.dir/track_files` failed. (See above for error)
The second time it failed with KeyError: 'elevation'
after ~3 hours!
[2023-10-23 15:40:21,666] parsing.schism INFO : found 601 run directories with all the specified output patterns Traceback (most recent call last):
File "/scratch2/STI/coastal/Fariborz.Daneshvar/ondemand-storm-workflow/singularity/prep/files/combine_ensemble.py", line 31, in <module>
main(parser.parse_args())
File "/scratch2/STI/coastal/Fariborz.Daneshvar/ondemand-storm-workflow/singularity/prep/files/combine_ensemble.py", line 16, in main
output = combine_results(
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/ensembleperturbation/client/combine_results.py", line 92, in combine_results
parsed_data = combine_func(
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/ensembleperturbation/parsing/schism.py", line 1308, in convert_schism_output_files_to_adcirc_like
results = combine_outputs(
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/ensembleperturbation/parsing/schism.py", line 1060, in combine_outputs
parsed_files = parse_schism_outputs(
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/ensembleperturbation/parsing/schism.py", line 977, in parse_schism_outputs
dataset = output_class.read_directory(
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/ensembleperturbation/parsing/schism.py", line 744, in read_directory
dataset = super().read_directory(directory, variables, parallel)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/ensembleperturbation/parsing/schism.py", line 663, in read_directory
ds = cls._calc_extermum(full_ds)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/ensembleperturbation/parsing/schism.py", line 683, in _calc_extermum
arg_extrm_var = getattr(to_extrm_ary, cls.extermum_func)(dim='time').compute()
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/xarray/core/dataarray.py", line 1137, in compute
return new.load(**kwargs)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/xarray/core/dataarray.py", line 1111, in load
ds = self._to_temp_dataset().load(**kwargs)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/xarray/core/dataset.py", line 833, in load
evaluated_data = chunkmanager.compute(*lazy_data.values(), **kwargs)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/xarray/core/daskmanager.py", line 70, in compute
return compute(*data, **kwargs)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/dask/base.py", line 628, in compute
results = schedule(dsk, keys, **kwargs)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/xarray/core/indexing.py", line 487, in __array__
return np.asarray(self.get_duck_array(), dtype=dtype)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/xarray/core/indexing.py", line 490, in get_duck_array
return self.array.get_duck_array()
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/xarray/core/indexing.py", line 667, in get_duck_array
return self.array.get_duck_array()
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/xarray/core/indexing.py", line 554, in get_duck_array
array = self.array[self.key]
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/xarray/backends/netCDF4_.py", line
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/xarray/core/indexing.py", line 861, in explicit_indexing_adapter
result = raw_indexing_method(raw_key.tuple)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/xarray/backends/netCDF4_.py", line 112, in _getitem
original_array = self.get_array(needs_lock=False)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/xarray/backends/netCDF4_.py", line 92, in get_array
variable = ds.variables[self.variable_name]
KeyError: 'elevation'
ERROR conda.cli.main_run:execute(49): `conda run python ondemand-storm-workflow/singularity/prep/files/combine_ensemble.py -d /scratch2/STI/coastal/Shared/hurricanes/florence_2018_OFCL_korobov_600/setup/ensemble.dir -t /scratch2/STI/coastal/Shared/hurricanes/florence_2018_OFCL_korobov_600/setup/ensemble.dir/track_files` failed. (See above for error)
@FariborzDaneshvar-NOAA were all runs successful? Or are there any possible failures among the 600 runs?
@SorooshMani-NOAA based on check_completion
command, all runs completed successfully.
{
"runs": "not_started - 100%"
}
Is there any other way to check whether all runs completed successfully or not?
@FariborzDaneshvar-NOAA thanks for checking. You can also do a tail -n1
on the JCG
file in the outputs
for all the runs, setup/ensemble.dir/runs/*/outputs/JCG.out
(I'm actually not sure if the name was JCG
or something else).
The result should be that all the files end with something like etc. etc completed
But in any case, if check_completion
shows this, it's probably not an issue with the runs. You can also check the existence of maxelev
files for all runs, or check the nc
files for the existence of the required variables, using ncdump -h
on all the run results (combined with tail
and grep
to just check existence without printing a lot of stuff to screen)
@SorooshMani-NOAA Thanks for your feedback. tail -n1 JCG.out
returns the following message: JCG converged in 496 iterations
I checked the same output for another (Florence OFCL track) run on PW, which had a different number of iterations: JCG converged in 639 iterations
Can that be an issue?
I don't think that's an issue. It's just that different tracks require different number of iterations for convergence; which makes sense. Can you use a subset of those runs like 10 of them and try to combine and see what happens?
Also, based on these test loops, it looks like all output directories have out2d_1.nc
and maxelev.gr3
for i in ./*; do if ! test -f $i/outputs/out2d_1.nc; then echo $i; fi; done
--> Nothing!
for i in ./*; do if ! test -f $i/outputs/maxelev.gr3; then echo $i; fi; done
--> Nothing!
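The two loops above can also be folded into a small Python helper that reports any run directory missing a required output file (a sketch; the required-file list is just the two files checked above):

```python
from pathlib import Path

REQUIRED = ("outputs/out2d_1.nc", "outputs/maxelev.gr3")

def find_incomplete_runs(runs_dir, required=REQUIRED):
    """Return names of run directories missing any required output file."""
    missing = []
    for run in sorted(p for p in Path(runs_dir).iterdir() if p.is_dir()):
        if any(not (run / rel).is_file() for rel in required):
            missing.append(run.name)
    return missing

# e.g. find_incomplete_runs("setup/ensemble.dir/runs") -> [] when all runs wrote outputs
```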
I don't think that's an issue. It's just that different tracks require different number of iterations for convergence; which makes sense. Can you use a subset of those runs like 10 of them and try to combine and see what happens?
Agreed. William also had similar suggestions yesterday. We could start by trying a subset first, and for a large set of ensemble runs, we can consider having multiple files instead of one giant file.
@SorooshMani-NOAA and @yichengt900 thanks for your comments. Now running combine_results
for 50 members, will keep you posted...
I don't think that's an issue. It's just that different tracks require different number of iterations for convergence; which makes sense. Can you use a subset of those runs like 10 of them and try to combine and see what happens?
Agreed. William also had similar suggestions yesterday. We could start by trying a subset first, and for a large set of ensemble runs, we can consider having multiple files instead of one giant file.
I ran combine_results
for 51 runs (50 members + original track) on a compute node with 40 cores (srun -N 1 -A coastal -n 40 -t 6:00:00 --pty bash
)
It created perturbations.nc
, maxele.63.nc
, maxvel.63.nc
, and fort.63.nc
, but failed with Segmentation fault (core dumped)
while it was still writing fort.64.nc
@SorooshMani-NOAA @yichengt900 do we need fort.64.nc
? If yes, do you suggest using a smaller subset to resolve this issue?
[2023-10-25 14:36:57,729] parsing.schism INFO : parsing from "."
[2023-10-25 14:36:57,937] parsing.schism WARNING : could not find any run directories with all the specified output patterns
[2023-10-25 14:36:58,005] parsing.schism WARNING : could not find any run directories with all the specified output patterns
[2023-10-25 14:36:58,053] parsing.schism WARNING : could not find any run directories with all the specified output patterns
[2023-10-25 14:36:58,111] parsing.schism INFO : found 51 run directories with all the specified output patterns
[2023-10-25 14:37:05,257] parsing.schism WARNING : Files don't contain all the required variables!
[2023-10-25 14:37:05,362] parsing.schism INFO : found 51 run directories with all the specified output patterns
[2023-10-25 14:37:33,001] parsing.schism INFO : found 51 run directories with all the specified output patterns
[2023-10-25 14:37:33,039] parsing.schism WARNING : Files don't contain all the required variables!
[2023-10-25 14:37:33,092] parsing.schism INFO : found 51 run directories with all the specified output patterns
[2023-10-25 14:37:33,113] parsing.schism WARNING : Files don't contain all the required variables!
[2023-10-25 14:37:33,168] parsing.schism INFO : found 51 run directories with all the specified output patterns
[2023-10-25 15:15:36,005] parsing.schism INFO : found 51 run directories with all the specified output patterns
[2023-10-25 15:15:36,942] parsing.schism WARNING : Files don't contain all the required variables!
[2023-10-25 15:15:37,079] parsing.schism INFO : found 51 run directories with all the specified output patterns
[2023-10-25 16:12:58,883] parsing.schism INFO : found 51 run directories with all the specified output patterns
[2023-10-25 16:14:09,078] parsing.schism INFO : found 2 variable(s) in "['out2d_*.nc']": "max_elevation" (51, 673957), "max_elevation_times" (51, 673957)
[2023-10-25 16:14:09,488] parsing.schism INFO : subsetted 673957 out of 673957 total nodes (100.00%)
[2023-10-25 16:14:09,488] parsing.schism INFO : found 2 variable(s) in "['horizontalVelX_*.nc', 'horizontalVelY_*.nc']": "max_velocity" (51, 673957, 2), "max_velocity_times" (51, 673957, 2)
[2023-10-25 16:14:11,314] parsing.schism INFO : subsetted 673957 out of 673957 total nodes (100.00%)
[2023-10-25 16:14:11,338] parsing.schism INFO : writing to "analysis_dir/perturbations.nc"
[2023-10-25 16:14:11,572] parsing.schism INFO : writing to "analysis_dir/fort.63.nc"
[2023-10-25 16:57:30,406] parsing.schism INFO : writing to "analysis_dir/maxele.63.nc"
[2023-10-25 17:37:04,325] parsing.schism INFO : writing to "analysis_dir/maxvel.63.nc"
[2023-10-25 18:47:36,154] parsing.schism INFO : writing to "analysis_dir/fort.64.nc"
Segmentation fault (core dumped)
For our probabilistic analysis we just use perturbations and maxele; fort.64 is not needed, as far as I understand
For our probabilistic analysis we just use perturbations and maxele, fort 64 is not needed as far as I understand
Ok, if that's the case, I will continue combining results in chunks of 50 runs.
Using more memory (300GB) with one core reduced the computation time by ~1 hour (5 instead of 6), but I am still getting the Segmentation fault (core dumped)
message when writing fort.64.nc
!
(srun -N 1 -A coastal --mem 300GB -t 8:00:00 --pty bash
)
@FariborzDaneshvar-NOAA one (hacky) way to avoid going through the creation of all 4 files is to move all of the out2d and horizontal velocity output files into a separate folder, for example:
mkdir other_outs
for i in setup/ensemble.dir/runs/*/outputs;
do
dest=./other_outs/`dirname $i`
mkdir -p $dest
mv $i/out2d_*.nc $dest/
mv $i/horizontalVel*.nc $dest/
done
Or something along these lines. Before running this script you can test if it works fine by running this other one:
for i in setup/ensemble.dir/runs/*/outputs;
do
dest=./other_outs/`dirname $i`
echo "To create: $dest"
echo "$i/out2d_*.nc -> $dest/"
echo "$i/horizontalVel*.nc -> $dest/"
done
@SorooshMani-NOAA @yichengt900 combined_results
completed. Outputs in chunks of 50s (i.e., analysis_dir_501_550
) are here:
/scratch2/STI/coastal/Shared/hurricanes/florence_2018_OFCL_korobov_600/setup/ensemble.dir/
What's next?
On a large mem node can you use an ipython or jupyter notebook session to try to combine a couple of these combined results? You can directly use dask and open_mfdataset, etc. to avoid running into memory issues
The main thing we want to get to is the combined results of all, so that we can run the klpc analysis.
Alternatively, you can run the analysis on each of the 50-run combined results, then pick the subsets that come out of that for the final combine, and then rerun the final analysis on the combined subset.
Does that make sense?
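Something along these lines might work for the lazy combine (a sketch: the chunk-file pattern, output path, and variable names are assumptions; open_mfdataset keeps the data as lazy dask arrays instead of loading everything into memory at once):

```python
import glob
import xarray as xr

def combine_chunks(pattern="analysis_dir_*/maxele.63.nc",
                   out="maxele_all.63.nc"):
    """Lazily concatenate the per-chunk maxele files along a new 'run' dim."""
    paths = sorted(glob.glob(pattern))
    # open_mfdataset returns lazy dask-backed arrays; nothing is read into
    # memory until .compute() or .to_netcdf() is called
    ds = xr.open_mfdataset(paths, combine="nested", concat_dim="run")
    ds.to_netcdf(out)  # streamed write of the combined dataset
    return ds
```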
@SorooshMani-NOAA @yichengt900
combined_results
completed. Outputs in chunks of 50s (i.e.,analysis_dir_501_550
) are here:/scratch2/STI/coastal/Shared/hurricanes/florence_2018_OFCL_korobov_600/setup/ensemble.dir/
What's next?
@FariborzDaneshvar-NOAA , What do you mean? What would you suggest?
Thanks @SorooshMani-NOAA . I also think it's a very good opportunity to verify if Schism still produces physically reasonable results from large ensemble runs (which I assume it should).
@SorooshMani-NOAA @yichengt900
combined_results
completed. Outputs in chunks of 50s (i.e.,analysis_dir_501_550
) are here:/scratch2/STI/coastal/Shared/hurricanes/florence_2018_OFCL_korobov_600/setup/ensemble.dir/
What's next?
@FariborzDaneshvar-NOAA, what do you mean? What would you suggest?
@saeed-moghimi-noaa I meant what would be the next step after combining results in separate sets!
On a large mem node can you use an ipython or jupyter notebook session to try to combine a couple of these combined results? You can directly use dask and open_mfdataset, etc. to avoid running into memory issues
The main thing we want to get to is the combined results of all, so that we can run the klpc analysis.
Conversely, you can run the analysis on each of the 50-combined results and then pick the subsets that are created as a result of that in order to do the final combine. And the rerun the final analysis on the combined subset.
Does that make sense?
Using a compute node with 300GB
of memory, I was able to open and concat only 2 sets (the first 101 runs) before getting a memory error. Adding the third set (runs 101-150) resulted in this memory error:
MemoryError: Unable to allocate 177. GiB for an array with shape (151, 468, 673957) and data type float32
@SorooshMani-NOAA Can I use a node with more memory on Hera? Alternatively, how can I use dask / open_mfdataset?
I was able to combine and make one maxele.63.nc
for all 601 runs. Looks like perturbations.nc
already had 600 cases, so I will be able to do post-process analysis...
Thanks, @FariborzDaneshvar-NOAA. That's great. I am also testing analyze_ensemble.py
(file location: /scratch2/STI/coastal/Yi-cheng.Teng/tmp
) with some modifications to handle your outputs in chunks. I'll keep you posted. If you're interested in my modifications, please check lines 406-411
for how to use Dask jobqueue
on Hera
, and line 151
for how to read chunks of files using xarray.open_mfdataset
.
@FariborzDaneshvar-NOAA , I was able to run analyze_ensemble.py
manually on Hera using the approach I mentioned here. At this moment I skip the make_sensitivities_plot
step and only generated results for manning n = 0.025. Results and surrogate are located here: /scratch2/STI/coastal/Shared/hurricanes/florence_2018_OFCL_korobov_600/setup/ensemble.dir/analyze/linear_k1_p1_n0.025
.
@FariborzDaneshvar-NOAA , I was able to run
analyze_ensemble.py
manually on Hera using the approach I mentioned here. At this moment I skip themake_sensitivities_plot
step and only generated results for manning n = 0.025. Results and surrogate are located here:/scratch2/STI/coastal/Shared/hurricanes/florence_2018_OFCL_korobov_600/setup/ensemble.dir/analyze/linear_k1_p1_n0.025
.
Great! thanks @yichengt900 for looking into it, much appreciated.
This task was created to compare the performance of the probabilistic model with P-Surge. I will close this ticket since we are no longer using this approach for comparison; instead, Convert PSurge to SCHISM-Like .nc #153 was created for comparing probabilistic models.
We'd like to validate the quasi-Monte Carlo sampling approach we take for probabilistic results in ensemble perturbation. To do so, we're going to run a surge-only large ensemble and then compare the results against the other approach with a small member count.