Closed SorooshMani-NOAA closed 6 months ago
@yichengt900 this is the ticket where we're going to discuss running 600 member ensemble
@SorooshMani-NOAA a handful of runs failed with this MPI error:
+ pushd /lustre/hurricanes/florence_2018_91e01ecc-c18d-4537-979b-223d37519634/setup/ensemble.dir/runs/vortex_4_variable_korobov_499
/lustre/hurricanes/florence_2018_91e01ecc-c18d-4537-979b-223d37519634/setup/ensemble.dir/runs/vortex_4_variable_korobov_499 ~/ondemand-storm-workflow/singularity/scripts
+ mkdir -p outputs
+ mpirun -np 36 singularity exec --bind /lustre /lustre/singularity_images/solve.sif pschism_PAHM_TVD-VL 4
[sorooshmani-nhccolab2-00012-1-0023:23888] Impossible to open the file /lustre/.tmp/ompi.sorooshmani-nhccolab2-00012-1-0023.21145/pid.23888/contact.txt in write mode
[sorooshmani-nhccolab2-00012-1-0023:23888] [[65232,0],0] ORTE_ERROR_LOG: File open failure in file util/hnp_contact.c at line 91
[sorooshmani-nhccolab2-00012-1-0023:23888] PMIX ERROR: ERROR in file dstore_segment.c at line 207
[sorooshmani-nhccolab2-00012-1-0023:23888] PMIX ERROR: OUT-OF-RESOURCE in file dstore_base.c at line 541
[sorooshmani-nhccolab2-00012-1-0023:23888] PMIX ERROR: ERROR in file dstore_base.c at line 2436
[sorooshmani-nhccolab2-00012-1-0023:23888] PMIX ERROR: ERROR in file dstore_segment.c at line 207
[sorooshmani-nhccolab2-00012-1-0023:23888] PMIX ERROR: OUT-OF-RESOURCE in file dstore_base.c at line 541
[sorooshmani-nhccolab2-00012-1-0023:23888] PMIX ERROR: ERROR in file dstore_base.c at line 2436
[sorooshmani-nhccolab2-00012-1-0023:24334] PMIX ERROR: ERROR in file ../../../../../../src/mca/gds/ds12/gds_ds12_lock_pthread.c at line 169
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
getting local rank failed
--> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
orte_ess_init failed
--> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
ompi_mpi_init: ompi_rte_init failed
--> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[sorooshmani-nhccolab2-00012-1-0023:24334] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
[sorooshmani-nhccolab2-00012-1-0023:23888] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[sorooshmani-nhccolab2-00012-1-0023:23888] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[sorooshmani-nhccolab2-00012-1-0023:24331] PMIX ERROR: ERROR in file ../../../../../../src/mca/gds/ds12/gds_ds12_lock_pthread.c at line 169
[sorooshmani-nhccolab2-00012-1-0023:24361] PMIX ERROR: ERROR in file ../../../../../../src/mca/gds/ds12/gds_ds12_lock_pthread.c at line 169
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[65232,1],2]
Exit code: 1
--------------------------------------------------------------------------
[sorooshmani-nhccolab2-00012-1-0023:23888] 2 more processes have sent help message help-orte-runtime.txt / orte_init:startup:internal-failure
[sorooshmani-nhccolab2-00012-1-0023:23888] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Should I rerun them in small chunks, to avoid these errors?
Should I rerun them in small chunks, to avoid these errors?
I think so
@SorooshMani-NOAA ~50 ensemble runs from the first try are still running (more than 14 hours)!! And based on check_completion
, their % has not changed in the last few hours! Does that make sense? Should I keep them running or cancel them?
Only 32 runs (out of 601) completed in the first round, and the majority (500+) failed with the error mentioned above.
Looks like the ~50 runs mentioned in the previous comment are frozen, with no progress since yesterday, so I cancelled them to free ~88 nodes. Two notes from the Thursday tagup:
We can definitely transfer and run them on Hera, but do we have Hera computation now?
I'm not sure about the compiler issue; it is more likely a Singularity issue. I was told at some point that running many containers at the same time might result in some getting stuck. You could write a script to run them in batches of 10 or so.
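For example, a minimal Python sketch of the batching idea (the commands themselves are placeholders; the real ones would be the mpirun/singularity lines from the run scripts):

```python
import subprocess
from itertools import islice

def run_in_batches(commands, batch_size=10):
    """Run command lists in sequential batches, so at most batch_size
    containers are alive at any one time."""
    it = iter(commands)
    exit_codes = []
    while batch := list(islice(it, batch_size)):
        procs = [subprocess.Popen(cmd) for cmd in batch]  # launch one batch
        exit_codes += [p.wait() for p in procs]           # block until all finish
    return exit_codes

# hypothetical usage, one command per ensemble member:
# cmds = [["bash", f"runs/vortex_4_variable_korobov_{i}/run.sh"] for i in range(600)]
# run_in_batches(cmds, batch_size=10)
```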
We can definitely transfer and run them on Hera, but do we have Hera computation now?
@saeed-moghimi-noaa mentioned earlier today that it should not be an issue. Please correct me if I'm wrong.
Paths on Hera:
/scratch2/STI/coastal/Shared/hurricanes/florence_2018_OFCL_korobov_600/
/scratch2/STI/coastal/Shared/singularity_images/
/scratch2/STI/coastal/Fariborz.Daneshvar/my_scripts/submit_mutli_schism.sh
Required settings for running on Hera:
- Add account and wall-time options to the sbatch command (i.e., --account=coastal and --time=06:00:00)
- Add gnu to MODULES and update the openmpi version to what is available on Hera (i.e., MODULES="gnu openmpi/3.1.4" instead of the MODULES=openmpi/4.1.2 used on PW)
- Load the gnu and openmpi modules before executing the bash script
- Change --bind /lustre to --bind /scratch2
Here is an example bash script (submit_mutli_schism.sh
) with the updates mentioned above, to run two members:
#!/bin/bash
set -e
run_dir=$1
IMG=/scratch2/STI/coastal/Shared/singularity_images/solve.sif
SBATCH=/scratch2/STI/coastal/Fariborz.Daneshvar/ondemand-storm-workflow/singularity/scripts/schism.sbatch
SCHISM_EXEC='pschism_PAHM_TVD-VL'
# Environment
export SINGULARITY_BINDFLAGS="--bind /scratch2"
echo $run_dir
echo "Launching runs"
SCHISM_SHARED_ENV=""
SCHISM_SHARED_ENV+="ALL"
SCHISM_SHARED_ENV+=",IMG=$IMG"
SCHISM_SHARED_ENV+=",MODULES=gnu openmpi/3.1.4"
joblist=""
for i in 3 4; do
jobid=$(
sbatch --parsable --account=coastal --time=06:00:00 \
--export=$SCHISM_SHARED_ENV,SCHISM_DIR="$run_dir/setup/ensemble.dir/runs/vortex_4_variable_korobov_$i",SCHISM_EXEC=$SCHISM_EXEC \
$SBATCH
)
joblist+=":$jobid"
done
echo "Submitted ${joblist}"
Update on runtime: I first set --time=4:00:00
, but the runs did not complete in 4 hours (only ~85% done), so I increased it to 6 hours.
@SorooshMani-NOAA, @yichengt900 600 runs completed on Hera.
The combine_results
command failed with MemoryError
on a compute node with 40 cores (srun -N 1 -A coastal -n 40 -t 8:00:00 --pty bash
).
Here is the full message:
[2023-10-18 22:15:49,072] parsing.schism INFO : found 601 run directories with all the specified output patterns
Traceback (most recent call last):
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/bin/combine_results", line 8, in <module> sys.exit(main())
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/ensembleperturbation/client/combine_results.py", line 106, in main
combine_results(**parse_combine_results())
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/ensembleperturbation/client/combine_results.py", line 92, in combine_results parsed_data = combine_func(
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/ensembleperturbation/parsing/schism.py", line 1308, in convert_schism_output_files_to_adcirc_like results = combine_outputs(
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/ensembleperturbation/parsing/schism.py", line 1060, in combine_outputs
parsed_files = parse_schism_outputs(
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/ensembleperturbation/parsing/schism.py", line 977, in parse_schism_outputs
dataset = output_class.read_directory(
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/ensembleperturbation/parsing/schism.py", line 744, in read_directory
dataset = super().read_directory(directory, variables, parallel)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/ensembleperturbation/parsing/schism.py", line 663, in read_directory
ds = cls._calc_extermum(full_ds)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/ensembleperturbation/parsing/schism.py", line 683, in _calc_extermum arg_extrm_var = getattr(to_extrm_ary, cls.extermum_func)(dim='time').compute()
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/xarray/core/dataarray.py", line 1137, in compute
return new.load(**kwargs)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/xarray/core/dataarray.py", line 1111, in load
ds = self._to_temp_dataset().load(**kwargs)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/xarray/core/dataset.py", line 833, in load
evaluated_data = chunkmanager.compute(*lazy_data.values(), **kwargs)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/xarray/core/daskmanager.py", line 70, in compute
return compute(*data, **kwargs)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/dask/base.py", line 621, in compute
dsk = collections_to_dsk(collections, optimize_graph, **kwargs)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/dask/base.py", line 394, in collections_to_dsk
dsk = opt(dsk, keys, **kwargs)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/dask/array/optimization.py", line 51, in optimize
dsk = dsk.cull(set(keys))
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/dask/highlevelgraph.py", line 763, in cull
ret_key_deps.update(culled_deps)
MemoryError
I cannot request more than 40 cores; asking for 41 gives this error message: Unable to allocate resources: Requested node configuration is not available
As @yichengt900 suggested, I also tested salloc
on the compute node to run combine_results
manually as follows:
salloc --ntasks 120 --exclusive --qos=debug --time=00:30:00 --account=coastal
conda run python ondemand-storm-workflow/singularity/prep/files/combine_ensemble.py -d /scratch2/STI/coastal/Shared/hurricanes/florence_2018_OFCL_korobov_600/setup/ensemble.dir -t /scratch2/STI/coastal/Shared/hurricanes/florence_2018_OFCL_korobov_600/setup/ensemble.dir/track_files
But it failed with MemoryError
(see below):
[2023-10-19 17:18:06,096] parsing.schism INFO : found 601 run directories with all the specified output patterns
Traceback (most recent call last):
File "/scratch2/STI/coastal/Fariborz.Daneshvar/ondemand-storm-workflow/singularity/prep/files/combine_ensemble.py", line 31, in <module>
main(parser.parse_args())
File "/scratch2/STI/coastal/Fariborz.Daneshvar/ondemand-storm-workflow/singularity/prep/files/combine_ensemble.py", line 16, in main
output = combine_results(
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/ensembleperturbation/client/combine_results.py", line 92, in combine_results
parsed_data = combine_func(
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/ensembleperturbation/parsing/schism.py", line 1308, in convert_schism_output_files_to_adcirc_like
results = combine_outputs(
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/ensembleperturbation/parsing/schism.py", line 1060, in combine_outputs
parsed_files = parse_schism_outputs(
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/ensembleperturbation/parsing/schism.py", line 977, in parse_schism_outputs
dataset = output_class.read_directory(
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/ensembleperturbation/parsing/schism.py", line 744, in read_directory
dataset = super().read_directory(directory, variables, parallel)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/ensembleperturbation/parsing/schism.py", line 663, in read_directory
ds = cls._calc_extermum(full_ds)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/ensembleperturbation/parsing/schism.py", line 683, in _calc_extermum
arg_extrm_var = getattr(to_extrm_ary, cls.extermum_func)(dim='time').compute()
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/xarray/core/dataarray.py", line 1137, in compute
return new.load(**kwargs)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/xarray/core/dataarray.py", line 1111, in load
ds = self._to_temp_dataset().load(**kwargs)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/xarray/core/dataset.py", line 833, in load
evaluated_data = chunkmanager.compute(*lazy_data.values(), **kwargs)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/xarray/core/daskmanager.py", line 70, in compute
return compute(*data, **kwargs)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/dask/base.py", line 621, in compute
dsk = collections_to_dsk(collections, optimize_graph, **kwargs)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/dask/base.py", line 394, in collections_to_dsk
dsk = opt(dsk, keys, **kwargs)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/dask/array/optimization.py", line 51, in optimize
dsk = dsk.cull(set(keys))
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/dask/highlevelgraph.py", line 763, in cull
ret_key_deps.update(culled_deps)
MemoryError
@yichengt900 @SorooshMani-NOAA, I used 8 hours for salloc
(see the full command below):
salloc --ntasks 20 --mem=300GB --qos=batch --time=08:00:00 --account=coastal
The combine command failed with this segmentation fault in less than 2 hours:
[2023-10-23 20:07:51,760] parsing.schism INFO : found 601 run directories with all the specified output patterns
/tmp/tmp0ccf9puv: line 3: 116325 Segmentation fault (core dumped) python ondemand-storm-workflow/singularity/prep/files/combine_ensemble.py -d /scratch2/STI/coastal/Shared/hurricanes/florence_2018_OFCL_korobov_600/setup/ensemble.dir -t /scratch2/STI/coastal/Shared/hurricanes/florence_2018_OFCL_korobov_600/setup/ensemble.dir/track_files
ERROR conda.cli.main_run:execute(49): `conda run python ondemand-storm-workflow/singularity/prep/files/combine_ensemble.py -d /scratch2/STI/coastal/Shared/hurricanes/florence_2018_OFCL_korobov_600/setup/ensemble.dir -t /scratch2/STI/coastal/Shared/hurricanes/florence_2018_OFCL_korobov_600/setup/ensemble.dir/track_files` failed. (See above for error)
The second time it failed with KeyError: 'elevation'
after ~3 hours!
[2023-10-23 15:40:21,666] parsing.schism INFO : found 601 run directories with all the specified output patterns Traceback (most recent call last):
File "/scratch2/STI/coastal/Fariborz.Daneshvar/ondemand-storm-workflow/singularity/prep/files/combine_ensemble.py", line 31, in <module>
main(parser.parse_args())
File "/scratch2/STI/coastal/Fariborz.Daneshvar/ondemand-storm-workflow/singularity/prep/files/combine_ensemble.py", line 16, in main
output = combine_results(
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/ensembleperturbation/client/combine_results.py", line 92, in combine_results
parsed_data = combine_func(
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/ensembleperturbation/parsing/schism.py", line 1308, in convert_schism_output_files_to_adcirc_like
results = combine_outputs(
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/ensembleperturbation/parsing/schism.py", line 1060, in combine_outputs
parsed_files = parse_schism_outputs(
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/ensembleperturbation/parsing/schism.py", line 977, in parse_schism_outputs
dataset = output_class.read_directory(
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/ensembleperturbation/parsing/schism.py", line 744, in read_directory
dataset = super().read_directory(directory, variables, parallel)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/ensembleperturbation/parsing/schism.py", line 663, in read_directory
ds = cls._calc_extermum(full_ds)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/ensembleperturbation/parsing/schism.py", line 683, in _calc_extermum
arg_extrm_var = getattr(to_extrm_ary, cls.extermum_func)(dim='time').compute()
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/xarray/core/dataarray.py", line 1137, in compute
return new.load(**kwargs)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/xarray/core/dataarray.py", line 1111, in load
ds = self._to_temp_dataset().load(**kwargs)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/xarray/core/dataset.py", line 833, in load
evaluated_data = chunkmanager.compute(*lazy_data.values(), **kwargs)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/xarray/core/daskmanager.py", line 70, in compute
return compute(*data, **kwargs)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/dask/base.py", line 628, in compute
results = schedule(dsk, keys, **kwargs)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/xarray/core/indexing.py", line 487, in __array__
return np.asarray(self.get_duck_array(), dtype=dtype)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/xarray/core/indexing.py", line 490, in get_duck_array
return self.array.get_duck_array()
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/xarray/core/indexing.py", line 667, in get_duck_array
return self.array.get_duck_array()
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/xarray/core/indexing.py", line 554, in get_duck_array
array = self.array[self.key]
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/xarray/backends/netCDF4_.py", line
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/xarray/core/indexing.py", line 861, in explicit_indexing_adapter
result = raw_indexing_method(raw_key.tuple)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/xarray/backends/netCDF4_.py", line 112, in _getitem
original_array = self.get_array(needs_lock=False)
File "/scratch2/STI/coastal/Fariborz.Daneshvar/miniconda3/envs/nhc_colab/lib/python3.10/site-packages/xarray/backends/netCDF4_.py", line 92, in get_array
variable = ds.variables[self.variable_name]
KeyError: 'elevation'
ERROR conda.cli.main_run:execute(49): `conda run python ondemand-storm-workflow/singularity/prep/files/combine_ensemble.py -d /scratch2/STI/coastal/Shared/hurricanes/florence_2018_OFCL_korobov_600/setup/ensemble.dir -t /scratch2/STI/coastal/Shared/hurricanes/florence_2018_OFCL_korobov_600/setup/ensemble.dir/track_files` failed. (See above for error)
@FariborzDaneshvar-NOAA were all runs successful? Or are there any possible failures among the 600 runs?
@SorooshMani-NOAA based on check_completion
command, all runs completed successfully.
{
"runs": "not_started - 100%"
}
Is there any other way to check whether all runs completed successfully or not?
@FariborzDaneshvar-NOAA thanks for checking. You can also do a tail -n1
on the JCG
file in the outputs
for all the runs, setup/ensemble.dir/runs/*/outputs/JCG.out
(I'm actually not sure if the name was JCG
or something else).
The result should be that all the files end with something like etc. etc completed
But in any case, if check_completion
shows this, it's probably not an issue with the runs. You can also check the existence of maxelev
files for all runs, or check the nc
files for the existence of the required variables, using ncdump -h
on all the run results (combined with tail
and grep
to just check existence without printing a lot of stuff to screen)
@SorooshMani-NOAA Thanks for your feedback. tail -n1 JCG.out
returns the following message: JCG converged in 496 iterations
I checked the same output for another (Florence OFCL track) run on PW, which had a different number of iterations: JCG converged in 639 iterations
Can that be an issue?
I don't think that's an issue. It's just that different tracks require different number of iterations for convergence; which makes sense. Can you use a subset of those runs like 10 of them and try to combine and see what happens?
Also, based on these test loops, it looks like all output directories have out2d_1.nc
and maxelev.gr3
for i in ./*; do if ! test -f $i/outputs/out2d_1.nc; then echo $i; fi; done
--> Nothing!
for i in ./*; do if ! test -f $i/outputs/maxelev.gr3; then echo $i; fi; done
--> Nothing!
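The two loops above can also be folded into a small Python helper that reports any run directory missing a required output file (a sketch; the required-file list is just the two files checked above):

```python
from pathlib import Path

REQUIRED = ("outputs/out2d_1.nc", "outputs/maxelev.gr3")

def find_incomplete_runs(runs_dir, required=REQUIRED):
    """Return names of run directories missing any required output file."""
    missing = []
    for run in sorted(p for p in Path(runs_dir).iterdir() if p.is_dir()):
        if any(not (run / rel).is_file() for rel in required):
            missing.append(run.name)
    return missing

# e.g. find_incomplete_runs("setup/ensemble.dir/runs") -> [] when all runs wrote outputs
```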
I don't think that's an issue. It's just that different tracks require different number of iterations for convergence; which makes sense. Can you use a subset of those runs like 10 of them and try to combine and see what happens?
Agreed. William also had similar suggestions yesterday. We could start by trying a subset first, and for a large set of ensemble runs, we can consider having multiple files instead of one giant file.
@SorooshMani-NOAA and @yichengt900 thanks for your comments. Now running combine_results
for 50 members, will keep you posted...
I don't think that's an issue. It's just that different tracks require different number of iterations for convergence; which makes sense. Can you use a subset of those runs like 10 of them and try to combine and see what happens?
Agreed. William also had similar suggestions yesterday. We could start by trying a subset first, and for a large set of ensemble runs, we can consider having multiple files instead of one giant file.
I ran combine_results
for 51 runs (50 members + original track) on a compute node with 40 cores (srun -N 1 -A coastal -n 40 -t 6:00:00 --pty bash
)
It created perturbations.nc
, maxele.63.nc
, maxvel.63.nc
, and fort.63.nc
, but failed with Segmentation fault (core dumped)
while it was still writing fort.64.nc
@SorooshMani-NOAA @yichengt900 do we need fort.64.nc
? If yes, do you suggest using a smaller subset to resolve this issue?
[2023-10-25 14:36:57,729] parsing.schism INFO : parsing from "."
[2023-10-25 14:36:57,937] parsing.schism WARNING : could not find any run directories with all the specified output patterns
[2023-10-25 14:36:58,005] parsing.schism WARNING : could not find any run directories with all the specified output patterns
[2023-10-25 14:36:58,053] parsing.schism WARNING : could not find any run directories with all the specified output patterns
[2023-10-25 14:36:58,111] parsing.schism INFO : found 51 run directories with all the specified output patterns
[2023-10-25 14:37:05,257] parsing.schism WARNING : Files don't contain all the required variables!
[2023-10-25 14:37:05,362] parsing.schism INFO : found 51 run directories with all the specified output patterns
[2023-10-25 14:37:33,001] parsing.schism INFO : found 51 run directories with all the specified output patterns
[2023-10-25 14:37:33,039] parsing.schism WARNING : Files don't contain all the required variables!
[2023-10-25 14:37:33,092] parsing.schism INFO : found 51 run directories with all the specified output patterns
[2023-10-25 14:37:33,113] parsing.schism WARNING : Files don't contain all the required variables!
[2023-10-25 14:37:33,168] parsing.schism INFO : found 51 run directories with all the specified output patterns
[2023-10-25 15:15:36,005] parsing.schism INFO : found 51 run directories with all the specified output patterns
[2023-10-25 15:15:36,942] parsing.schism WARNING : Files don't contain all the required variables!
[2023-10-25 15:15:37,079] parsing.schism INFO : found 51 run directories with all the specified output patterns
[2023-10-25 16:12:58,883] parsing.schism INFO : found 51 run directories with all the specified output patterns
[2023-10-25 16:14:09,078] parsing.schism INFO : found 2 variable(s) in "['out2d_*.nc']": "max_elevation" (51, 673957), "max_elevation_times" (51, 673957)
[2023-10-25 16:14:09,488] parsing.schism INFO : subsetted 673957 out of 673957 total nodes (100.00%)
[2023-10-25 16:14:09,488] parsing.schism INFO : found 2 variable(s) in "['horizontalVelX_*.nc', 'horizontalVelY_*.nc']": "max_velocity" (51, 673957, 2), "max_velocity_times" (51, 673957, 2)
[2023-10-25 16:14:11,314] parsing.schism INFO : subsetted 673957 out of 673957 total nodes (100.00%)
[2023-10-25 16:14:11,338] parsing.schism INFO : writing to "analysis_dir/perturbations.nc"
[2023-10-25 16:14:11,572] parsing.schism INFO : writing to "analysis_dir/fort.63.nc"
[2023-10-25 16:57:30,406] parsing.schism INFO : writing to "analysis_dir/maxele.63.nc"
[2023-10-25 17:37:04,325] parsing.schism INFO : writing to "analysis_dir/maxvel.63.nc"
[2023-10-25 18:47:36,154] parsing.schism INFO : writing to "analysis_dir/fort.64.nc"
Segmentation fault (core dumped)
For our probabilistic analysis we just use perturbations and maxele; fort.64 is not needed, as far as I understand
For our probabilistic analysis we just use perturbations and maxele, fort 64 is not needed as far as I understand
Ok, if that's the case, I will continue combining results in chunks of 50 runs.
Using more memory (300GB) with one core reduced the computation time by ~1 hour (5 instead of 6), but I am still getting the Segmentation fault (core dumped)
message when writing fort.64.nc
!
(srun -N 1 -A coastal --mem 300GB -t 8:00:00 --pty bash
)
@FariborzDaneshvar-NOAA one (hacky) way to avoid going through the creation of all 4 files is to move all of the out2d and horizontal velocity output files into a separate folder, for example:
mkdir other_outs
for i in setup/ensemble.dir/runs/*/outputs;
do
dest=./other_outs/`dirname $i`
mkdir -p $dest
mv $i/out2d_*.nc $dest/
mv $i/horizontalVel*.nc $dest/
done
Or something along these lines. Before running this script you can test if it works fine by running this other one:
for i in setup/ensemble.dir/runs/*/outputs;
do
dest=./other_outs/`dirname $i`
echo "To create: $dest"
echo "$i/out2d_*.nc -> $dest/"
echo "$i/horizontalVel*.nc -> $dest/"
done
@SorooshMani-NOAA @yichengt900 combined_results
completed. Outputs in chunks of 50s (i.e., analysis_dir_501_550
) are here:
/scratch2/STI/coastal/Shared/hurricanes/florence_2018_OFCL_korobov_600/setup/ensemble.dir/
What's next?
On a large mem node can you use an ipython or jupyter notebook session to try to combine a couple of these combined results? You can directly use dask and open_mfdataset, etc. to avoid running into memory issues
The main thing we want to get to is the combined results of all, so that we can run the klpc analysis.
Alternatively, you can run the analysis on each of the 50-run combined results, then pick the subsets that come out of that for the final combine, and then rerun the final analysis on the combined subset.
Does that make sense?
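Something along these lines might work for the lazy combine (a sketch: the chunk-file pattern, output path, and variable names are assumptions; open_mfdataset keeps the data as lazy dask arrays instead of loading everything into memory at once):

```python
import glob
import xarray as xr

def combine_chunks(pattern="analysis_dir_*/maxele.63.nc",
                   out="maxele_all.63.nc"):
    """Lazily concatenate the per-chunk maxele files along a new 'run' dim."""
    paths = sorted(glob.glob(pattern))
    # open_mfdataset returns lazy dask-backed arrays; nothing is read into
    # memory until .compute() or .to_netcdf() is called
    ds = xr.open_mfdataset(paths, combine="nested", concat_dim="run")
    ds.to_netcdf(out)  # streamed write of the combined dataset
    return ds
```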
@SorooshMani-NOAA @yichengt900
combined_results
completed. Outputs in chunks of 50s (i.e.,analysis_dir_501_550
) are here:/scratch2/STI/coastal/Shared/hurricanes/florence_2018_OFCL_korobov_600/setup/ensemble.dir/
What's next?
@FariborzDaneshvar-NOAA , What do you mean? What would you suggest?
Thanks @SorooshMani-NOAA . I also think it's a very good opportunity to verify if Schism still produces physically reasonable results from large ensemble runs (which I assume it should).
@SorooshMani-NOAA @yichengt900
combined_results
completed. Outputs in chunks of 50s (i.e.,analysis_dir_501_550
) are here:/scratch2/STI/coastal/Shared/hurricanes/florence_2018_OFCL_korobov_600/setup/ensemble.dir/
What's next?
@FariborzDaneshvar-NOAA, what do you mean? What would you suggest?
@saeed-moghimi-noaa I meant what would be the next step after combining results in separate sets!
On a large mem node can you use an ipython or jupyter notebook session to try to combine a couple of these combined results? You can directly use dask and open_mfdataset, etc. to avoid running into memory issues
The main thing we want to get to is the combined results of all, so that we can run the klpc analysis.
Conversely, you can run the analysis on each of the 50-combined results and then pick the subsets that are created as a result of that in order to do the final combine. And the rerun the final analysis on the combined subset.
Does that make sense?
Using a compute node with 300GB
of memory, I was able to open and concat only 2 sets (the first 101 runs) before getting a memory error. Adding the third set (runs 101-150) resulted in this memory error:
MemoryError: Unable to allocate 177. GiB for an array with shape (151, 468, 673957) and data type float32
@SorooshMani-NOAA Can I use a node with more memory on Hera? Alternatively, how can I use dask / open_mfdataset?
I was able to combine and make one maxele.63.nc
for all 601 runs. Looks like perturbations.nc
already had 600 cases, so I will be able to do post-process analysis...
Thanks, @FariborzDaneshvar-NOAA. That's great. I am also testing analyze_ensemble.py
(file location: /scratch2/STI/coastal/Yi-cheng.Teng/tmp
) with some modifications to handle your outputs in chunks. I'll keep you posted. If you're interested in my modifications, please check lines 406-411
for how to use Dask jobqueue
on Hera
, and line 151
for how to read chunks of files using xarray.open_mfdataset
.
@FariborzDaneshvar-NOAA , I was able to run analyze_ensemble.py
manually on Hera using the approach I mentioned here. At this moment I skip the make_sensitivities_plot
step and only generated results for manning n = 0.025. Results and surrogate are located here: /scratch2/STI/coastal/Shared/hurricanes/florence_2018_OFCL_korobov_600/setup/ensemble.dir/analyze/linear_k1_p1_n0.025
.
@FariborzDaneshvar-NOAA , I was able to run
analyze_ensemble.py
manually on Hera using the approach I mentioned here. At this moment I skip themake_sensitivities_plot
step and only generated results for manning n = 0.025. Results and surrogate are located here:/scratch2/STI/coastal/Shared/hurricanes/florence_2018_OFCL_korobov_600/setup/ensemble.dir/analyze/linear_k1_p1_n0.025
.
Great! thanks @yichengt900 for looking into it, much appreciated.
This task was created to compare the performance of the probabilistic model with P-Surge. I will close this ticket since we are no longer using this approach for comparison; instead, Convert PSurge to SCHISM-Like .nc #153 was created for comparing probabilistic models.
We'd like to validate the quasi-Monte Carlo sampling approach we take for probabilistic results in ensemble perturbation. To do so, we're going to run a surge-only large ensemble and then compare the results against the other approach with a small member count.