ufs-community / UFS_UTILS

Utilities for the NCEP models.

running outside of NOAA HPC environments (in a docker container) #637

Closed. StevePny closed this issue 2 years ago.

StevePny commented 2 years ago

It seems as if UFS_UTILS has not been run outside of NOAA HPC environments. In order to run locally (in a Docker container), there are a couple of instances where the build expects aprun to be used. First, if APRUN is not provided as an environment variable, fv3gfs_make_grid.sh defaults it to "time" (APRUN=${APRUN:-time}), which causes a problem in bash (where 'time' is a shell built-in, not an executable). To address this, I installed the standalone binary with "apt-get install -y time" and then set APRUN=/usr/bin/time explicitly as an environment variable. However, "aprun" is also hard-coded in the final script, sfc_climo_gen.sh, and there it is not clear how the script should be modified to run.
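
In shell form, the workaround described above is roughly (a sketch for an Ubuntu container; package names may differ elsewhere):

# 'time' is a bash built-in, so install the standalone GNU time binary
apt-get install -y time
# point the scripts at the real executable rather than the built-in
export APRUN=/usr/bin/time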

I tried to hard-code a correction, but it fails with a strange error, even though the task count (6) matches the number of tiles (6):

APRUN_SFC='mpirun --allow-run-as-root -n 6 -N 6'
+ mpirun --allow-run-as-root -n 6 -N 6 /UFS_UTILS/exec/sfc_climo_gen
- NUMBER OF TILES, MODEL GRID IS            6
 - FATAL ERROR: MUST RUN THIS PROGRAM WITH A TASK COUNT THAT
 - IS A MULTIPLE OF THE NUMBER OF TILES.

Also, the fix_orog/ files were not publicly accessible (as far as I could find); I had to track them down via HPC access. I would recommend providing these files on a public ftp site.

GeorgeGayno-NOAA commented 2 years ago

The 'fix_orog' files are publicly available. See our documentation for the community release: https://noaa-emcufs-utils.readthedocs.io/en/ufs-v2.0.0/ufs_utils.html#program-inputs-and-outputs-for-global-applications

GeorgeGayno-NOAA commented 2 years ago

I am puzzled by the error in sfc_climo_gen. It should run with any multiple of six MPI tasks; I usually use 24. But I can try running with six tasks on WCOSS.
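
For example, any multiple of the six tiles should pass the task-count check (a sketch; the executable path matches the one used later in this thread):

# 24 MPI tasks = 4 tasks per tile on a 6-tile global grid
mpirun -np 24 /UFS_UTILS/exec/sfc_climo_gen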

StevePny commented 2 years ago

Hi George, the site you provided is the one I was following. It lists the files needed (see below) but doesn't provide links to them. One of the links on the page might lead to these files, but if so, I haven't been able to find the right one.

Input data:

The “grid” files (CRES_grid.tile#.nc) containing the geo-reference records for the grid - (NetCDF). Created by the make_hgrid or regional_esg_grid programs.

Global 30-arc-second University of Maryland land cover data. Used to create the land-sea mask.
./fix/fix_orog/landcover30.fixed (unformatted binary)

Global 30-arc-second USGS GMTED2010 orography data.
./fix/fix_orog/gmted2010.30sec.int (unformatted binary)

30-arc-second RAMP Antarctic terrain data (Radarsat Antarctic Mapping Project)
./fix/fix_orog/thirty.second.antarctic.new.bin (unformatted binary)

The UFS_UTILS script "fix/link_fixdirs.sh" pulls these from NOAA HPC:

#------------------------------
#--model fix fields
#------------------------------
if [ $machine == "cray" ]; then
    FIX_DIR="/gpfs/hps3/emc/global/noscrub/emc.glopara/git/fv3gfs/fix"
elif [ $machine = "dell" ]; then
    FIX_DIR="/gpfs/dell2/emc/modeling/noscrub/emc.glopara/git/fv3gfs/fix"
elif [ $machine = "hera" ]; then
    FIX_DIR="/scratch1/NCEPDEV/global/glopara/fix"
elif [ $machine = "jet" ]; then
    FIX_DIR="/lfs4/HFIP/hfv3gfs/glopara/git/fv3gfs/fix"
elif [ $machine = "orion" ]; then
    FIX_DIR="/work/noaa/global/glopara/fix"
elif [ $machine = "s4" ]; then
    FIX_DIR="/data/prod/glopara/fix"
fi
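
A hypothetical extension for a non-NOAA machine would add one more branch pointing at a locally staged copy of the fix files (the "docker" machine name and path below are illustrative, not part of the repo):

elif [ $machine = "docker" ]; then
    # hypothetical: fix files downloaded from the public EMC server
    # and staged inside the container
    FIX_DIR="/opt/fix"
fi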

GeorgeGayno-NOAA commented 2 years ago

Sorry. Try these links:

https://ftp.emc.ncep.noaa.gov/static_files/public/UFS/GFS/fix_nco_gfsv16/

https://ftp.emc.ncep.noaa.gov/static_files/public/UFS/GFS/fix_nco_gfsv16/fix_orog/

StevePny commented 2 years ago

Thanks George. Those sites don't have all the necessary files. I found the rest (e.g. vegetation_type.modis.igbp.0.05.nc) here: https://ftp.emc.ncep.noaa.gov/static_files/public/UFS/GFS/fix/fix_sfc_climo/

Could you add these links to the corresponding section 3.6.3 on the https://noaa-emcufs-utils.readthedocs.io website?

I found the solution to the initial MPI problem.

When I run the sfc_climo_gen exe I see:

+ mpirun --allow-run-as-root -np 6 /UFS_UTILS/exec/sfc_climo_gen
 - INITIALIZE ESMF
 - INITIALIZE ESMF
 - INITIALIZE ESMF
 - CALL VMGetGlobal
 - CALL VMGet
 - NPETS IS             1
 - LOCAL PET            0
 - READ SETUP NAMELIST, LOCALPET:            0
 - OPEN MODEL GRID MOSAIC FILE: /tmp/out/C96/C96_mosaic.nc
 - INITIALIZE ESMF
 - CALL VMGetGlobal
 - CALL VMGet
 - NPETS IS             1
 - LOCAL PET            0
 - READ SETUP NAMELIST, LOCALPET:            0
 - OPEN MODEL GRID MOSAIC FILE: /tmp/out/C96/C96_mosaic.nc
 - INITIALIZE ESMF
 - CALL VMGetGlobal
 - CALL VMGet
 - NPETS IS             1
 - LOCAL PET            0
 - READ SETUP NAMELIST, LOCALPET:            0
 - OPEN MODEL GRID MOSAIC FILE: /tmp/out/C96/C96_mosaic.nc
 - CALL VMGetGlobal
 - CALL VMGet
 - NPETS IS             1
 - LOCAL PET            0
 - READ SETUP NAMELIST, LOCALPET:            0
 - OPEN MODEL GRID MOSAIC FILE: /tmp/out/C96/C96_mosaic.nc
 - READ NUMBER OF TILES
 - READ TILE NAMES
 - CALL VMGetGlobal
 - CALL VMGet
 - NPETS IS             1
 - LOCAL PET            0
 - READ SETUP NAMELIST, LOCALPET:            0
 - OPEN MODEL GRID MOSAIC FILE: /tmp/out/C96/C96_mosaic.nc
 - READ NUMBER OF TILES
 - READ TILE NAMES
 - READ NUMBER OF TILES
 - READ TILE NAMES
 - NUMBER OF TILES, MODEL GRID IS            6
 - FATAL ERROR: MUST RUN THIS PROGRAM WITH A TASK COUNT THAT
 - IS A MULTIPLE OF THE NUMBER OF TILES.
 - NUMBER OF TILES, MODEL GRID IS            6
 - FATAL ERROR: MUST RUN THIS PROGRAM WITH A TASK COUNT THAT
 - IS A MULTIPLE OF THE NUMBER OF TILES.
 - NUMBER OF TILES, MODEL GRID IS            6
 - FATAL ERROR: MUST RUN THIS PROGRAM WITH A TASK COUNT THAT
 - IS A MULTIPLE OF THE NUMBER OF TILES.

There is a required option when building ESMF: even though OpenMPI is present, ESMF defaults to a single-processor MPI-bypass mode when built on Linux systems (e.g. in a Linux Docker container):

From their documentation: "ESMF_COMM Possible value: system-dependent ... Alternatively, ESMF comes with a single-processor MPI-bypass library which is the default for Linux and Darwin systems. To force the use of this bypass library set ESMF_COMM equal to 'mpiuni'."

https://earthsystemmodeling.org/docs/release/ESMF_5_2_0/ESMF_usrdoc/node9.html#SECTION00094000000000000000

Also, ESMF must be built with this environment variable to enable netcdf:

ENV ESMF_NETCDF="standard"

Otherwise, I get a cryptic error about IOSTAT 49, which I tracked down with these resources:

https://earthsystemmodeling.org/docs/release/ESMF_5_2_0/ESMC_crefdoc/node9.html#SECTION09020000000000000000
https://github.com/NOAA-EMC/fv3gfs/blob/master/sorc/fv3gfs.fd/NEMS/src/module_MEDIATOR_SpaceWeather.F90
https://earthsystemmodeling.org/docs/release/ESMF_5_2_0/ESMF_usrdoc/node9.html
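
Putting the two settings together, a minimal sketch of the ESMF build environment (shell form; Dockerfile ENV lines are equivalent):

# build ESMF against the installed Open MPI instead of the
# single-processor "mpiuni" bypass that is the default on Linux
export ESMF_COMM=openmpi
# enable ESMF's NetCDF interfaces (without this, the IOSTAT 49 error above appears)
export ESMF_NETCDF=standard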

StevePny commented 2 years ago

I still have a problem with the routine ending prematurely:

 - OPEN SOURCE FILE /fix_sfc_climo/snowfree_albedo.4comp.0.05.nc
 - CALL FieldScatter FOR SOURCE GRID DATA.
 - CALL FieldRegridStore.
 - CALL FieldRegridStore.
 - CALL FieldRegridStore.
 - CALL FieldRegridStore.
 - CALL FieldRegridStore.
 - CALL FieldRegridStore.
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node 549771a6c3cd exited on signal 9 (Killed).
--------------------------------------------------------------------------

@GeorgeGayno-NOAA do you have any idea what this could be? Perhaps a netcdf error?

GeorgeGayno-NOAA commented 2 years ago

> Thanks George. Those sites don't have all the necessary files. I found the rest (e.g. vegetation_type.modis.igbp.0.05.nc) here: https://ftp.emc.ncep.noaa.gov/static_files/public/UFS/GFS/fix/fix_sfc_climo/ […]

> To do: The problem with the platform-dependent "aprun" being hard-coded in the final script sfc_climo_gen.sh still needs to be corrected in the repo. Elsewhere in the scripts this type of command is passed as an environment variable. For now, every time I build I use: sed -i 's/aprun -j 1 -n 6 -N 6/mpirun --allow-run-as-root -n 6/' /UFS_UTILS/ush/sfc_climo_gen.sh

Can you use the APRUN_SFC variable?
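
Assuming sfc_climo_gen.sh honors APRUN_SFC from the environment (as suggested above), a sketch of the sed-free alternative:

# override the MPI launcher instead of patching the script
export APRUN_SFC="mpirun --allow-run-as-root -np 6"
/UFS_UTILS/ush/sfc_climo_gen.sh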

GeorgeGayno-NOAA commented 2 years ago

> I still have a problem with the routine ending prematurely: […] @GeorgeGayno-NOAA do you have any idea what this could be? Perhaps a netcdf error?

Looks like it stops in the ESMF regrid step. Check the ESMF log files. They begin with "PET".

StevePny commented 2 years ago

There is no error message. What should the output look like? This is as far as they get (see below).

In the past, when the process has stopped with no explanation, it has been a memory issue, but I've increased the memory of the Docker container quite a bit and it still halts in the same place.

20220329 025220.269 INFO             PET0 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
20220329 025220.269 INFO             PET0 !!! THE ESMF_LOG IS SET TO OUTPUT ALL LOG MESSAGES !!!
20220329 025220.269 INFO             PET0 !!!     THIS MAY CAUSE SLOWDOWN IN PERFORMANCE     !!!
20220329 025220.269 INFO             PET0 !!! FOR PRODUCTION RUNS, USE:                      !!!
20220329 025220.269 INFO             PET0 !!!                   ESMF_LOGKIND_Multi_On_Error  !!!
20220329 025220.269 INFO             PET0 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
20220329 025220.269 INFO             PET0 Running with ESMF Version   : v8.3.0b11
20220329 025220.269 INFO             PET0 ESMF library build date/time: "Mar 29 2022" "02:20:31"
20220329 025220.269 INFO             PET0 ESMF library build location : /tmp/esmf
20220329 025220.269 INFO             PET0 ESMF_COMM                   : openmpi
20220329 025220.271 INFO             PET0 ESMF_MOAB                   : enabled
20220329 025220.271 INFO             PET0 ESMF_LAPACK                 : enabled
20220329 025220.271 INFO             PET0 ESMF_NETCDF                 : enabled
20220329 025220.271 INFO             PET0 ESMF_PNETCDF                : disabled
20220329 025220.271 INFO             PET0 ESMF_PIO                    : enabled
20220329 025220.271 INFO             PET0 ESMF_YAMLCPP                : enabled

StevePny commented 2 years ago

"Can you use the APRUN_SFC variable?" Yes, I'm not sure how I missed that - I think it is my editor putting dark blue on a black background. That is what I was looking for.

GeorgeGayno-NOAA commented 2 years ago

> There is no error message. What should the output look like? […] I've increased the memory of the Docker container quite a bit and it still halts in the same place.

It looks like it is not an ESMF problem.

I am not familiar with your system. Are you requesting enough wall clock time?

kgerheiser commented 2 years ago

Each process writes its own log file. Do any of the other PET files show an error?

StevePny commented 2 years ago

"I am not familiar with your system. Are you requesting enough wall clock time?" I'm on a Mac M1 running in an ubuntu docker container using arm64. Docker is currently configured with:

[Screenshot: Docker Desktop resource settings]

"Each process writes its own log file. Do any of the other PET files show an error?" Not that I see:

root@1c2934f2fa6a:/UFS_UTILS/ush# cat /rundir/tmp/sfcfields/PET*
20220329 025220.269 INFO             PET0 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
20220329 025220.269 INFO             PET0 !!! THE ESMF_LOG IS SET TO OUTPUT ALL LOG MESSAGES !!!
20220329 025220.269 INFO             PET0 !!!     THIS MAY CAUSE SLOWDOWN IN PERFORMANCE     !!!
20220329 025220.269 INFO             PET0 !!! FOR PRODUCTION RUNS, USE:                      !!!
20220329 025220.269 INFO             PET0 !!!                   ESMF_LOGKIND_Multi_On_Error  !!!
20220329 025220.269 INFO             PET0 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
20220329 025220.269 INFO             PET0 Running with ESMF Version   : v8.3.0b11
20220329 025220.269 INFO             PET0 ESMF library build date/time: "Mar 29 2022" "02:20:31"
20220329 025220.269 INFO             PET0 ESMF library build location : /tmp/esmf
20220329 025220.269 INFO             PET0 ESMF_COMM                   : openmpi
20220329 025220.271 INFO             PET0 ESMF_MOAB                   : enabled
20220329 025220.271 INFO             PET0 ESMF_LAPACK                 : enabled
20220329 025220.271 INFO             PET0 ESMF_NETCDF                 : enabled
20220329 025220.271 INFO             PET0 ESMF_PNETCDF                : disabled
20220329 025220.271 INFO             PET0 ESMF_PIO                    : enabled
20220329 025220.271 INFO             PET0 ESMF_YAMLCPP                : enabled
20220329 025220.253 INFO             PET1 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
20220329 025220.253 INFO             PET1 !!! THE ESMF_LOG IS SET TO OUTPUT ALL LOG MESSAGES !!!
20220329 025220.253 INFO             PET1 !!!     THIS MAY CAUSE SLOWDOWN IN PERFORMANCE     !!!
20220329 025220.253 INFO             PET1 !!! FOR PRODUCTION RUNS, USE:                      !!!
20220329 025220.253 INFO             PET1 !!!                   ESMF_LOGKIND_Multi_On_Error  !!!
20220329 025220.253 INFO             PET1 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
20220329 025220.253 INFO             PET1 Running with ESMF Version   : v8.3.0b11
20220329 025220.253 INFO             PET1 ESMF library build date/time: "Mar 29 2022" "02:20:31"
20220329 025220.253 INFO             PET1 ESMF library build location : /tmp/esmf
20220329 025220.253 INFO             PET1 ESMF_COMM                   : openmpi
20220329 025220.256 INFO             PET1 ESMF_MOAB                   : enabled
20220329 025220.256 INFO             PET1 ESMF_LAPACK                 : enabled
20220329 025220.256 INFO             PET1 ESMF_NETCDF                 : enabled
20220329 025220.256 INFO             PET1 ESMF_PNETCDF                : disabled
20220329 025220.256 INFO             PET1 ESMF_PIO                    : enabled
20220329 025220.256 INFO             PET1 ESMF_YAMLCPP                : enabled
20220329 025220.249 INFO             PET2 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
20220329 025220.249 INFO             PET2 !!! THE ESMF_LOG IS SET TO OUTPUT ALL LOG MESSAGES !!!
20220329 025220.249 INFO             PET2 !!!     THIS MAY CAUSE SLOWDOWN IN PERFORMANCE     !!!
20220329 025220.249 INFO             PET2 !!! FOR PRODUCTION RUNS, USE:                      !!!
20220329 025220.249 INFO             PET2 !!!                   ESMF_LOGKIND_Multi_On_Error  !!!
20220329 025220.249 INFO             PET2 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
20220329 025220.249 INFO             PET2 Running with ESMF Version   : v8.3.0b11
20220329 025220.250 INFO             PET2 ESMF library build date/time: "Mar 29 2022" "02:20:31"
20220329 025220.250 INFO             PET2 ESMF library build location : /tmp/esmf
20220329 025220.250 INFO             PET2 ESMF_COMM                   : openmpi
20220329 025220.253 INFO             PET2 ESMF_MOAB                   : enabled
20220329 025220.253 INFO             PET2 ESMF_LAPACK                 : enabled
20220329 025220.253 INFO             PET2 ESMF_NETCDF                 : enabled
20220329 025220.253 INFO             PET2 ESMF_PNETCDF                : disabled
20220329 025220.253 INFO             PET2 ESMF_PIO                    : enabled
20220329 025220.253 INFO             PET2 ESMF_YAMLCPP                : enabled
20220329 025220.276 INFO             PET3 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
20220329 025220.276 INFO             PET3 !!! THE ESMF_LOG IS SET TO OUTPUT ALL LOG MESSAGES !!!
20220329 025220.276 INFO             PET3 !!!     THIS MAY CAUSE SLOWDOWN IN PERFORMANCE     !!!
20220329 025220.276 INFO             PET3 !!! FOR PRODUCTION RUNS, USE:                      !!!
20220329 025220.276 INFO             PET3 !!!                   ESMF_LOGKIND_Multi_On_Error  !!!
20220329 025220.276 INFO             PET3 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
20220329 025220.276 INFO             PET3 Running with ESMF Version   : v8.3.0b11
20220329 025220.276 INFO             PET3 ESMF library build date/time: "Mar 29 2022" "02:20:31"
20220329 025220.276 INFO             PET3 ESMF library build location : /tmp/esmf
20220329 025220.276 INFO             PET3 ESMF_COMM                   : openmpi
20220329 025220.278 INFO             PET3 ESMF_MOAB                   : enabled
20220329 025220.278 INFO             PET3 ESMF_LAPACK                 : enabled
20220329 025220.278 INFO             PET3 ESMF_NETCDF                 : enabled
20220329 025220.278 INFO             PET3 ESMF_PNETCDF                : disabled
20220329 025220.278 INFO             PET3 ESMF_PIO                    : enabled
20220329 025220.278 INFO             PET3 ESMF_YAMLCPP                : enabled
20220329 025220.261 INFO             PET4 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
20220329 025220.261 INFO             PET4 !!! THE ESMF_LOG IS SET TO OUTPUT ALL LOG MESSAGES !!!
20220329 025220.261 INFO             PET4 !!!     THIS MAY CAUSE SLOWDOWN IN PERFORMANCE     !!!
20220329 025220.261 INFO             PET4 !!! FOR PRODUCTION RUNS, USE:                      !!!
20220329 025220.261 INFO             PET4 !!!                   ESMF_LOGKIND_Multi_On_Error  !!!
20220329 025220.261 INFO             PET4 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
20220329 025220.261 INFO             PET4 Running with ESMF Version   : v8.3.0b11
20220329 025220.261 INFO             PET4 ESMF library build date/time: "Mar 29 2022" "02:20:31"
20220329 025220.261 INFO             PET4 ESMF library build location : /tmp/esmf
20220329 025220.261 INFO             PET4 ESMF_COMM                   : openmpi
20220329 025220.265 INFO             PET4 ESMF_MOAB                   : enabled
20220329 025220.265 INFO             PET4 ESMF_LAPACK                 : enabled
20220329 025220.265 INFO             PET4 ESMF_NETCDF                 : enabled
20220329 025220.265 INFO             PET4 ESMF_PNETCDF                : disabled
20220329 025220.265 INFO             PET4 ESMF_PIO                    : enabled
20220329 025220.265 INFO             PET4 ESMF_YAMLCPP                : enabled
20220329 025220.257 INFO             PET5 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
20220329 025220.257 INFO             PET5 !!! THE ESMF_LOG IS SET TO OUTPUT ALL LOG MESSAGES !!!
20220329 025220.257 INFO             PET5 !!!     THIS MAY CAUSE SLOWDOWN IN PERFORMANCE     !!!
20220329 025220.257 INFO             PET5 !!! FOR PRODUCTION RUNS, USE:                      !!!
20220329 025220.257 INFO             PET5 !!!                   ESMF_LOGKIND_Multi_On_Error  !!!
20220329 025220.257 INFO             PET5 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
20220329 025220.257 INFO             PET5 Running with ESMF Version   : v8.3.0b11
20220329 025220.257 INFO             PET5 ESMF library build date/time: "Mar 29 2022" "02:20:31"
20220329 025220.257 INFO             PET5 ESMF library build location : /tmp/esmf
20220329 025220.257 INFO             PET5 ESMF_COMM                   : openmpi
20220329 025220.259 INFO             PET5 ESMF_MOAB                   : enabled
20220329 025220.259 INFO             PET5 ESMF_LAPACK                 : enabled
20220329 025220.259 INFO             PET5 ESMF_NETCDF                 : enabled
20220329 025220.259 INFO             PET5 ESMF_PNETCDF                : disabled
20220329 025220.259 INFO             PET5 ESMF_PIO                    : enabled
20220329 025220.259 INFO             PET5 ESMF_YAMLCPP                : enabled

StevePny commented 2 years ago

Ok, solved. It turns out it was in fact a resource issue; I just hadn't increased the memory enough. I maxed it out at 32 GB, and the whole thing runs to completion:

[Screenshot: sfc_climo_gen run completing successfully]

finishing with:

+ mv vegetation_type.tile1.nc /rundir/out/C96/fix_sfc/C96.vegetation_type.tile1.nc
+ for files in *.nc
+ [[ -f vegetation_type.tile2.nc ]]
+ mv vegetation_type.tile2.nc /rundir/out/C96/fix_sfc/C96.vegetation_type.tile2.nc
+ for files in *.nc
+ [[ -f vegetation_type.tile3.nc ]]
+ mv vegetation_type.tile3.nc /rundir/out/C96/fix_sfc/C96.vegetation_type.tile3.nc
+ for files in *.nc
+ [[ -f vegetation_type.tile4.nc ]]
+ mv vegetation_type.tile4.nc /rundir/out/C96/fix_sfc/C96.vegetation_type.tile4.nc
+ for files in *.nc
+ [[ -f vegetation_type.tile5.nc ]]
+ mv vegetation_type.tile5.nc /rundir/out/C96/fix_sfc/C96.vegetation_type.tile5.nc
+ for files in *.nc
+ [[ -f vegetation_type.tile6.nc ]]
+ mv vegetation_type.tile6.nc /rundir/out/C96/fix_sfc/C96.vegetation_type.tile6.nc
+ exit 0
+ err=0
+ '[' 0 '!=' 0 ']'
+ '[' uniform = regional_gfdl ']'
+ '[' uniform = regional_esg ']'
+ '[' uniform = nest ']'
+ exit
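
For anyone hitting the same wall: the memory was raised here through Docker Desktop's resource settings, but the equivalent CLI sketch (standard docker run flags) would be:

# grant the container up to 32 GB of memory
docker run -it --memory=32g --memory-swap=32g ubuntu /bin/bash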

GeorgeGayno-NOAA commented 2 years ago

@StevePny Is everything working for you? Can we close this issue?