ufs-community / ufs-weather-model

UFS Weather Model
Other
130 stars 238 forks source link

add control_c384gdas_wav and control_atmwav tests #620

Closed aliabdolali closed 2 years ago

aliabdolali commented 3 years ago

This pull request updates the control tests:

The new grids, mod_defs and rmp_src* for gfsv16 are added to the input-data-20210614/WW3_input_data_20210621 on hera and orion.

new baselines are needed.

Testing

How were these changes tested? What compilers / HPCs was it tested with? Are the changes covered by regression tests? (If not, why? Do new tests need to be added?) Have regression tests and unit tests (utests) been run? On which platforms and with which compilers? (Note that unit tests can only be run on tier-1 platforms)

DeniseWorthen commented 2 years ago

@aliabdolali Please remove the additional control_gdas_wav.nml.IN and instead use a variable in the default test control_gdas.nml.IN

+  cplwav       = @[CPLWAV]

CPLWAV is false by default, so in your new test control_c384gdas_wav set CPLWAV=.true. and change the namelist name that the test uses.

aliabdolali commented 2 years ago

@DeniseWorthen I am working on it, thanks

aliabdolali commented 2 years ago

@DeniseWorthen I did the suggested changes and tested them.

DeniseWorthen commented 2 years ago

changes look good.

DeniseWorthen commented 2 years ago

@aliabdolali We may be moving the commit date for this PR up, depending on the status of Moorthi's PR. Have you added the WW3_input_data_20210518 to additional platforms ?

Which platforms do you not have access to that you would need me to do that?

aliabdolali commented 2 years ago

@DeniseWorthen I was waiting for ufs-weather-model team to give me the time to start running final test and add log files. I have the inputs on hera and can add them to orion and run on these 2 platforms. I also got WCOSS access recently, but have not run anything on it, and probably do not have write permission? Could you tell me if this is the time to start, so I can copy inputs to orion, and run on hera and orion with intel?

aliabdolali commented 2 years ago

@DeniseWorthen On another note, since we have two new tests, do I need to create a new baseline? rt.sh -c -e rt.conf?

DeniseWorthen commented 2 years ago

We need to have the data added across all platforms before we begin testing but we should do it ahead of time so we're ready to go. I won't know exactly when we can start the commit process until later today---it depends on how far Moorthi gets on resolving the issues with his PR.

I will add the needed WW3 input across platforms since you have limited access. There is no wcoss-dell access right now so that will need to wait.

For the new baseline, yes, we will need to create a new baseline for that single test. Since you're updating WW3 also it might be best to create all new baselines even though you expect no change from the WW3 change alone.

aliabdolali commented 2 years ago

@DeniseWorthen I just transferred the inputs to Orion /work/noaa/nems/emc.nemspara/RT/NEMSfv3gfs/input-data-20210614/WW3_input_data_20210621

DeniseWorthen commented 2 years ago

@aliabdolali Please update the BL_DATE in rt.sh to today's date and commit that change. We we will use the auto-BL creation feature; WCOSS needs to be run manually but we will need to wait for access.

BrianCurtis-NOAA commented 2 years ago

Machine: hera Compiler: gnu Job: BL Repo location: /scratch1/NCEPDEV/nems/emc.nemspara/autort/pr/661779909/20210623200042/ufs-weather-model Please make changes and add the following label back: hera-gnu-BL

BrianCurtis-NOAA commented 2 years ago

Machine: hera Compiler: intel Job: BL Repo location: /scratch1/NCEPDEV/nems/emc.nemspara/autort/pr/661779909/20210623200529/ufs-weather-model Please manually delete: /scratch1/NCEPDEV/stmp2/emc.nemspara/FV3_RT/rt_293168 Test control_c384gdas_wav 081 failed failed Test control_c384gdas_wav 081 failed in run_test failed Test control_atmwav 080 failed in check_result failed Test control_atmwav 080 failed in run_test failed Please make changes and add the following label back: hera-intel-BL

DeniseWorthen commented 2 years ago

@aliabdolali Please check the run directory /scratch1/NCEPDEV/stmp2/emc.nemspara/FV3_RT/rt_293168 for the failures. The 384gdas test failed with

312:
312:                     *** WAVEWATCH III Multi-grid shell ***
312:                =================================================
312:
321:
321: EXTCDE MPI_ABORT, IEXIT=  2000

The control_atmwav test failed because it did not find these files:

 Moving out_grd.glo_1deg .........NOT OK. Missing  /scratch1/NCEPDEV/stmp2/emc.nemspara/FV3_RT/rt_293168/control_atmwav/out_grd.glo_1deg
 Moving out_pnt.points .........NOT OK. Missing  /scratch1/NCEPDEV/stmp2/emc.nemspara/FV3_RT/rt_293168/control_atmwav/out_pnt.points
DeniseWorthen commented 2 years ago

@aliabdolali Let me know when you have finished your fixes. I see you've made a change for the control_atmwav failure.

aliabdolali commented 2 years ago

@DeniseWorthen bot tests are fixed now. for the c384gdas_wave test, I had to revert export CPLWAV=.T., otherwise, the mod_def files could not be copied over. Where can we fix it? in fv3_conf? I did not touch other parts of the code, to avoid crashes for other CPLWAV cases. In the future, if we fix export CPLWAV=.true., then we can use it.

DeniseWorthen commented 2 years ago

Are you saying that you needed cplwav to be ".true." and not ".T."? We can fix this tomorrow. The baselines for the remaining tests completed so I can move those and we'll just create the two failed tests manually.

aliabdolali commented 2 years ago

@DeniseWorthen no, there is no problem when it is cplwav=.T., but fails if I use cplwav=.true. The test was running fine until yesterday with cplwav=.T. From a conversation we had yesterday, you mentioned .true. is preferred to .T., so I removed cplwav=.T. And the problem is that it does not copy mod_def files from the input directory. My guess is that we need a fix somewhere in _test/fv3conf to handle T or true equally. Does it make sense?

DeniseWorthen commented 2 years ago

Yes, I realize that I asked you to switch to true. We should use the .T. because that is how the fv3_conf expects it. Sorry about that.

For the files to compare, you're not comparing any wave restart file, only wave output?

aliabdolali commented 2 years ago

@DeniseWorthen for the current setting, I have used the ww3_multi template, which does not write restart. @JessicaMeixner-NOAA do we want the restart files? then I can modify it and add the restart at last time step to the comparison list. It is a simple change.

JessicaMeixner-NOAA commented 2 years ago

@aliabdolali see these lines: https://github.com/ufs-community/ufs-weather-model/blob/develop/tests/parm/ww3_multi.inp.IN#L324-L325 if you set the variables as needed you should be able to write restarts if desired.

DeniseWorthen commented 2 years ago

Two items:

One, I do think we want the comparison files to be the restart files, and not the output files.

Second, the tests that you are adding/modifying are not starting from a WW3 ICs, is that correct? In other words, WW3 starts "cold" in these two tests.

aliabdolali commented 2 years ago

@DeniseWorthen I can take out the output files from the list, just let me know. Yes, both start from cold.

DeniseWorthen commented 2 years ago

Great. Could you remove the output files from comparison but add the restart files (making whatever change @JessicaMeixner-NOAA is suggesting to create the restart) and then go ahead and create the baselines on hera for the two failed tests. I've copied all the other baselines already, so once we add those two I can add the auto-RT label.

aliabdolali commented 2 years ago

@DeniseWorthen Both tests are updated and the base lines are in /scratch1/NCEPDEV/stmp4/Ali.Abdolali/FV3_RT/REGRESSION_TEST_INTEL/ Do I need to copy them or you will take care of them?

DeniseWorthen commented 2 years ago

I will copy and then add the auto-rt label for hera.intel.

DeniseWorthen commented 2 years ago

@aliabdolali Would you please commit a change in rt.sh. Your RT just passed so it won't impact this PR, but we want to set it temporarily to batch on hera.

diff --git a/tests/rt.sh b/tests/rt.sh
index f770a1d..9d93855 100755
--- a/tests/rt.sh
+++ b/tests/rt.sh
@@ -237,7 +237,7 @@ elif [[ $MACHINE_ID = hera.* ]]; then
   ECFLOW_START=/scratch1/NCEPDEV/nems/emc.nemspara/soft/miniconda3/bin/ecflow_start.sh
   ECF_PORT=$(( $(id -u) + 1500 ))

-  QUEUE=debug
+  QUEUE=batch
   COMPILE_QUEUE=batch
aliabdolali commented 2 years ago

@aliabdolali Would you please commit a change in rt.sh. Your RT just passed so it won't impact this PR, but we want to set it temporarily to batch on hera.

diff --git a/tests/rt.sh b/tests/rt.sh
index f770a1d..9d93855 100755
--- a/tests/rt.sh
+++ b/tests/rt.sh
@@ -237,7 +237,7 @@ elif [[ $MACHINE_ID = hera.* ]]; then
   ECFLOW_START=/scratch1/NCEPDEV/nems/emc.nemspara/soft/miniconda3/bin/ecflow_start.sh
   ECF_PORT=$(( $(id -u) + 1500 ))

-  QUEUE=debug
+  QUEUE=batch
   COMPILE_QUEUE=batch

Done

BrianCurtis-NOAA commented 2 years ago

Machine: gaea Compiler: intel Job: BL Repo location: /lustre/f2/pdata/ncep/emc.nemspara/autort/pr/661779909/20210624173009/ufs-weather-model Please manually delete: /lustre/f2/scratch/emc.nemspara/FV3_RT/rt_33278 Test control_c384gdas_wav 077 failed failed Test control_c384gdas_wav 077 failed in run_test failed Please make changes and add the following label back: gaea-intel-BL

DeniseWorthen commented 2 years ago

On gaea, the err file shows:

srun: error: nid01137: tasks 378-383,385-411: Killed
srun: Terminating job step 268839422.0
srun: error: nid01135: tasks 310-311: Killed
slurmstepd: error: *** STEP 268839422.0 ON nid00728 CANCELLED AT 2021-06-24T18:24:16 ***

I'm re-trying creating that baseline manually.

BrianCurtis-NOAA commented 2 years ago

Machine: cheyenne Compiler: intel Job: BL Repo location: /glade/scratch/dtcufsrt/autort/tests/auto/pr/661779909/20210624114512/ufs-weather-model Please manually delete: /glade/scratch/dtcufsrt/FV3_RT/rt_65121 Test control_c384gdas_wav 079 failed in run_test failed Please make changes and add the following label back: cheyenne-intel-BL

BrianCurtis-NOAA commented 2 years ago

Machine: cheyenne Compiler: gnu Job: BL Repo location: /glade/scratch/dtcufsrt/autort/tests/auto/pr/661779909/20210624113012/ufs-weather-model Please manually delete: /glade/scratch/dtcufsrt/FV3_RT/rt_20543 Baseline creation and move successful Repo location: /glade/scratch/dtcufsrt/autort/tests/auto/pr/661779909/20210624115814/ufs-weather-model Please manually delete: /glade/scratch/dtcufsrt/FV3_RT/rt_38934 Test control_thompson_debug 020 failed in run_test failed Test control_ras_debug 024 failed in run_test failed Please make changes and add the following label back: cheyenne-gnu-BL

DeniseWorthen commented 2 years ago

On Cheyenne, the control_ras_debug timed out w/o even starting up. The control_thompson_debug failed with:

137:mp_thompson_post_run: ttendlim applied 2/4064 times at timestep 1
144:
144:Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
144:
DeniseWorthen commented 2 years ago

@aliabdolali I was able to compile and run the control_c384gdas_wav on gaea but it timed out before the wave restarts were written. Are you sure you've set the resources correctly for all platforms? Can the test length be shortened ?

For Cheyenne Intel, I copied your run directory and resubmitted. The job again failed with the same error: MPT: shepherd terminated: r14i5n23.ib0.cheyenne.ucar.edu - job aborting

I believe that on Cheyenne, this usually indicates a system issue but I'm surprised running a second time would not have resolved that.

The Cheyenne gnu failure must be unrelated to your PR since we don't run ATMW with GNU.

aliabdolali commented 2 years ago

@DeniseWorthen I did not have access to Gaea, but we might need to adduct something for it in tests/tests/control_c384gdas_wav

export TASKS=480
if [[ $MACHINE_ID = cheyenne.* ]]; then
  export TPN=36
elif [[ $MACHINE_ID = hera.* ]]; then
  export TPN=20
  export THRD=2
elif [[ $MACHINE_ID = wcoss_cray ]]; then
  export TPN=6
else
  export TPN=12
fi

Do you have any suggestions?

BrianCurtis-NOAA commented 2 years ago

Machine: jet Compiler: intel Job: BL Repo location: /lfs4/HFIP/h-nems/emc.nemspara/autort/pr/661779909/20210624173014/ufs-weather-model Please manually delete: /lfs4/HFIP/h-nems/emc.nemspara/RT_RUNDIRS/emc.nemspara/FV3_RT/rt_81717 Test control_c384gdas_wav 078 failed failed Test control_c384gdas_wav 078 failed in run_test failed Please make changes and add the following label back: jet-intel-BL

DeniseWorthen commented 2 years ago

@aliabdolali I do not enough about the resources this test might require. I don't think platform-specific adjustments like the one you show are made in most tests. You can look in default_vars for the resources available on each platform.

I believe you said you have tested on both Orion and Hera. Have you been able to test on either dell-p3 or jet? We do not run the wave tests on wcoss-cray. @JessicaMeixner-NOAA do you have any insight to offer here?

JessicaMeixner-NOAA commented 2 years ago

So this test for the wave side is a little bit memory intensive so, running with 2 threads on all the machines is likely a good idea, especially if it's not running to completion. If you've optimized it and checked that you are in balance on both hera and orion, you can generally make good educated guesses based on the resources @DeniseWorthen pointed you to for the other machines.

DeniseWorthen commented 2 years ago

To summarize at this point, the control_c384gdas_wav baseline has not been created on Cheyenne.intel, Jet.intel and Gaea.intel but otherwise the develop-20210623 is fully populated.

No WCOSS has been run yet (but we don't run wave on cray).

Hera (intel,gnu), Cheyenne.gnu and Orion are done.

The control_c384gdas is not run on Gaea, so there may be a reason not to run the control_c384gdas_wav on Gaea.

JessicaMeixner-NOAA commented 2 years ago

If control_c384gdas is not run on Gaea, seems reasonable to not run control_c384gdas_wav on Gaea although I don't know why the atm-only is not run on gaea to know for sure. Since @aliabdolali does not have gaea access, I can try to run some tests to help figure that out, but that's not likely going to happen quickly as others will want in terms of holding up the queue.

DeniseWorthen commented 2 years ago

@aliabdolali Jun noticed that the job cards on gaea and jet may need to be changed to allow the nodes and tasks-per-node setting. WCOSS is open now, and I will transfer the new WW3 data and start the baseline creations there in the meantime.

JessicaMeixner-NOAA commented 2 years ago

@DeniseWorthen I can help with the gaea testing. I theoretically have access to jet but not sure if I have projects on there right now. Should I wait for updates to the gaea job card or go ahead and work on testing/optimizing it there? Did we ever find out why the control_c384gdas is not run on gaea?

DeniseWorthen commented 2 years ago

Thanks @JessicaMeixner-NOAA let's hold off for the time being. Jun has tests running on Gaea and Jet. I believe the existing control_c384gdas is not being run on Gaea for the same reason.

junwang-noaa commented 2 years ago

@DeniseWorthen @aliabdolali We need following changes on gaea: tests/fv3_conf/fv3_slurm.IN_gaea

 @@ -5,7 +5,8 @@
#SBATCH --account=@[ACCNR]
 #SBATCH --qos=@[QUEUE]
 #SBATCH --clusters=@[PARTITION]
-#SBATCH --ntasks=@[TASKS]
+#SBATCH --nodes=@[NODES]
+#SBATCH --ntasks-per-node=@[TPN]

@@ -28,7 +29,7 @@ export NC_BLKSZ=1M
 sync && sleep 1

-srun ./fv3.exe
+srun --label -n @[TASKS] ./fv3.exe

With the changes, you can turn on control_c384 and control_c384gdas and control_c384gdas_wav is running to completion.

junwang-noaa commented 2 years ago

Sorry, forgot the change in tests/default_vars.sh:

@@ -242,7 +242,7 @@ elif [[ $MACHINE_ID = gaea.* ]]; then

   TASKS_dflt=150 ; TPN_dflt=36 ; INPES_dflt=3 ; JNPES_dflt=8
   TASKS_thrd=78  ; TPN_thrd=18 ; INPES_thrd=3 ; JNPES_thrd=4
-  TASKS_c384=480 ; TPN_c384=36 ; INPES_c384=12 ; JNPES_c384=6
+  TASKS_c384=480 ; TPN_c384=18 ; INPES_c384=12 ; JNPES_c384=6
DeniseWorthen commented 2 years ago

@aliabdolali Please make the following change to rt.conf. This will turn on the control_c384gdas test on gaea and switch the two atm+wav tests to 32bit.

Switching to 32bit will mean we will need to create new baselines on orion and hera for the two affected tests because answers will change. Can you do that?

I will do gaea.

diff --git a/tests/rt.conf b/tests/rt.conf
index cbbf598..85777f7 100644
--- a/tests/rt.conf
+++ b/tests/rt.conf
@@ -44,7 +44,7 @@ RUN     | control_CubedSphereGrid
 RUN     | control_wrtGauss_netcdf_parallel                                                                                        |                                         | fv3 |
 RUN     | control_c192                                                                                                            |                                         | fv3 |
 RUN     | control_c384                                                                                                            | - gaea.intel wcoss_cray                 | fv3 |
-RUN     | control_c384gdas                                                                                                        | - gaea.intel wcoss_cray                 | fv3 |
+RUN     | control_c384gdas                                                                                                        | - wcoss_cray                            | fv3 |
 RUN     | control_stochy                                                                                                          |                                         | fv3 |
 RUN     | control_ca                                                                                                              |                                         | fv3 |
 RUN     | control_lndp                                                                                                            |                                         | fv3 |
@@ -185,6 +185,6 @@ RUN     | datm_cdeps_debug_cfsr
 # ATM-WAV tests                                                                                                                                                                   #
 ###################################################################################################################################################################################

-COMPILE | APP=ATMW SUITES=FV3_GFS_v16                                                                                           | - wcoss_cray  | fv3 |
+COMPILE | APP=ATMW SUITES=FV3_GFS_v16 32BIT=Y                                                                                   | - wcoss_cray  | fv3 |
 RUN     | control_atmwav                                                                                                        | - wcoss_cray  | fv3 |
 RUN     | control_c384gdas_wav                                                                                                  | - wcoss_cray  | fv3 |
junwang-noaa commented 2 years ago

Well, I had a typo in jet job card, I have to resubmit the tests. Below are the code changes, similar to those on gaea:

tests/fv3_conf/fv3_slurm.IN_jet
@@ -4,7 +4,8 @@
 #SBATCH --account=@[ACCNR]
 #SBATCH --partition=@[PARTITION]
 #SBATCH --qos=@[QUEUE]
-#SBATCH --ntasks=@[TASKS]
+#SBATCH --nodes=@[NODES]
+#SBATCH --ntasks-per-node=@[TPN]
 #SBATCH --time=@[WLCLK]
 #SBATCH --job-name="@[JBNME]"
 ### #SBATCH --exclusive
@@ -32,7 +33,7 @@ export PSM_SHAREDCONTEXTS=1
 # Avoid job errors because of filesystem synchronization delays
 sync && sleep 1

-srun ./fv3.exe
+srun --label -n @[TASKS] ./fv3.exe

tests/default_vars.sh
@@ -190,7 +190,7 @@ elif [[ $MACHINE_ID = jet.* ]]; then

   TASKS_dflt=150 ; TPN_dflt=24 ; INPES_dflt=3 ; JNPES_dflt=8
   TASKS_thrd=78  ; TPN_thrd=12 ; INPES_thrd=3 ; JNPES_thrd=4
-  TASKS_c384=480 ; TPN_c384=24 ; INPES_c384=12 ; JNPES_c384=6
+  TASKS_c384=480 ; TPN_c384=12 ; INPES_c384=12 ; JNPES_c384=6
aliabdolali commented 2 years ago

@DeniseWorthen here is the path to the new bases for two new tests with 32BIT=Y on hera _/scratch1/NCEPDEV/stmp4/Ali.Abdolali/FV3_RT/REGRESSION_TESTINTEL could you copy them over. orion is still running...

DeniseWorthen commented 2 years ago

yes I will copy.

DeniseWorthen commented 2 years ago

@aliabdolali Can you make this change for Cheyenne:

diff --git a/tests/tests/control_c384gdas_wav b/tests/tests/control_c384gdas_wav
index a51a5fe..0221b5c 100644
--- a/tests/tests/control_c384gdas_wav
+++ b/tests/tests/control_c384gdas_wav
@@ -56,7 +56,7 @@ export_fv3

 export TASKS=480
 if [[ $MACHINE_ID = cheyenne.* ]]; then
-  export TPN=36
+  export TPN=18
 elif [[ $MACHINE_ID = hera.* ]]; then
   export TPN=20
   export THRD=2
aliabdolali commented 2 years ago

@DeniseWorthen here are the bases on orion _/work/noaa/stmp/aabdolal/stmp/aabdolal/FV3_RT/REGRESSION_TESTINTEL