ufs-community / ufs-mrweather-app

UFS Medium-Range Weather Application

Test ERS_Lh11.C96.GFSv15p2.cheyenne_intel FAILS in restart comparison #62

Closed jedwards4b closed 4 years ago

jedwards4b commented 4 years ago

This test indicates that restarts are not producing bit-for-bit (bfb) results under cime testing.

The file comparisons show:

run/ERS_Lh11.C96.GFSv15p2.cheyenne_intel.20200116_085748_nk2uyu.ufsatm.atm.f011.nc.base.cprnc.out: of which 14 had non-zero differences
run/ERS_Lh11.C96.GFSv15p2.cheyenne_intel.20200116_085748_nk2uyu.ufsatm.sfc.f011.nc.base.cprnc.out: of which 125 had non-zero differences
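For anyone triaging many such files, here is a minimal sketch of pulling the count out of a cprnc summary. It assumes only the phrase quoted above ("of which N had non-zero differences"); the function name is made up for illustration, and the rest of the cprnc output format is not assumed.

```python
import re

# Matches the summary phrase quoted in this issue; nothing else about
# the cprnc output layout is assumed.
DIFF_RE = re.compile(r"of which\s+(\d+)\s+had non-zero differences")


def count_nonzero_diffs(cprnc_text):
    """Return the number of fields with non-zero differences,
    or None if the summary phrase is not present."""
    match = DIFF_RE.search(cprnc_text)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    sample = "of which 14 had non-zero differences"
    print(count_nonzero_diffs(sample))
```

A result of 0 would mean the compared files agree field by field; None means the phrase was not found and the file should be inspected by hand.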

climbfuji commented 4 years ago

I am currently working on getting the restart tests into the rt.sh regression test system, also because this was reported independently in https://github.com/NOAA-EMC/fv3atm/issues/42. It would make sense to wait for the rt.sh based tests to be implemented before spending more time on this.

pjpegion commented 4 years ago

@climbfuji Please let me know if you want me to do anything.

climbfuji commented 4 years ago

I can confirm that with the namelist settings in the ufs_public_release branches for the GFS_v15p2 tests, the restarts do not work. I am now trying to fix this; I have a few ideas about what may differ from the tests that we know are b4b reproducible in restart runs.

climbfuji commented 4 years ago

@jedwards4b @mcgibbon I have a solution for this (tested on my Mac for GFSv15p2 thus far). The default namelist settings for both GFSv15p2 and GFSv16beta in the ufs_public_release branch of the ufs-weather-model repository turn on skeb, shum and sppt. The stochastic physics do not reproduce in restart runs, because the logic for dealing with restarts hasn't been implemented in the stochastic_physics repo (@pjpegion) and the model isn't writing those fields to the restart files (@DusanJovic-NOAA @junwang-noaa). My suggestion for the public release is to (a) turn off stochastic physics in the default namelists (Phil suggested this anyway, but I missed it) and (b) document that using the stochastic perturbations is an advanced feature that currently does not support b4b identical results through restarts (@ligiabernardet). For our development branches, we need to implement this capability in stochastic_physics and fv3atm in the near future. Any objections?
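As a toy illustration of why state missing from the restart files breaks b4b reproducibility, consider the sketch below. It is not the stochastic_physics or fv3atm code; the model, the seed, and the function name are invented for this example. A continuous run is compared with two restarted runs, one that saves only the model variable and one that also saves the state of the perturbation generator.

```python
import random


def run(steps, x=0.0, rng=None):
    """Advance a toy model: each step damps x and adds a random perturbation."""
    if rng is None:
        rng = random.Random(42)  # hypothetical fixed seed used at model start
    for _ in range(steps):
        x = 0.5 * x + rng.random()
    return x, rng


# Continuous 4-step run.
x_full, _ = run(4)

# First half of a restarted run; keep the generator state for later.
x_half, rng_half = run(2)
saved_state = rng_half.getstate()

# Restart that stores only x: the generator is re-seeded and replays
# its first two draws instead of continuing the sequence.
x_bad, _ = run(2, x=x_half)

# Restart that also stores the generator state.
rng_restored = random.Random()
rng_restored.setstate(saved_state)
x_good, _ = run(2, x=x_half, rng=rng_restored)

print("state saved:     b4b =", x_full == x_good)
print("state not saved: b4b =", x_full == x_bad)
```

The restart that carries the generator state forward repeats exactly the same arithmetic as the continuous run, which is the property an ERS-style test checks.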

jedwards4b commented 4 years ago

I believe that we have already made this change for cime tests and it still fails.

ligiabernardet commented 4 years ago

The default configurations for this release are with all stochastic processes turned off. @climbfuji: can you produce b4b restarts with stochastics off?

uturuncoglu commented 4 years ago

@jedwards4b Yes, I confirm that. We turned off stochastic physics, but it was still not b4b.

climbfuji commented 4 years ago

On my Mac, I am getting b4b identical results w/o stochastic physics. Now testing on Cheyenne with Intel.

climbfuji commented 4 years ago

Just to make sure: are you also modifying the nstf_name namelist entry for the restart runs? The usual regression tests for ufs-weather-model use 2,1,1,0,5 for coldstarts. When restarting, one needs to set the second entry (the first 1) to 0 (that is the NSST spinup flag, one of the "hidden features" - don't blame me). The input.nml we got from EMC uses 2,1,0,0,0 for coldstarts. I am now testing whether 2,0,0,0,0 works for restarts or whether we need to switch to "2,1,1,0,5" and "2,0,1,0,5". Just be patient, please.
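The coldstart-to-restart change can be sketched as a small helper. This is only an illustration based on the setting pairs quoted in this thread (2,1,1,0,5 becoming 2,0,1,0,5 and 2,1,0,0,0 becoming 2,0,0,0,0); the function name is hypothetical and the meaning of the other entries is not assumed.

```python
def restart_nstf_name(coldstart):
    """Given a coldstart nstf_name string such as "2,1,1,0,5", return the
    restart variant with the second entry (the spinup flag, per this
    thread) set to 0."""
    values = [int(v) for v in coldstart.split(",")]
    values[1] = 0  # second entry: NSST spinup flag
    return ",".join(str(v) for v in values)


if __name__ == "__main__":
    for setting in ("2,1,1,0,5", "2,1,0,0,0"):
        print(setting, "->", restart_nstf_name(setting))
```

Whether a given configuration needs any other namelist changes for restarts is outside what this sketch covers.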

junwang-noaa commented 4 years ago

Dom,

This feature is not hidden, please see the document:

https://vlab.ncep.noaa.gov/redmine/projects/comfv3/wiki/_set_up_restart_run_for_FV3GFS_

climbfuji commented 4 years ago

I agree, Jun, it is not hidden to people who have access to Vlab. I am not sure if it is in the ufs-weather-model documentation for the release (I am lost wrt documentation) and I am not sure if the CIME folks know about it ... let's wait to hear from them!

uturuncoglu commented 4 years ago

@climbfuji I have already done that. My previous tests are in:

Base run: /glade/scratch/turuncu/ufs-mrweather-app-workflow.c96.base/run
nstf_name = 2, 1, 0, 0, 0

Restart run: /glade/scratch/turuncu/ufs-mrweather-app-workflow.c96.rest/run
nstf_name = 2, 0, 0, 0, 0

By default, stochastic physics is off.

climbfuji commented 4 years ago

That's good to know, thanks. If that fails, I will test the default 2,1,1,0,5 settings. Just wait, please.

mcgibbon commented 4 years ago

@climbfuji when you get things working, could you attach an input.nml which is working locally for you? I'd like to test it on my system. I would just ask which options disable skeb, shum, and sppt, but I can see those are disabled in my log file. I'd like to glance at whatever else might be different in my set-up.

climbfuji commented 4 years ago

Sure. But note, everyone, that I will be taking this weekend off (definitely Sunday and Monday), so please don't expect any answers before Tuesday. Thanks ...

climbfuji commented 4 years ago

Everyone, please see here https://github.com/NOAA-EMC/fv3atm/issues/42 for the solution/namelists/... Thanks!

jedwards4b commented 4 years ago

@climbfuji I tried the cime test with these changes, and it still fails. I am using nstf_name = 2, 1, 1, 0, 5 for the initial run and nstf_name = 2, 0, 1, 0, 5 for the restart run. I also updated the stochastic physics source. My source tree is /glade/u/home/jedwards/sandboxes/ufs-mrweather-app and the test is in /glade/scratch/jedwards/ERS_Lh11.C96.GFSv15p2.cheyenne_intel.20200119_103112_odlpjt

climbfuji commented 4 years ago

I don't think I have the time to look at the differences between your runs and mine today. Here is a copy of all the directories you need on Cheyenne:

/glade/work/heinzell/fv3/rundirs_for_cime_restart_issues/

You will be interested in the following directories:

fv3_ccpp_gfs_v15p2_prod # 0-48h fcst
fv3_ccpp_gfs_v15p2_coldstart_prod # 0-24h fcst
fv3_ccpp_gfs_v15p2_restart_prod # 24-48h fcst, restart files from coldstart run were copied into INPUT

fv3_ccpp_gfs_v16beta_prod # 0-48h fcst
fv3_ccpp_gfs_v16beta_coldstart_prod # 0-24h fcst
fv3_ccpp_gfs_v16beta_restart_prod # 24-48h fcst, restart files from coldstart run were copied into INPUT

climbfuji commented 4 years ago

I am beginning to wonder if this is related to the debug-run problems you have been seeing, i.e. the missing update to the ufs_release_v1.0 branch for chgres_cube from George Gayno and the missing compiler flags for the GNU compiler for this executable.

jedwards4b commented 4 years ago

This test is using the Intel compiler so I'm not sure what GNU would have to do with it. The biggest difference I see is that you are using cubed_sphere_grid for output_grid and I am using gaussian_grid. I'm looking into this now.

climbfuji commented 4 years ago

The same tests passed with the GNU compilers as well. The setups are identical except for the modules.fv3 files. I can rerun the tests on Cheyenne with GNU and keep the rundirs, but as I said, the differences will be in modules.fv3 and in the actual model output.

uturuncoglu commented 4 years ago

@jedwards4b I tested with output_grid = 'cubed_sphere_grid', but the restart still fails. I'll try to find other possible differences between the namelist files. I could also test using input.nml and model_configure from the following directories:

fv3_ccpp_gfs_v15p2_prod # 0-48h fcst
fv3_ccpp_gfs_v15p2_coldstart_prod # 0-24h fcst
fv3_ccpp_gfs_v15p2_restart_prod # 24-48h fcst, restart files from coldstart run were copied into INPUT

fv3_ccpp_gfs_v16beta_prod # 0-48h fcst
fv3_ccpp_gfs_v16beta_coldstart_prod # 0-24h fcst
fv3_ccpp_gfs_v16beta_restart_prod # 24-48h fcst, restart files from coldstart run were copied into INPUT

uturuncoglu commented 4 years ago

@climbfuji I tested your input.nml with the CIME-built model for v15p2 and we still have differences in the restart. So at least the problem is not related to input.nml. I'll continue to dig, but let me know if you have any other ideas. The runs are in:

Base (48 hours): /glade/scratch/turuncu/ufs-mrweather-app-workflow.c96.jan16/run.base2
Restart (24+24 hours): /glade/scratch/turuncu/ufs-mrweather-app-workflow.c96.jan16/run.rest2

climbfuji commented 4 years ago

I can think of

I need to run this cime setup myself. Will try tomorrow.

uturuncoglu commented 4 years ago

The initial documentation is in

https://ufs-mrapp.readthedocs.io/en/latest/index.html#

I am still working on it, but there is already a lot of information there, especially in the quick start guide.

jedwards4b commented 4 years ago

@climbfuji I ran the cime restart test with your executable and it passed. This points to a difference in the build, perhaps in the build flags, but I also noticed that you were not using the latest model version: 6a93463

climbfuji commented 4 years ago

Yes, the code I had used for the testing didn't include the last PR. But the current PR I have and for which I reran the restart tests does (https://github.com/ufs-community/ufs-weather-model/pull/33).

jedwards4b commented 4 years ago

I built using src/model/tests/compile_cmake.sh and it also passed the restart test - I've been studying the build since and still cannot pinpoint the difference.

climbfuji commented 4 years ago

If you send me build logs (cmake and make; may have to add VERBOSE=1 to the make calls) then I can take a look. Maybe something comes to my mind wrt which files to look at when I stare at this long enough. Thanks ...

jedwards4b commented 4 years ago

/glade/scratch/jedwards/ERS_Lh11.C96.GFSv15p2.cheyenne_intel.try/bld/atm.bldlog.200121-200946.gz

jedwards4b commented 4 years ago

This problem is fixed. The build flags to libfv3core.a were different.

climbfuji commented 4 years ago

Yeah! Thanks for figuring this out, I was struggling all day to find time to look at your compile logs.

mcgibbon commented 4 years ago

Can you please elaborate on the fix @jedwards4b? I'm having the same issue with a different build system.

jedwards4b commented 4 years ago

@mcgibbon I found that the NOAA build was using the flag -fp-model consistent, but the cime build was using -fp-model source when compiling the fv3core library. Changing the cime compile flags to match the NOAA build solved the problem.
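For others comparing two build systems, a rough sketch of spotting this kind of mismatch is to collect the -fp-model values from each verbose build log and compare the sets. The two command lines below are hypothetical stand-ins, not lines from the actual logs, and the helper name is invented for this example.

```python
import re

# Matches both "-fp-model consistent" and "-fp-model=consistent".
FP_MODEL_RE = re.compile(r"-fp-model[ =](\S+)")


def fp_model_settings(build_log_text):
    """Return the set of -fp-model values found in a build log."""
    return set(FP_MODEL_RE.findall(build_log_text))


if __name__ == "__main__":
    # Hypothetical compile lines standing in for the two build logs.
    noaa_log = "ifort -O2 -fp-model consistent -c example.F90"
    cime_log = "ifort -O2 -fp-model source -c example.F90"

    noaa_flags = fp_model_settings(noaa_log)
    cime_flags = fp_model_settings(cime_log)
    print("only in NOAA build:", noaa_flags - cime_flags)
    print("only in cime build:", cime_flags - noaa_flags)
```

Real logs contain many compile lines, so it may be more useful to run this per library or per source file rather than over the whole log at once.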