ufs-community / ufs-mrweather-app

UFS Medium-Range Weather Application

Test SMS_Lh3_D.C96.GFSv15p2.cheyenne_intel failing #69

Closed jedwards4b closed 4 years ago

jedwards4b commented 4 years ago

We thought that #58 was solved with an update to chgres_cube; however, the same test is still failing with three different tracebacks:

14:MPT: #6  gfdl_cloud_microphys::gfdl_cloud_microphys_run (
14:MPT:     levs=<error reading variable: Cannot access memory at address 0x3>, 
14:MPT:     im=<error reading variable: Cannot access memory at address 0x0>, con_g=0, 
14:MPT:     con_fvirt=0, con_rd=0, frland=..., garea=..., islmsk=..., gq0=..., 
14:MPT:     gq0_ntcw=..., gq0_ntrw=..., gq0_ntiw=..., gq0_ntsw=..., gq0_ntgl=..., 
14:MPT:     gq0_ntclamt=..., gt0=..., gu0=..., gv0=..., vvl=..., prsl=..., phii=..., 
14:MPT:     del=..., rain0=..., ice0=..., snow0=..., graupel0=..., prcp0=..., sr=..., 
14:MPT:     dtp=450, hydrostatic=.FALSE., phys_hydrostatic=4294967295, lradar=.FALSE., 
14:MPT:     refl_10cm=..., reset=4294967295, effr_in=4294967295, rew=..., rei=..., 
14:MPT:     rer=..., res=..., reg=..., errmsg=..., errflg=0, .tmp.ERRMSG.len_V$12a=512)
14:MPT:     at /glade/scratch/jedwards/SMS_Lh3_D.C96.GFSv15p2.cheyenne_intel.20200123_114400_ntheci/bld/atm/obj/FV3/ccpp/physics/physics/gfdl_cloud_microphys.F90:263
94:MPT: #6  hedmf::hedmf_run (ix=959008295, 
94:MPT:     im=<error reading variable: Cannot access memory at address 0x2>, km=-1, 
94:MPT:     ntrac=<error reading variable: Cannot access memory at address 0x14>, 
94:MPT:     ntcw=973450256, dv=..., du=..., tau=..., rtg=..., u1=..., v1=..., t1=..., 
94:MPT:     q1=..., swh=..., hlw=..., xmu=..., psk=..., rbsoil=..., zorl=..., 
94:MPT:     u10m=..., v10m=..., fm=..., fh=..., tsea=..., heat=..., evap=..., 
94:MPT:     stress=..., spd1=..., kpbl=..., prsi=..., del=..., prsl=..., prslk=..., 
94:MPT:     phii=..., phil=..., delt=450, dspheat=4294967295, dusfc=..., dvsfc=..., 
94:MPT:     dtsfc=..., dqsfc=..., hpbl=..., hgamt=..., hgamq=..., dkt=..., kinver=..., 
94:MPT:     xkzm_m=1, xkzm_h=1, xkzm_s=1, lprnt=.FALSE., ipr=10, 
94:MPT:     xkzminv=0.29999999999999999, moninq_fac=1, errmsg=..., errflg=0, 
94:MPT:     .tmp.ERRMSG.len_V$f8=512)
94:MPT:     at /glade/scratch/jedwards/SMS_Lh3_D.C96.GFSv15p2.cheyenne_intel.20200123_114400_ntheci/bld/atm/obj/FV3/ccpp/physics/physics/moninedmf.f:511
41:MPT: #6  0x0000000002e1371e in module_radiation_astronomy::coszmn (xlon=..., 
41:MPT:     sinlat=<error reading variable: Cannot access memory at address 0x60>, 
41:MPT:     coslat=<error reading variable: Cannot access memory at address 0x60>, 
41:MPT:     solhr=<error reading variable: Cannot access memory at address 0x12>, 
41:MPT:     im=<error reading variable: Cannot access memory at address 0x8>, 
41:MPT:     me=<error reading variable: Cannot access memory at address 0x8>, 
41:MPT:     coszen=..., coszdg=...)
41:MPT:     at /glade/scratch/jedwards/SMS_Lh3_D.C96.GFSv15p2.cheyenne_intel.20200123_114400_ntheci/bld/atm/obj/FV3/ccpp/physics/physics/radiation_astronomy.f:901
jedwards4b commented 4 years ago

The test SMS_Lh3_D.C96.GFSv15p2.cheyenne_gnu passes. We used the same input files generated by chgres_cube from that test in the intel test and it still fails, which indicates that this is perhaps a model issue and not a chgres_cube issue. This test also fails on stampede, in module_radiation_astronomy at line 901 of radiation_astronomy.f.

arunchawla-NOAA commented 4 years ago

@pjpegion, @climbfuji @mark-a-potts @llpcarson

Can you take a look and see what is happening here?

pjpegion commented 4 years ago

I'm looking into it.

uturuncoglu commented 4 years ago

@pjpegion Just for your information, I placed a print statement in FV3/ccpp/physics/physics/radiation_astronomy.f because it was giving the following error:

forrtl: error (73): floating divide by zero
Image              PC                Routine            Line        Source
ufs.exe            0000000004E0E53F  Unknown               Unknown  Unknown
libpthread-2.17.s  00002B8BDADEA5D0  Unknown               Unknown  Unknown
ufs.exe            0000000002CBEF5E  module_radiation_         901  radiation_astronomy.f
ufs.exe            0000000002B8F005  gfs_rrtmg_pre_mp_         319  GFS_rrtmg_pre.F90
ufs.exe            0000000002B043AB  ccpp_fv3_gfs_v16b         112  ccpp_FV3_GFS_v16beta_radiation_cap.F90
ufs.exe            0000000002AF677F  ccpp_static_api_m         147  ccpp_static_api.F90
ufs.exe            0000000002AFC165  ccpp_driver_mp_cc         234  CCPP_driver.F90
ufs.exe            0000000000635A43  atmos_model_mod_m         338  atmos_model.F90
ufs.exe            00000000006292D3  module_fcst_grid_         708  module_fcst_grid_comp.F90

It seems that the operation is protected against divide-by-zero, but it fails anyway. The values of istsun vary between 0 and 8.

pjpegion commented 4 years ago

@uturuncoglu are you getting this traceback on cheyenne or stampede? I'm getting something much more cryptic on cheyenne: MPT: header=header@entry=0x7ffde901cc00 "MPT ERROR: Rank 95(g:95) received signal SIGFPE(8).\n\tProcess ID: 6654, Host: r2i4n15, Program: /glade/scratch/pegion/SMS_Lh3_D.C96.GFSv15p2.cheyenne_intel.try/bld/ufs.exe\n\tMPT Version: HPE MPT 2.19 0"...) at sig.c:340"

uturuncoglu commented 4 years ago

It is on Stampede, but when I ran the model again I got the following. So I think it is not predictable.

forrtl: error (65): floating invalid
Image              PC                Routine            Line        Source
ufs.exe            0000000004E0E52F  Unknown               Unknown  Unknown
libpthread-2.17.s  00002B76678E75D0  Unknown               Unknown  Unknown
ufs.exe            0000000002BF5AC5  satmedmfvdifq_mp_         624  satmedmfvdifq.F
ufs.exe            0000000002B3B6FB  ccpp_fv3_gfs_v16b         974  ccpp_FV3_GFS_v16beta_physics_cap.F90
ufs.exe            0000000002AF6A19  ccpp_static_api_m         150  ccpp_static_api.F90
ufs.exe            0000000002AFC165  ccpp_driver_mp_cc         234  CCPP_driver.F90
ufs.exe            0000000000635F79  atmos_model_mod_m         364  atmos_model.F90
ufs.exe            00000000006292D3  module_fcst_grid_         708  module_fcst_grid_comp.F90
libesmf.so         00002B76638FA181  _ZN5ESMCI6FTable1     Unknown  Unknown
libesmf.so         00002B76638FDA8F  ESMCI_FTableCallE     Unknown  Unknown
libesmf.so         00002B7663DC1165  _ZN5ESMCI2VM5ente     Unknown  Unknown
libesmf.so         00002B76638FB64A  c_esmc_ftablecall     Unknown  Unknown
libesmf.so         00002B7663FEA41D  esmf_compmod_mp_e     Unknown  Unknown
libesmf.so         00002B76641DDEFF  esmf_gridcompmod_     Unknown  Unknown
ufs.exe            0000000000606419  fv3gfs_cap_mod_mp         999  fv3_cap.F90
uturuncoglu commented 4 years ago

If you have access to Stampede, it might help to find the source of the problem.

pjpegion commented 4 years ago

I don't have an account there, so not sure how much help I can be. I will do what I can on cheyenne since the model fails there in debug mode.

uturuncoglu commented 4 years ago

What about FV3 if you compile and run it outside of CIME? Does it fail in the same way in debug mode?

pjpegion commented 4 years ago

I ran it outside of CIME and I get the same error. (I also ran the debug executable in the directory of a successful run and it fails, which points to the model and not to anything in the run setup.) I haven't tried to compile it outside of CIME yet; I will try that next.

arunchawla-NOAA commented 4 years ago

Adding @junwang-noaa @DusanJovic-NOAA and @climbfuji so that they are aware of this issue

climbfuji commented 4 years ago

Recommend looking at compiler options and, more likely, initial conditions. The test runs fine with the regression test input data (i.e. using rt.sh) on Cheyenne (GNU, Intel) and Hera (Intel) in PROD, REPRO and DEBUG mode.

If I find time I will take a look.

pjpegion commented 4 years ago

@uturuncoglu Compiling outside of CIME with Debug on, the model runs to completion.
@climbfuji can you tell me where in CIME the compiler flags are set?
Thanks.

jedwards4b commented 4 years ago

@pjpegion In cime you can examine the file bld/atm.bldlog.*.gz to see the compiler flags used.
When you compile outside of cime are you using the same initial conditions as those generated in the cime case?
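
For example, something along these lines (the grep pattern is only an illustration; adjust it to whatever compiler wrapper the log actually shows) will pull the Fortran compile commands and their flags out of the gzipped log mentioned above:

    # list a few of the Fortran compile commands recorded in the CIME build log
    zgrep -E 'mpif90|ifort' bld/atm.bldlog.*.gz | head -n 5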

DusanJovic-NOAA commented 4 years ago

If you are using Intel compiler all compiler flags are set in cmake/Intel.cmake. For gnu compiler they are set in cmake/GNU.cmake.

pjpegion commented 4 years ago

@DusanJovic-NOAA Thanks

uturuncoglu commented 4 years ago

It would be better to clarify whether the executable created outside of CIME fails with the CIME-generated (using chgres) initial conditions or not. We had a problem with chgres before (see #58) and it was fixed on the NCEPLIBS side, but there might still be an issue related to chgres.

pjpegion commented 4 years ago

@uturuncoglu I can run the SMS_Lh3_D.C96.GFSv15p2.cheyenne_intel case with the model compiled outside of CIME.

jedwards4b commented 4 years ago

@DusanJovic-NOAA The CIME build does not use the flags in those files. CIME compiler flags are defined in cime/config/ufs/machines/config_compilers.xml.

DusanJovic-NOAA commented 4 years ago

@jedwards4b Thanks. I didn't know that. I wonder how those flags are passed from CIME to ufs-weather-model's cmake build.

jedwards4b commented 4 years ago

@DusanJovic-NOAA It's a little convoluted: a file called Macros.cmake is created in the case directory. There is also a file, FV3/cime/cime_config/configure_cime.cmake, that includes that Macros file and translates the variable names as set in CIME into those expected by ufs-weather-model. That configure_cime.cmake file is copied to the src/model/cmake directory and used by the model cmake build.

jedwards4b commented 4 years ago

I did find a difference in that CIME is using the mkl library while the NOAA build is not, but I turned off mkl in my sandbox and rebuilt; it still fails in the same way.

jedwards4b commented 4 years ago

@DusanJovic-NOAA It may also be of interest to note that the ccpp physics package ignores any flags set by CIME or by ufs-weather-model and sets its own.

climbfuji commented 4 years ago

Yes. This is a "known issue"/"feature". The flags have been set such that the ccpp-physics code gives b4b identical results with the previous IPD physics code (in what we called REPRO mode) or, more generally, such that the ccpp-physics are compiled with exactly the same flags as the previous IPD physics code (in DEBUG, REPRO and PROD mode). If the CIME flags are different, then they are most likely incorrect, because they have not been tested/vetted with the ufs-weather-model. If we need to accommodate other SIMD instruction sets, then please let us know and we will make this work.

jedwards4b commented 4 years ago

I think I've solved the problem. I had to remove the debug flags -ftrapuv and -fpe0. I submit that this does not indicate that the CIME flags are incorrect; rather, it indicates that there are questionable floating point values in the model, and removing these flags avoids trapping them. I'll run the full set of tests overnight and update the issue in the morning.

climbfuji commented 4 years ago

Woohoo. I take your point, but please note that the DEBUG flags we use (see https://github.com/ufs-community/ufs-weather-model/blob/52795b83f0febae0fe030d5cb1da3e5bbafba5e8/cmake/Intel.cmake#L36 for the develop branch, and https://github.com/ufs-community/ufs-weather-model/blob/2487a7b9736b516b5c1faba6f4f88bf3e7b82053/cmake/Intel.cmake#L36 for the ufs_public_release branch) do contain "-ftrapuv -fpe0". And the regression tests for GFS_v15p2 and GFS_v16beta do pass in DEBUG mode (for 24h forecasts); see https://github.com/ufs-community/ufs-weather-model/blob/ufs_public_release/tests/rt.conf for the regression testing config. Does this mean the ball is back in the "initial conditions" court?

Is it possible for you to use the initial conditions we use for the regression tests (i.e. bypass chgres_cube and only run the model using those ICs)?

jedwards4b commented 4 years ago

@climbfuji I was using compile_cmake.sh with REPRO=Y DEBUG=Y for comparison, and I see from your link that REPRO overrides DEBUG, so I wasn't getting the ftrapuv and fpe0 flags in your build. So I rebuilt with REPRO=N, and your build runs with those flags, so I think I'm back to square one.

jedwards4b commented 4 years ago

But that led to the solution, because I was also setting both flags in the CIME build. So now, with DEBUG on (and ftrapuv and fpe0 included) and REPRO off, the tests are passing. (In CIME the combination had a different effect than in the NOAA build: in the NOAA build combining the flags turned off the debug flags, but in CIME the debug flags were on while ccpp was built with CMAKE_BUILD_TYPE=Repro instead of CMAKE_BUILD_TYPE=DEBUG.)

climbfuji commented 4 years ago

Wow. Good job. I thought we had added a guard in compile_cmake.sh that would prevent setting both of them to true. If not, we should do that (and you should do the same in CIME, in case the user can control that).

pjpegion commented 4 years ago

@jedwards4b can you let me know what you changed so I can test it? Thanks, Phil

rsdunlapiv commented 4 years ago

Just to check my understanding - in REPRO mode CCPP does not pass with the floating point debug checks on. This indicates that there actually is some underlying floating point issue in that mode, and that implies that it was a preexisting problem with IPD but it was important to reproduce the exact same behavior in CCPP for validation purposes. Is this correct?

So, what is the future of the REPRO flag moving forward? Was that something that was only needed for a period to validate CCPP? Will future releases remove this option entirely?

It is too late to resolve any floating point problems now, so will we list in the "known bugs" that this issue exists and should be expected?

Is it also true that with REPRO=off and DEBUG=true that all tests pass? In other words, when CCPP is not forced to reproduce the old IPD behavior, the floating point problems are actually resolved?

climbfuji commented 4 years ago

This is way too complicated to reply to in a GitHub issue. The bottom line is that your assumption is not correct. Mixing REPRO and DEBUG flags doesn't make any sense. You can use REPRO=N DEBUG=N (or omit them entirely, because these are the defaults) to get the PROD flags, REPRO=Y DEBUG=N (DEBUG=N can be omitted) to get the REPRO flags, or REPRO=N DEBUG=Y (REPRO=N can be omitted) to get the DEBUG flags. For each of those three, the tests run and pass (using our regression testing config - and I believe the same is true for CIME, please confirm).
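
As a minimal shell sketch of that mapping (assuming both settings default to N when unset; this only illustrates the combinations, it is not the actual logic in compile_cmake.sh):

    # effective build mode for the three supported REPRO/DEBUG combinations
    REPRO=${REPRO:-N}
    DEBUG=${DEBUG:-N}
    if [ "$REPRO" = "Y" ]; then
      echo "REPRO flags"
    elif [ "$DEBUG" = "Y" ]; then
      echo "DEBUG flags"
    else
      echo "PROD flags"   # the default when neither is set
    fi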

jedwards4b commented 4 years ago

@pjpegion I'll update the ufs_mrweather and let you know.

@rsdunlapiv what @climbfuji says is correct - but I believe that having ccpp set its own flags, independent of the flags set for ufs-weather-model or CIME, is a problem. We need to be able to build the entire model from a consistent set of compiler flags defined in a central location.

rsdunlapiv commented 4 years ago

@climbfuji thanks for clarifying - I guess since it is a complex issue, the bigger-picture question is: what does the end user need to be aware of, and what is considered a technical detail to be managed by the workflow and build teams? In other words, does anyone really need to know about the REPRO/DEBUG combinations at the user level? If so, then we'd want to try to document the details in an understandable way. But if this is really an esoteric thing, maybe we just make sure the flags are consistent whether they are set through CIME or the model build - but the user really doesn't need to mess with it. Thoughts?

climbfuji commented 4 years ago

The first thing I will do is check whether there is a guard in compile_cmake.sh. If both DEBUG=Y and REPRO=Y are set, the script should return an error rather than silently overwrite one or the other. I think the user needs to know about DEBUG=Y/N, but not about REPRO (this is only for testing CCPP against IPD).
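
A minimal sketch of such a guard, assuming compile_cmake.sh ends up with DEBUG and REPRO in shell variables of those names (the actual script may parse its options differently):

    # refuse contradictory settings instead of silently letting one override the other
    if [ "${DEBUG:-N}" = "Y" ] && [ "${REPRO:-N}" = "Y" ]; then
      echo "ERROR: DEBUG=Y and REPRO=Y are mutually exclusive; choose one." >&2
      exit 1
    fi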

rsdunlapiv commented 4 years ago

@climbfuji glad to hear that REPRO is not user facing (I think it would be hard to explain this to a general audience). So, REPRO will be handled internally. I agree that DEBUG mode is a user-facing option and they should be aware of how to activate it.

DusanJovic-NOAA commented 4 years ago

Users also do not need to know anything about the two compile scripts in the tests directory. Those scripts are internal to the regression tests and must be left undocumented; we will be changing them as needed to support various regression test requirements. The only supported way of building ufs-weather-model is the build.sh script in the top-level directory, which is what is documented here:

https://ufs-mr-weather-app.readthedocs.io/projects/ufs-weather-model/en/latest/CompilingCodeWithoutApp.html

climbfuji commented 4 years ago

Let's close the issue once the guard has been added to compile_cmake.sh.

jedwards4b commented 4 years ago

All tests are now passing on cheyenne with intel and gnu. @pjpegion if you would like to test again, the head of ufs-mrweather-app master (hash c21d2860) has all the externals up to date.

jedwards4b commented 4 years ago

@pjpegion I made a mistake in updating ufs-mrweather-app, the corrected hash is 49f3b54.

arunchawla-NOAA commented 4 years ago

@jedwards4b is this ticket closed now?

jedwards4b commented 4 years ago

Yes