The test SMS_Lh3_D.C96.GFSv15p2.cheyenne_gnu passes - we used the same input files generated by chgres_cube from this test in the intel test and it still fails, indicating that this is perhaps a model issue and not a chgres_cube issue. This test also fails on Stampede, in module_radiation_astronomy at line 901 of radiation_astronomy.f.
@pjpegion, @climbfuji, @mark-a-potts, @llpcarson
Can you take a look and see what is happening here?
I'm looking into it.
@pjpegion Just for your information, I placed a print statement in FV3/ccpp/physics/physics/radiation_astronomy.f because it was giving the following error:
forrtl: error (73): floating divide by zero
Image PC Routine Line Source
ufs.exe 0000000004E0E53F Unknown Unknown Unknown
libpthread-2.17.s 00002B8BDADEA5D0 Unknown Unknown Unknown
ufs.exe 0000000002CBEF5E module_radiation_ 901 radiation_astronomy.f
ufs.exe 0000000002B8F005 gfs_rrtmg_pre_mp_ 319 GFS_rrtmg_pre.F90
ufs.exe 0000000002B043AB ccpp_fv3_gfs_v16b 112 ccpp_FV3_GFS_v16beta_radiation_cap.F90
ufs.exe 0000000002AF677F ccpp_static_api_m 147 ccpp_static_api.F90
ufs.exe 0000000002AFC165 ccpp_driver_mp_cc 234 CCPP_driver.F90
ufs.exe 0000000000635A43 atmos_model_mod_m 338 atmos_model.F90
ufs.exe 00000000006292D3 module_fcst_grid_ 708 module_fcst_grid_comp.F90
It seems that the operation is protected against a divide-by-zero error, but it fails anyway. The values of istsun vary between 0 and 8.
@uturuncoglu are you getting this traceback on cheyenne or stampede? I'm getting something much more cryptic on cheyenne: MPT: header=header@entry=0x7ffde901cc00 "MPT ERROR: Rank 95(g:95) received signal SIGFPE(8).\n\tProcess ID: 6654, Host: r2i4n15, Program: /glade/scratch/pegion/SMS_Lh3_D.C96.GFSv15p2.cheyenne_intel.try/bld/ufs.exe\n\tMPT Version: HPE MPT 2.19 0"...) at sig.c:340"
It is on Stampede, but when I ran the model again I got the following, so I think it is not predictable.
forrtl: error (65): floating invalid
Image PC Routine Line Source
ufs.exe 0000000004E0E52F Unknown Unknown Unknown
libpthread-2.17.s 00002B76678E75D0 Unknown Unknown Unknown
ufs.exe 0000000002BF5AC5 satmedmfvdifq_mp_ 624 satmedmfvdifq.F
ufs.exe 0000000002B3B6FB ccpp_fv3_gfs_v16b 974 ccpp_FV3_GFS_v16beta_physics_cap.F90
ufs.exe 0000000002AF6A19 ccpp_static_api_m 150 ccpp_static_api.F90
ufs.exe 0000000002AFC165 ccpp_driver_mp_cc 234 CCPP_driver.F90
ufs.exe 0000000000635F79 atmos_model_mod_m 364 atmos_model.F90
ufs.exe 00000000006292D3 module_fcst_grid_ 708 module_fcst_grid_comp.F90
libesmf.so 00002B76638FA181 _ZN5ESMCI6FTable1 Unknown Unknown
libesmf.so 00002B76638FDA8F ESMCI_FTableCallE Unknown Unknown
libesmf.so 00002B7663DC1165 _ZN5ESMCI2VM5ente Unknown Unknown
libesmf.so 00002B76638FB64A c_esmc_ftablecall Unknown Unknown
libesmf.so 00002B7663FEA41D esmf_compmod_mp_e Unknown Unknown
libesmf.so 00002B76641DDEFF esmf_gridcompmod_ Unknown Unknown
ufs.exe 0000000000606419 fv3gfs_cap_mod_mp 999 fv3_cap.F90
If you have access to Stampede, it might help to find the source of the problem.
I don't have an account there, so not sure how much help I can be. I will do what I can on cheyenne since the model fails there in debug mode.
What about FV3 if you compile and run it outside of CIME? Is it failing in the same way in debug mode?
I ran it outside of CIME and I get the same error. (I also ran the debug executable in the directory of a successful run and it fails, which points to the model and not to anything in the run setup.) I haven't tried to compile it outside of CIME yet; I will try that next.
Adding @junwang-noaa @DusanJovic-NOAA and @climbfuji so that they are aware of this issue
Recommend looking at compiler options and, more likely, initial conditions. The test runs fine with the regression test input data (i.e. using rt.sh) on Cheyenne (GNU, Intel) and Hera (Intel) in PROD, REPRO and DEBUG mode.
If I find time I will take a look.
@uturuncoglu compiling outside of CIME with Debug on, the model runs to completion.
@climbfuji can you tell me where in CIME the compiler flags are set?
Thanks.
@pjpegion In cime you can examine the file bld/atm.bldlog.*.gz to see the compiler flags used.
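For example, a quick way to pull the compile lines (and therefore the flags) out of the gzipped log; the compiler names and the exact log path below are assumptions, so adjust them for your case directory:

```bash
# Search the gzipped build log for Fortran compile lines to see which flags were actually used.
# The compiler names (ifort/gfortran) and the log location are assumptions; adjust for your case.
zgrep -E 'ifort|gfortran' bld/atm.bldlog.*.gz | head
```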
When you compile outside of cime are you using the same initial conditions as those generated in the cime case?
If you are using Intel compiler all compiler flags are set in cmake/Intel.cmake. For gnu compiler they are set in cmake/GNU.cmake.
@DusanJovic-NOAA Thanks
It would be better to clarify whether the executable created outside of CIME is failing with the CIME-generated (using chgres) initial conditions or not. We had a problem with chgres before (see #58); it was fixed on the NCEPLIBS side, but there might still be an issue related to chgres.
@uturuncoglu I can run the SMS_Lh3_D.C96.GFSv15p2.cheyenne_intel case with the model compiled outside of CIME.
@DusanJovic-NOAA The CIME build does not use the flags in those files. The CIME compiler flags are defined in cime/config/ufs/machines/config_compilers.xml.
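As a rough sketch of how to see what CIME appends in debug mode (the DEBUG="TRUE" attribute name is an assumption about the file's schema; inspect the file itself if it differs):

```bash
# Show the flag entries config_compilers.xml appends when the case has DEBUG set to TRUE.
# The DEBUG="TRUE" attribute is an assumption about the XML schema; adjust the pattern if needed.
grep -n -B 2 -A 4 'DEBUG="TRUE"' cime/config/ufs/machines/config_compilers.xml
```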
@jedwards4b Thanks. I didn't know that. I wonder how those flags are passed from CIME to ufs-weather-model's cmake build.
@DusanJovic-NOAA It's a little convoluted: a file called Macros.cmake is created in the case directory. There is also a file, FV3/cime/cime_config/configure_cime.cmake, that includes that Macros file and translates the variable names as set in CIME to those expected by ufs_weather_model. That configure_cime.cmake file is copied to the src/model/cmake directory and used by the model's cmake build.
I did find a difference in that cime is using the mkl library while the noaa build is not, but I turned off mkl in my sandbox and rebuilt - it still fails in the same way.
@DusanJovic-NOAA It may also be of interest to note that the ccpp physics package ignores any flags set by CIME or by ufs-weather-model and sets its own.
Yes. This is a "known issue"/"feature". The flags have been set such that the ccpp-physics code gives b4b identical results with the previous IPD physics code (in what we called REPRO mode) or, more generally, such that the ccpp-physics code is compiled with exactly the same flags as the previous IPD physics code (in DEBUG, REPRO and PROD mode). If the CIME flags are different, then they are most likely incorrect, because they have not been tested/vetted by the ufs-weather-model. If we need to accommodate other SIMD instruction sets, then please let us know and we will make this work.
I think I've solved the problem. I had to remove the debug flags -ftrapuv and -fpe0. I submit that this does not indicate that the CIME flags are incorrect; rather, it indicates that there are questionable floating point values in the model, and removing these flags avoids trapping them. I'll run the full set of tests overnight and update the issue in the morning.
Wohoo. I take your point, but please note that the DEBUG flags we use (see https://github.com/ufs-community/ufs-weather-model/blob/52795b83f0febae0fe030d5cb1da3e5bbafba5e8/cmake/Intel.cmake#L36 for the develop branch, and https://github.com/ufs-community/ufs-weather-model/blob/2487a7b9736b516b5c1faba6f4f88bf3e7b82053/cmake/Intel.cmake#L36 for the ufs_public_release branch) do contain "-ftrapuv -fpe0". And the regression tests for GFS_v15p2 and GFS_v16beta do pass in DEBUG mode (for 24h forecasts); see https://github.com/ufs-community/ufs-weather-model/blob/ufs_public_release/tests/rt.conf for the regression testing config. Does this mean the ball is back in the "initial conditions" court?
Is it possible for you to use the initial conditions we use for the regression tests (i.e. bypass chgres_cube and only run the model using those ICs)?
@climbfuji I was using compile_cmake.sh with REPRO=Y DEBUG=Y for comparison, and I see from your link that REPRO overrides DEBUG, so I wasn't getting the -ftrapuv and -fpe0 flags in your build. So I rebuilt with REPRO=N, and your build runs with those flags, so I think I'm back to square one.
But that led to the solution, because I was also setting both flags in the CIME build. So now, with DEBUG on (and -ftrapuv and -fpe0 included) and REPRO off, the tests are passing. (In CIME the combination had a different effect than in the noaa build: in the noaa build combining the flags turned off the debug flags, but in CIME the debug flags were on while ccpp was built with CMAKE_BUILD_TYPE=Repro instead of CMAKE_BUILD_TYPE=DEBUG.)
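A quick way to confirm which build type the ccpp sub-build actually picked up is to look at its CMake cache; the build-tree layout below is an assumption, so point the command at your actual build directory:

```bash
# Report the CMAKE_BUILD_TYPE recorded in every CMake cache under the build tree,
# which shows whether the ccpp sub-project was configured as Debug, Repro, or Release.
# Replace BUILD_DIR with the actual build directory; the layout is an assumption.
find BUILD_DIR -name CMakeCache.txt -exec grep -H 'CMAKE_BUILD_TYPE' {} \;
```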
Wow. Good job. I thought we had added a guard in compile_cmake.sh that would prevent setting both of them to true. If not, we should do that (and you should do the same in CIME, in case the user can control that).
@jedwards4b can you let me know what you changed so I can test it? Thanks, Phil
Just to check my understanding - in REPRO mode CCPP does not pass with the floating point debug checks on. This indicates that there actually is some underlying floating point issue in that mode, and that implies that it was a preexisting problem with IPD but it was important to reproduce the exact same behavior in CCPP for validation purposes. Is this correct?
So, what is the future of the REPRO flag moving forward? Was that something that was only needed for a period to validate CCPP? Will future releases remove this option entirely?
It is too late to resolve any floating point problems now, so will we list in the "known bugs" that this issue exists and should be expected?
Is it also true that with REPRO=off and DEBUG=true that all tests pass? In other words, when CCPP is not forced to reproduce the old IPD behavior, the floating point problems are actually resolved?
This is way too complicated to reply to in a GitHub issue. Bottom line is that your assumption is not correct. Mixing REPRO and DEBUG flags doesn't make any sense. You can use REPRO=N DEBUG=N (or omit both entirely, because these are the defaults) to get the PROD flags, REPRO=Y DEBUG=N (again, can omit DEBUG=N) to get the REPRO flags, or REPRO=N DEBUG=Y (again, can omit REPRO=N) to get the DEBUG flags. For each of those three, the tests run and pass (using our regression testing config - and I believe the same is true for CIME, please confirm).
@pjpegion I'll update the ufs_mrweather and let you know.
@rsdunlapiv what @climbfuji says is correct - but I believe that having ccpp set its own flags independent of the flags set for ufs_weather or cime is a problem. We need to be able to build the entire model from a consistent set of compiler flags defined in a central location.
@climbfuji thanks for clarifying - I guess since it is a complex issue, the bigger picture question is what does the end user need to be aware of, and what is considered a technical detail to be managed by the workflow and build teams? In other words, does anyone really need to know about the REPRO/DEBUG combinations at the user level? If so, then we'd want to try to document the details in an understandable way. But if this is really an esoteric thing, maybe we just make sure the flags are consistent whether they are set through CIME or the model build - but the user really doesn't need to mess with it. Thoughts?
The first thing I will do is check whether there is a guard in compile_cmake.sh or not. If both DEBUG=Y and REPRO=Y are set, the script should return an error rather than overwrite one or the other. I think the user needs to know about DEBUG=Y/N, but not about REPRO (this is only for testing CCPP against IPD).
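A minimal sketch of such a guard, assuming compile_cmake.sh sees REPRO and DEBUG as Y/N shell variables (the variable names follow the usage in this thread; the real script may parse them differently):

```bash
# Fail early if both modes are requested, instead of silently letting one override the other.
if [[ "${DEBUG:-N}" == "Y" && "${REPRO:-N}" == "Y" ]]; then
  echo "ERROR: DEBUG=Y and REPRO=Y are mutually exclusive; choose one." >&2
  exit 1
fi
```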
@climbfuji glad to hear that REPRO is not user facing (I think it would be hard to explain this to a general audience). So, REPRO will be handled internally. I agree that DEBUG mode is a user-facing option and they should be aware of how to activate it.
Users also do not need to know anything about the two compile scripts in the tests directory. Those scripts are internal to the regression test and must be left undocumented; we will be changing them as we need to support various regression test requirements. The only supported way of building ufs-weather-model is the build.sh script in the top-level directory, which is what is documented here:
Let's close the issue once the guard has been added to compile_cmake.sh.
All tests are now passing on cheyenne with intel and gnu. @pjpegion if you would like to test again, the head of ufs-mrweather-app master (hash c21d2860) has all the externals up to date.
@pjpegion I made a mistake in updating ufs-mrweather-app, the corrected hash is 49f3b54.
@jedwards4b is this ticket closed now?
Yes
We thought that #58 was solved with an update to chgres_cube, however the same test is still failing with 3 different tracebacks: