Closed rsdunlapiv closed 4 years ago
@rsdunlapiv I believe that is just the place where an un-physical temperature triggers the array index to go out of bounds (it is associated with the calculation of the saturation specific humidity.)
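To illustrate the failure mode (this is a hypothetical Python sketch, not the actual GFS physics code; the table bounds and spacing below are invented for illustration): saturation lookup tables are pre-computed over a fixed temperature range, and the table index is derived from the input temperature, so an un-physical temperature maps to an index outside the table, which a bounds-checked debug build traps.

```python
# Illustrative sketch only -- TABLE_TMIN, TABLE_DT, and TABLE_SIZE are
# assumed values, not the real GFS saturation table parameters.
TABLE_TMIN = 180.0   # assumed lower bound of the table, in K
TABLE_DT = 0.1       # assumed table spacing, in K
TABLE_SIZE = 2611    # covers 180 K .. 441 K in this sketch

def table_index(t_kelvin):
    """Raw index into the saturation table; no clamping applied."""
    return int((t_kelvin - TABLE_TMIN) / TABLE_DT)

def table_index_clamped(t_kelvin):
    """Defensive variant: clamp the index back into the valid range."""
    return min(max(table_index(t_kelvin), 0), TABLE_SIZE - 1)

# A physically reasonable temperature stays in range:
assert 0 <= table_index(288.15) < TABLE_SIZE
# An un-physical temperature (e.g. from bad initial conditions) does not,
# which is what an array-bounds check in a debug build catches:
assert table_index(-1.0e5) < 0
# ...while the clamped lookup stays valid:
assert table_index_clamped(-1.0e5) == 0
```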
Dusan merged an update to the ufs_public_release earlier today, part of the update was to address regression test failures in debug mode and to enable those tests for both 15p2 and 16beta as standard regression tests. These tests passed on Cheyenne with Intel and GNU and on Hera with Intel; they are based on the C96 configurations. See https://github.com/ufs-community/ufs-weather-model/pull/25 and https://github.com/ufs-community/ufs-weather-model/blob/ufs_public_release/tests/rt.conf. Many questions: resolution, setup (namelist etc.), initial conditions? Can you point me to the run directory on Cheyenne, please?
Some of these questions are answered in the CIME test name: SMS_D.C96.GFSv15p2.cheyenne_intel
The resolution is C96
The CCPP suite is GFS v15p2
Machine is cheyenne
Compiler is intel
Initial conditions are 2019-09-09 00
Case directory is /glade/scratch/jedwards/SMS_D.C96.GFSv15p2.cheyenne_intel.G.grp
Run directory is /glade/scratch/jedwards/SMS_D.C96.GFSv15p2.cheyenne_intel.G.grp/run
I am using commit bde62f9116cc9bdebaae0c6057090fe468eae917
Author: Dom Heinzeller climbfuji@ymail.com
Date: Mon Jan 13 06:46:37 2020 -0700
Which is the latest available.
I get a similar error on stampede using v16beta at C96 resolution:
forrtl: error (65): floating invalid
Image PC Routine Line Source
ufs.exe 0000000004DC4E6F Unknown Unknown Unknown
libpthread-2.17.s 00002AD70FCA05D0 Unknown Unknown Unknown
ufs.exe 0000000002C9304B Unknown Unknown Unknown
ufs.exe 0000000002B48CA1 Unknown Unknown Unknown
ufs.exe 0000000002AA9A85 Unknown Unknown Unknown
ufs.exe 0000000002A9A8D3 ccpp_static_api_m 147 ccpp_static_api.F90
ufs.exe 0000000002AA02B9 ccpp_driver_mp_cc 234 CCPP_driver.F90
ufs.exe 000000000063684F atmos_model_mod_m 338 atmos_model.F90
ufs.exe 000000000062A0DE module_fcstgrid 707 module_fcst_grid_comp.F90
On stampede using v15p2 at C96:
floating divide by zero
Image PC Routine Line Source
ufs.exe 0000000004DBBC8F Unknown Unknown Unknown
libpthread-2.17.s 00002AFB84DDD5D0 Unknown Unknown Unknown
ufs.exe 0000000002CCDE5E Unknown Unknown Unknown
ufs.exe 0000000002C22E39 Unknown Unknown Unknown
ufs.exe 0000000002B74C17 Unknown Unknown Unknown
ufs.exe 0000000002AA3D13 Unknown Unknown Unknown
ufs.exe 0000000002A9A72F ccpp_static_api_m 145 ccpp_static_api.F90
ufs.exe 0000000002A9F502 ccpp_driver_mp_cc 197 CCPP_driver.F90
ufs.exe 0000000000634AB0 atmos_model_mod_m 295 atmos_model.F90
ufs.exe 000000000062A0DE module_fcstgrid 707 module_fcst_grid_comp.F90
I am at AMS this week and don't have much time to look into this. The easiest way forward imo is to compare the run directory (input files, namelist, ...) of your CIME setup to the ufs-weather-model regression test setup (which uses rt.sh to run and which completes successfully) on Cheyenne using the Intel compiler. I can point you to a directory containing a successful run if that helps.
Please point me to a successful run with debug flags enabled and I will compare.
Jim, see
/glade/work/heinzell/fv3/debug_tests_for_cime_20200114/fv3_ccpp_gfs_v15p2_debug_prod/
/glade/work/heinzell/fv3/debug_tests_for_cime_20200114/fv3_ccpp_gfs_v16beta_debug_prod/
These are C96 test cases as in your CIME setup, and both run to completion for a 6h forecast when the model is compiled with DEBUG=Y.
@pjpegion I am ready to enlist your help. Instructions for running the tests on cheyenne are here: https://docs.google.com/document/d/13nvpIS_q87ttjjHwB9f8OFXX7YI00DM5O9V-gk04yAY/edit?usp=sharing
@llpcarson @julieschramm Let's run this test on Cheyenne and use it to review (and update, if needed) the Weather Model User's Guide on the directory structure and lists of input/output files. Keep in mind that this run uses CIME; the WM UG should be relevant to those using CIME as well as to those running the model in other ways.
Did the comparison with the run directories that I gave @jedwards4b lead to any insight? I am not sure it makes sense to have more people try to run and debug this unless we understand why the regression tests in the ufs-weather-model run to completion in DEBUG mode while the CIME runs don't.
I found a couple of differences that I didn't understand and tried changing my values to yours - it didn't make any difference. It could just be due to the different initial conditions. Or it could be due to different build flags - but I didn't see any build output in the directory you pointed me to.
I think that it does make sense to have @pjpegion and @ligiabernardet and others become familiar with cime build and testing even if it doesn't lead to any insights regarding the test failure.
Ok, thanks for the info. I am happy to take a look as well.
@jedwards4b I'm following your instructions and I ran into two problems so far.
1. I had to add --project
and there is more info in /glade/scratch/pegion/SMS_D_Lh5.C96.GFSv15p2.cheyenne_intel.try/TestStatus.log
@pjpegion This looks like a python version issue - what python are you using?
I think that @uturuncoglu is using 2.7.13 and hasn't tested with python3 yet. I will fix this, but in the meantime could you please try with the default python on cheyenne?
I see my python is defaulting to /glade/u/home/pegion/miniconda3/bin/python I will change that and try again. Thanks.
This release is only compatible with Python 2.7.x (also because CCPP works only with those versions).
For the next anticipated release of the UFS (with SAR etc.) later this year we will hopefully be able to support Python 3. Before then I don't see a chance to rewrite the code to work with Python 3, and in any case it will probably take years for Python 2.7 to completely disappear from HPCs and standard OS installations.
CIME is fully compatible with and tested with python 3.6 as well as python 2.7. The fv3_interface issue should be easy to fix.
@jedwards4b The model now builds and the run starts, but it crashes in initialization. The log file is /glade/scratch/pegion/SMS_D_Lh5.C96.GFSv15p2.cheyenne_intel.try/run/ufs.log.238590.chadmin1.ib0.cheyenne.ucar.edu.200115-095701
Can someone point me to the job submission script for this job, please? Thanks ...
@climbfuji I am running out of /glade/work/pegion/UFS/ufs-mrweather-app/cime/scripts command is ./create_test SMS_D_Lh5.C96.GFSv15p2 --workflow ufs-mrweather_wo_post --test-id try --project P93300042
Thanks, but I don't know how to find the actual job submission script (the file that contains the #PBS configuration entries and the mpiexec_mpt calls) from there. Maybe the CIME folks can help? We should always write/copy this job submission script into the run directory under a filename like job_card, because many developers who are used to rerunning parts of the workflow manually will want it. It is also good for documentation purposes, in my opinion.
@pjpegion Now we are in the same place - I am trying to understand and fix this failure.
86:MPT: #6 0x0000000002d0e761 in fv_sat_adj_mp_fv_sat_adj_work_ ()
86:MPT: at /glade/scratch/pegion/SMS_D_Lh5.C96.GFSv15p2.cheyenne_intel.try/bld/atm/obj/FV3/ccpp/physics/physics/gfdl_fv_sat_adj.F90:664
86:MPT: #7 0x0000000002d0b276 in fv_sat_adj_mp_fv_sat_adj_run_ ()
86:MPT: at /glade/scratch/pegion/SMS_D_Lh5.C96.GFSv15p2.cheyenne_intel.try/bld/atm/obj/FV3/ccpp/physics/physics/gfdl_fv_sat_adj.F90:330
86:MPT: #8 0x0000000002be87db in ccpp_fv3_gfs_v15p2_fast_physics_cap_mp_fv3_gfs_v15p2_fast_physics_run_cap_ ()
86:MPT: at /glade/scratch/pegion/SMS_D_Lh5.C96.GFSv15p2.cheyenne_intel.try/bld/atm/obj/FV3/ccpp/physics/ccpp_FV3_GFS_v15p2_fast_physics_cap.F90:106
86:MPT: #9 0x0000000002bdf0df in ccpp_static_api::ccpp_physics_run (cdata=...,
86:MPT: suite_name=..., group_name=..., ierr=0, .tmp.SUITE_NAME.len_V$97da=13,
86:MPT: .tmp.GROUP_NAME.len_V$97dd=12)
86:MPT: at /glade/scratch/pegion/SMS_D_Lh5.C96.GFSv15p2.cheyenne_intel.try/bld/atm/obj/FV3/ccpp/physics/ccpp_static_api.F90:143
86:MPT: #10 0x0000000000d4340a in fv_mapz_mod::lagrangian_to_eulerian (
If I had to guess I would say initial conditions. This is the first time the saturation adjustment is called as part of the dynamics, before doing any physics, i.e. right after reading the initial conditions. I am downloading the run dirs for my rt.sh run and your CIME run to my laptop to take a closer look at the diffs.
@climbfuji The job submission script is in the case directory: ./case.submit
If you want to see what the script will submit you would run ./preview_run
By default we will submit the chgres and then the model - if you only want to submit the model use ./case.submit --job case.test
@climbfuji please point me to your build log - I want to confirm that we are using the same flags to build ccpp.
/glade/work/heinzell/fv3/ufs-weather-model/ufs-weather-model-public-release-20200114/tests/log_cheyenne.intel/compile_2.log
is the log for the debug tests (compile step)
I did find a problem with the build and am working on it, but I don't think that it is related to this run failure and agree that there seems to be a problem with initial conditions.
It turns out that correcting the issue with build flags changed the error - it's making it past initialization now and crashing a little further into the run. The error is now in file moninedmf.f where the value of stress is < 0 in a couple of places:
TASKID FILE LINE VALUE INDEX
89: moninedmf.f 412 -2.213609288845146E+021 2
90: moninedmf.f 412 -4.427218577690292E+021 6
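As a hypothetical illustration of why a negative stress aborts a debug run (this Python sketch is not the actual moninedmf.f code): the friction velocity is derived from the surface stress as a square root, so a negative stress makes the operation undefined, and a build with floating-point-exception trapping enabled stops right there.

```python
import math

def friction_velocity(stress):
    """ustar = sqrt(stress); stress must be non-negative.

    A Fortran debug build with FPE trapping would raise SIGFPE on the
    sqrt of a negative; this Python sketch raises ValueError instead.
    """
    if stress < 0.0:
        raise ValueError("negative surface stress: %g" % stress)
    return math.sqrt(stress)

# A physically reasonable stress is fine:
assert friction_velocity(0.25) == 0.5
# A corrupted value like the -2.2e21 reported above is the failure mode
# the debug run exposes:
try:
    friction_velocity(-2.213609288845146e21)
except ValueError:
    pass
```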
I was able to run to completion by using the initial conditions in /glade/work/heinzell/fv3/debug_tests_for_cime_20200114/fv3_ccpp_gfs_v15p2_debug_prod/INPUT
This points to a problem in chgres or in the initial condition files themselves. I'm not sure where to go from here. @uturuncoglu @climbfuji
Dusan had the same error a couple months ago. It was traced to ice concentrations greater than 1.0 (such as 1.0000000000004) in the initial surface file from chgres. A fix was added. Can you merge the latest chgres from 'develop' to your branch?
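To sketch the kind of fix described above (this is an illustrative Python snippet, not the actual chgres code): horizontal interpolation of a bounded 0..1 field such as ice concentration can overshoot slightly (e.g. 1.0000000000004), and downstream physics that assumes the fraction stays within [0, 1] can then misbehave. Clamping after interpolation restores the physical bounds.

```python
def clamp_fraction(values, lo=0.0, hi=1.0):
    """Clamp an interpolated fractional field back into [lo, hi].

    Illustrative only: mimics the effect of the chgres fix mentioned
    above, where interpolated ice concentrations marginally above 1.0
    were clipped back to the physical range.
    """
    return [min(max(v, lo), hi) for v in values]

# Interpolation overshoot on both ends of the valid range:
interpolated = [0.0, 0.37, 1.0000000000004, -1.2e-16]
clamped = clamp_fraction(interpolated)
assert all(0.0 <= v <= 1.0 for v in clamped)
assert clamped[2] == 1.0 and clamped[3] == 0.0
```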
@arunchawla-NOAA I have opened issue https://github.com/NOAA-EMC/NCEPLIBS/issues/21 but I am not sure who to assign.
We received the following error running UFSATM with Intel 19/MPT on Cheyenne with physics GFSv15p2. We are running in debug mode and a SIGFPE was caught inside the physics (see stack trace below).
CIME test:
SMS_D.C96.GFSv15p2.cheyenne_intel
Modules:
Hash of UFS weather model: https://github.com/ufs-community/ufs-weather-model/commit/bde62f9116cc9bdebaae0c6057090fe468eae917
We can provide more information, as needed, on the initial conditions.
Stack trace: