ufs-community / ufs-mrweather-app

UFS Medium-Range Weather Application

UFSATM failure with Intel with debug flags #58

Closed rsdunlapiv closed 4 years ago

rsdunlapiv commented 4 years ago

We received the following error running UFSATM with Intel 19/MPT on Cheyenne with physics GFSv15p2. We are running in debug mode and a SIGFPE was caught inside the physics (see stack trace below).

CIME test: SMS_D.C96.GFSv15p2.cheyenne_intel

Modules:

module load ncarenv/1.2 intel/19.0.2 esmf_libs mkl
module use /glade/work/turuncu/PROGS/modulefiles/esmfpkgs/intel/19.0.2
module load esmf-8.0.0-ncdfio-mpt-g mpt/2.19 netcdf/4.7.1 pnetcdf/1.11.1 ncarcompilers/0.5.0

Hash of UFS weather model: https://github.com/ufs-community/ufs-weather-model/commit/bde62f9116cc9bdebaae0c6057090fe468eae917

We can provide more information, as needed, on the initial conditions.

Stack trace:

37:MPT: #1  0x00002ad4deaafdb6 in mpi_sgi_system (
37:MPT: #2  MPI_SGI_stacktraceback (
37:MPT:     header=header@entry=0x7ffcd7c52b00 "MPT ERROR: Rank 37(g:37) received signal SIGFPE(8).\n\tProcess ID: 40083, Host: r2i3n29, Program: /glade/scratch/jedwards/SMS_D.C96.GFSv15p2.cheyenne_intel.snoop/bld/ufs.exe\n\tMPT Version: HPE MPT 2.19  "...) at sig.c:340
37:MPT: #3  0x00002ad4deaaffb2 in first_arriver_handler (signo=signo@entry=8, 
37:MPT:     stack_trace_sem=stack_trace_sem@entry=0x2ad4eb6a0080) at sig.c:489
37:MPT: #4  0x00002ad4deab034b in slave_sig_handler (signo=8, siginfo=<optimized out>, 
37:MPT:     extra=<optimized out>) at sig.c:564
37:MPT: #5  <signal handler called>
37:MPT: #6  0x0000000002d0e761 in fv_sat_adj_mp_fv_sat_adj_work_ ()
37:MPT:     at /glade/scratch/jedwards/SMS_D.C96.GFSv15p2.cheyenne_intel.snoop/bld/atm/obj/FV3/ccpp/physics/physics/gfdl_fv_sat_adj.F90:664
37:MPT: #7  0x0000000002d0b276 in fv_sat_adj_mp_fv_sat_adj_run_ ()
37:MPT:     at /glade/scratch/jedwards/SMS_D.C96.GFSv15p2.cheyenne_intel.snoop/bld/atm/obj/FV3/ccpp/physics/physics/gfdl_fv_sat_adj.F90:330
37:MPT: #8  0x0000000002be87db in ccpp_fv3_gfs_v15p2_fast_physics_cap_mp_fv3_gfs_v15p2_fast_physics_run_cap_ ()
37:MPT:     at /glade/scratch/jedwards/SMS_D.C96.GFSv15p2.cheyenne_intel.snoop/bld/atm/obj/FV3/ccpp/physics/ccpp_FV3_GFS_v15p2_fast_physics_cap.F90:106
37:MPT: #9  0x0000000002bdf0df in ccpp_static_api::ccpp_physics_run (cdata=..., 
37:MPT:     suite_name=..., group_name=..., ierr=0, .tmp.SUITE_NAME.len_V$97da=13, 
37:MPT:     .tmp.GROUP_NAME.len_V$97dd=12)
37:MPT:     at /glade/scratch/jedwards/SMS_D.C96.GFSv15p2.cheyenne_intel.snoop/bld/atm/obj/FV3/ccpp/physics/ccpp_static_api.F90:143
37:MPT: #10 0x0000000000d4340a in fv_mapz_mod::lagrangian_to_eulerian (
pjpegion commented 4 years ago

@rsdunlapiv I believe that is just the place where an un-physical temperature triggers the array index to go out of bounds (it is associated with the calculation of the saturation specific humidity.)
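To illustrate the failure mode described above, here is a minimal hypothetical sketch (not the actual gfdl_fv_sat_adj.F90 code; the table bounds, resolution, and values are invented): saturation quantities are commonly precomputed in a temperature-indexed lookup table, so an unphysical temperature produces an out-of-range index.

```python
# Hypothetical saturation lookup table indexed by temperature (Kelvin).
TABLE_TMIN = 160.0   # assumed lower bound of the table
TABLE_DT = 0.1       # assumed table resolution
SAT_TABLE = [0.01 * i for i in range(2621)]  # placeholder values, 160-422.1 K

def qsat_from_table(temp_k):
    """Look up a (fake) saturation value; fails for unphysical temperatures."""
    # An unphysical temp_k makes idx negative or far past the table end.
    idx = int((temp_k - TABLE_TMIN) / TABLE_DT)
    if idx < 0 or idx >= len(SAT_TABLE):
        raise IndexError("temperature %g K is outside the table range" % temp_k)
    return SAT_TABLE[idx]
```

In a DEBUG build with bounds checking and floating-point traps enabled, the bad index arithmetic (or the out-of-bounds access) is what surfaces as the SIGFPE in the stack trace above.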

climbfuji commented 4 years ago

Dusan merged an update to the ufs_public_release branch earlier today. Part of the update was to address regression test failures in debug mode and to enable those tests, for both v15p2 and v16beta, as standard regression tests. These tests passed on Cheyenne with Intel and GNU, and on Hera with Intel; they are based on the C96 configurations. See https://github.com/ufs-community/ufs-weather-model/pull/25 and https://github.com/ufs-community/ufs-weather-model/blob/ufs_public_release/tests/rt.conf.

Many questions: What resolution? What setup (namelist etc.)? What initial conditions? Can you point me to the run directory on Cheyenne, please?

jedwards4b commented 4 years ago

Some of these questions are answered in the CIME test name, SMS_D.C96.GFSv15p2.cheyenne_intel:

- Resolution: C96
- CCPP suite: v15p2
- Machine: cheyenne
- Compiler: intel

Initial conditions are 2019-09-09 00.
Case directory: /glade/scratch/jedwards/SMS_D.C96.GFSv15p2.cheyenne_intel.G.grp
Run directory: /glade/scratch/jedwards/SMS_D.C96.GFSv15p2.cheyenne_intel.G.grp/run

jedwards4b commented 4 years ago

I am using commit bde62f9116cc9bdebaae0c6057090fe468eae917 (Author: Dom Heinzeller climbfuji@ymail.com, Date: Mon Jan 13 06:46:37 2020 -0700), which is the latest available.

jedwards4b commented 4 years ago

I get a similar error on Stampede using v16beta at C96 resolution:

forrtl: error (65): floating invalid
Image              PC                Routine            Line     Source
ufs.exe            0000000004DC4E6F  Unknown            Unknown  Unknown
libpthread-2.17.s  00002AD70FCA05D0  Unknown            Unknown  Unknown
ufs.exe            0000000002C9304B  Unknown            Unknown  Unknown
ufs.exe            0000000002B48CA1  Unknown            Unknown  Unknown
ufs.exe            0000000002AA9A85  Unknown            Unknown  Unknown
ufs.exe            0000000002A9A8D3  ccpp_static_api_m  147      ccpp_static_api.F90
ufs.exe            0000000002AA02B9  ccpp_driver_mp_cc  234      CCPP_driver.F90
ufs.exe            000000000063684F  atmos_model_mod_m  338      atmos_model.F90
ufs.exe            000000000062A0DE  module_fcstgrid    707      module_fcst_grid_comp.F90

jedwards4b commented 4 years ago

On Stampede using v15p2 at C96:

floating divide by zero
Image              PC                Routine            Line     Source
ufs.exe            0000000004DBBC8F  Unknown            Unknown  Unknown
libpthread-2.17.s  00002AFB84DDD5D0  Unknown            Unknown  Unknown
ufs.exe            0000000002CCDE5E  Unknown            Unknown  Unknown
ufs.exe            0000000002C22E39  Unknown            Unknown  Unknown
ufs.exe            0000000002B74C17  Unknown            Unknown  Unknown
ufs.exe            0000000002AA3D13  Unknown            Unknown  Unknown
ufs.exe            0000000002A9A72F  ccpp_static_api_m  145      ccpp_static_api.F90
ufs.exe            0000000002A9F502  ccpp_driver_mp_cc  197      CCPP_driver.F90
ufs.exe            0000000000634AB0  atmos_model_mod_m  295      atmos_model.F90
ufs.exe            000000000062A0DE  module_fcstgrid    707      module_fcst_grid_comp.F90

climbfuji commented 4 years ago

I am at AMS this week and don't have much time to look into this. The easiest way forward imo is to compare the run directory (input files, namelist, ...) of your CIME setup to the ufs-weather-model regression test setup (which uses rt.sh to run and which completes successfully) on Cheyenne using the Intel compiler. I can point you to a directory containing a successful run if that helps.

jedwards4b commented 4 years ago

Please point me to a successful run with debug flags enabled and I will compare.

climbfuji commented 4 years ago

Jim, see

/glade/work/heinzell/fv3/debug_tests_for_cime_20200114/fv3_ccpp_gfs_v15p2_debug_prod/
/glade/work/heinzell/fv3/debug_tests_for_cime_20200114/fv3_ccpp_gfs_v16beta_debug_prod/

These are C96 test cases as in your CIME setup, and both run to completion for a 6h forecast when the model is compiled with DEBUG=Y.
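As a side note, a quick way to carry out this comparison is a recursive diff between the two run directories. A hypothetical helper (the `compare_runs` name is invented; `input.nml` is the usual FV3 namelist filename, adjust if the layout differs):

```shell
# List which files differ between a known-good run directory and a failing
# one, then show the namelist diff in full.
compare_runs() {
    good_dir=$1
    my_dir=$2
    # -r: recurse; -q: report only *which* files differ, since the
    # input files are binary NetCDF. diff exits nonzero when files
    # differ, so "|| true" keeps the function going.
    diff -rq "$good_dir" "$my_dir" || true
    # The namelist is usually where configuration diverges.
    if [ -f "$good_dir/input.nml" ] && [ -f "$my_dir/input.nml" ]; then
        diff "$good_dir/input.nml" "$my_dir/input.nml" || true
    fi
}
```

For example: compare_runs /glade/work/heinzell/fv3/debug_tests_for_cime_20200114/fv3_ccpp_gfs_v15p2_debug_prod /glade/scratch/jedwards/SMS_D.C96.GFSv15p2.cheyenne_intel.G.grp/run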

jedwards4b commented 4 years ago

@pjpegion I am ready to enlist your help. Instructions for running the tests on cheyenne are here: https://docs.google.com/document/d/13nvpIS_q87ttjjHwB9f8OFXX7YI00DM5O9V-gk04yAY/edit?usp=sharing

ligiabernardet commented 4 years ago

@llpcarson @julieschramm Let's run this test on Cheyenne and use it to review (and update, if needed) the Weather Model User's Guide sections on the directory structure and the lists of input/output files. Keep in mind that this run uses CIME; the WM UG should be relevant both to those using CIME and to those running the model in other ways.

climbfuji commented 4 years ago

Did the comparison with the run directories that I gave @jedwards4b lead to any insight? I am not sure it makes sense to have more people try to run and debug this unless we understand why the regression tests in the ufs-weather-model run to completion in DEBUG mode while the CIME runs don't.

jedwards4b commented 4 years ago

I found a couple of differences that I didn't understand and tried changing my values to yours - it didn't make any difference. It could just be due to the different initial conditions. Or it could be due to different build flags - but I didn't see any build output in the directory you pointed me to.

I think that it does make sense to have @pjpegion and @ligiabernardet and others become familiar with cime build and testing even if it doesn't lead to any insights regarding the test failure.

climbfuji commented 4 years ago

Ok, thanks for the info. I am happy to take a look as well.


pjpegion commented 4 years ago

@jedwards4b I'm following your instructions and I ran into two problems so far.
1. I had to add --project to the ./create_test command line.
2. Now I get an error:

Case dir: /glade/scratch/pegion/SMS_D_Lh5.C96.GFSv15p2.cheyenne_intel.try
Errors were: Building test for SMS in directory /glade/scratch/pegion/SMS_D_Lh5.C96.GFSv15p2.cheyenne_intel.try
ERROR: /glade/work/pegion/UFS/ufs-mrweather-app/src/model/FV3/cime/cime_config/buildnml /glade/scratch/pegion/SMS_D_Lh5.C96.GFSv15p2.cheyenne_intel.try FAILED, see above

and there is more info in /glade/scratch/pegion/SMS_D_Lh5.C96.GFSv15p2.cheyenne_intel.try/TestStatus.log

jedwards4b commented 4 years ago

@pjpegion This looks like a python version issue - what python are you using?

jedwards4b commented 4 years ago

I think that @uturuncoglu is using 2.7.13 and hasn't tested with Python 3 yet. I will fix it, but in the meantime could you please try with the default python on Cheyenne?

pjpegion commented 4 years ago

I see my python is defaulting to /glade/u/home/pegion/miniconda3/bin/python. I will change that and try again. Thanks.

climbfuji commented 4 years ago

This release is only compatible with Python 2.7.x (also because CCPP works only with those versions).

arunchawla-NOAA commented 4 years ago

Python 2.7 is getting deprecated this year. Does it make sense to limit to an unsupported version of Python? We are moving to Python 3 everywhere.



climbfuji commented 4 years ago

For the next anticipated release of the UFS (with SAR etc.) later this year we will hopefully be able to support Python 3. For this release I don't see a way to rewrite the code to work with Python 3; what is more, it will probably take years for Python 2.7 to completely disappear from HPCs and standard OS installations.

jedwards4b commented 4 years ago

CIME is fully compatible with, and tested against, both Python 2.7 and Python 3.6. The fv3_interface issue should be easy to fix.
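For readers wondering what such a fix typically involves, the common Python 2/3 incompatibilities are small and mechanical. An illustrative sketch (not the actual fv3_interface code; the function and its inputs are invented) that runs unchanged under both interpreters:

```python
# Illustrative 2/3-portable style, of the kind a buildnml-type script needs.
# Harmless no-ops under Python 3; they backport 3.x semantics to Python 2.
from __future__ import print_function, division

def summarize(settings):
    """Print sorted key = value lines and return half the entry count."""
    # dict.items() works in both 2 and 3 (iteritems() is Python-2-only).
    lines = ["%s = %s" % (k, v) for k, v in sorted(settings.items())]
    # print() as a function works in both once print_function is imported.
    print("\n".join(lines))
    # "//" is integer division in both; bare "/" changed meaning in Python 3.
    return len(settings) // 2
```

The `from __future__` imports are the usual low-effort route when a script must keep supporting Python 2.7 while becoming Python-3-clean.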

pjpegion commented 4 years ago

@jedwards4b The model now builds and the run starts, but the model crashes in initialization. The log file is /glade/scratch/pegion/SMS_D_Lh5.C96.GFSv15p2.cheyenne_intel.try/run/ufs.log.238590.chadmin1.ib0.cheyenne.ucar.edu.200115-095701

climbfuji commented 4 years ago

Can someone point me to the job submission script for this job, please? Thanks ...

pjpegion commented 4 years ago

@climbfuji I am running out of /glade/work/pegion/UFS/ufs-mrweather-app/cime/scripts. The command is:

./create_test SMS_D_Lh5.C96.GFSv15p2 --workflow ufs-mrweather_wo_post --test-id try --project P93300042

climbfuji commented 4 years ago

Thanks, but I don't know how to find the actual job submission script (the file that contains the #PBS configuration entries and the mpiexec_mpt calls) from there. Maybe the CIME folks can help? We should always write/copy this job submission script into the run directory under a filename like job_card, because many developers who are used to rerunning some of this manually will want it. It is also good for documentation purposes, in my opinion.
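For reference, the kind of job_card file being requested is roughly the following. This is a hypothetical sketch: the job name, queue, walltime, rank count, and node layout are assumptions; only the project code is taken from the create_test command used in this thread.

```shell
#!/bin/bash
#PBS -N SMS_D_Lh5.C96.GFSv15p2          # job name (assumed)
#PBS -A P93300042                       # project code from the create_test command
#PBS -q regular                         # queue (assumed)
#PBS -l walltime=00:30:00               # walltime (assumed)
#PBS -l select=4:ncpus=36:mpiprocs=36   # node layout (assumed)

# Run from the directory the job was submitted from.
cd $PBS_O_WORKDIR

# Launch the model with HPE MPT, as on Cheyenne (rank count assumed).
mpiexec_mpt -np 144 ./ufs.exe
```

Having such a file in the run directory makes manual resubmission a plain `qsub job_card`.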

jedwards4b commented 4 years ago

@pjpegion Now we are in the same place - I am trying to understand and fix this failure.

86:MPT: #6  0x0000000002d0e761 in fv_sat_adj_mp_fv_sat_adj_work_ ()                                                                         
86:MPT:     at /glade/scratch/pegion/SMS_D_Lh5.C96.GFSv15p2.cheyenne_intel.try/bld/atm/obj/FV3/ccpp/physics/physics/gfdl_fv_sat_adj.F90:664 
86:MPT: #7  0x0000000002d0b276 in fv_sat_adj_mp_fv_sat_adj_run_ ()                                                                          
86:MPT:     at /glade/scratch/pegion/SMS_D_Lh5.C96.GFSv15p2.cheyenne_intel.try/bld/atm/obj/FV3/ccpp/physics/physics/gfdl_fv_sat_adj.F90:330 
86:MPT: #8  0x0000000002be87db in ccpp_fv3_gfs_v15p2_fast_physics_cap_mp_fv3_gfs_v15p2_fast_physics_run_cap_ ()                             
86:MPT:     at /glade/scratch/pegion/SMS_D_Lh5.C96.GFSv15p2.cheyenne_intel.try/bld/atm/obj/FV3/ccpp/physics/ccpp_FV3_GFS_v15p2_fast_physics_cap.F90:106
86:MPT: #9  0x0000000002bdf0df in ccpp_static_api::ccpp_physics_run (cdata=...,                                                             
86:MPT:     suite_name=..., group_name=..., ierr=0, .tmp.SUITE_NAME.len_V$97da=13,                                                          
86:MPT:     .tmp.GROUP_NAME.len_V$97dd=12)                                                                                                  
86:MPT:     at /glade/scratch/pegion/SMS_D_Lh5.C96.GFSv15p2.cheyenne_intel.try/bld/atm/obj/FV3/ccpp/physics/ccpp_static_api.F90:143         
86:MPT: #10 0x0000000000d4340a in fv_mapz_mod::lagrangian_to_eulerian (                                                                  
climbfuji commented 4 years ago

If I had to guess, I would say initial conditions. This is the first call to the saturation adjustment as part of the dynamics, before any physics is done, i.e. right after reading the initial conditions. I am downloading the run dirs for my rt.sh run and your CIME run to my laptop to take a closer look at the diffs.

jedwards4b commented 4 years ago

@climbfuji The job submission is handled by ./case.submit in the case directory.

If you want to see what the script will submit, run ./preview_run.

By default we submit the chgres job and then the model; if you only want to submit the model, use ./case.submit --job case.test

jedwards4b commented 4 years ago

@climbfuji please point me to your build log - I want to confirm that we are using the same flags to build ccpp.

climbfuji commented 4 years ago

/glade/work/heinzell/fv3/ufs-weather-model/ufs-weather-model-public-release-20200114/tests/log_cheyenne.intel/compile_2.log

is the log for the debug tests (compile step)

jedwards4b commented 4 years ago

I did find a problem with the build and am working on it, but I don't think that it is related to this run failure and agree that there seems to be a problem with initial conditions.

jedwards4b commented 4 years ago

It turns out that correcting the issue with build flags changed the error: the model now makes it past initialization and crashes a little further into the run. The error is now in file moninedmf.f, where the value of stress is < 0 in a couple of places:

TASKID  FILE         LINE  VALUE                    INDEX
89:     moninedmf.f  412   -2.213609288845146E+021  2
90:     moninedmf.f  412   -4.427218577690292E+021  6

jedwards4b commented 4 years ago

I was able to run to completion by using the initial conditions in /glade/work/heinzell/fv3/debug_tests_for_cime_20200114/fv3_ccpp_gfs_v15p2_debug_prod/INPUT

This points to a problem in chgres or in the initial condition files themselves. I'm not sure where to go from here. @uturuncoglu @climbfuji

GeorgeGayno-NOAA commented 4 years ago

> It turns out that correcting the issue with build flags changed the error - it's making it past initialization now and crashing a little further into the run. The error is now in file moninedmf.f where the value of stress is < 0 in a couple of places:
> TASKID FILE LINE VALUE INDEX
> 89: moninedmf.f 412 -2.213609288845146E+021 2
> 90: moninedmf.f 412 -4.427218577690292E+021 6

Dusan had the same error a couple months ago. It was traced to ice concentrations greater than 1.0 (such as 1.0000000000004) in the initial surface file from chgres. A fix was added. Can you merge the latest chgres from 'develop' to your branch?
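A hypothetical sketch of the kind of fix George describes (the actual change lives in chgres; the function name and tolerance here are invented): interpolation can overshoot slightly past the physical range, so the interpolated sea-ice concentration is clamped back into [0, 1] before being written out.

```python
def clamp_ice_concentration(values, eps=1.0e-6):
    """Clamp interpolated sea-ice concentrations to [0, 1].

    Overshoots within eps of the physical range (e.g. 1.0000000000004)
    are treated as interpolation roundoff and clamped; anything larger
    indicates a real data problem and is rejected.
    """
    clamped = []
    for v in values:
        if v > 1.0 + eps or v < -eps:
            raise ValueError("ice concentration %g is not mere roundoff" % v)
        # min/max pins roundoff-level overshoots to the physical bounds.
        clamped.append(min(max(v, 0.0), 1.0))
    return clamped
```

Without such a clamp, a value like 1.0000000000004 can drive downstream surface-stress calculations negative, consistent with the moninedmf.f failure above.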

jedwards4b commented 4 years ago

@arunchawla-NOAA I have opened issue https://github.com/NOAA-EMC/NCEPLIBS/issues/21 but I am not sure who to assign.