ufs-community / ufs-mrweather-app

UFS Medium-Range Weather Application

chgres on cheyenne for grib input hanging at C384 and C768 #92

Closed jedwards4b closed 4 years ago

jedwards4b commented 4 years ago

This is happening with the Dorian input data (2019-08-29). I have also tried another date (2020-01-02) with the same result. I am using the chgres built with

module use /glade/p/ral/jntp/GMTB/tools/modulefiles/intel-19.0.5/mpt-2.19
module load NCEPlibs/1.0.0beta02

and with

module use /glade/p/ral/jntp/GMTB/tools/modulefiles/gnu-8.3.0/mpt-2.19
module load NCEPlibs/1.0.0beta02

GeorgeGayno-NOAA commented 4 years ago

According to the log file, where is it hanging?

uturuncoglu commented 4 years ago

On Stampede2, C384 is working but C768 is failing with the following trace:

CALL FieldCreate FOR TARGET TERRAIN.
CALL FieldCreate FOR TARGET SURFACE PRESSURE BEFORE ADJUSTMENT.
CALL FieldRegridStore FOR ATMOSPHERIC FIELDS.
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line     Source
chgres_cube.exe    00000000008EA5CD  Unknown            Unknown  Unknown
libpthread-2.17.s  00002B56E72835D0  Unknown            Unknown  Unknown
libmpifort.so.12.  00002B56E6093C54  mpi_abort          Unknown  Unknown
chgres_cube.exe    00000000005CB526  Unknown            Unknown  Unknown
chgres_cube.exe    0000000000417A58  Unknown            Unknown  Unknown
chgres_cube.exe    0000000000439B6E  Unknown            Unknown  Unknown
chgres_cube.exe    0000000000414F1E  Unknown            Unknown  Unknown
libc-2.17.so       00002B56E788A3D5  __libc_start_main  Unknown  Unknown
chgres_cube.exe    0000000000414E06  Unknown            Unknown  Unknown
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line     Source
chgres_cube.exe    00000000008EA5CD  Unknown            Unknown  Unknown
libpthread-2.17.s  00002B9D822EC5D0  Unknown            Unknown  Unknown
libmpifort.so.12.  00002B9D810FCC54  mpi_abort          Unknown  Unknown
chgres_cube.exe    00000000005CB526  Unknown            Unknown  Unknown
chgres_cube.exe    0000000000417A58  Unknown            Unknown  Unknown
chgres_cube.exe    0000000000439B6E  Unknown            Unknown  Unknown
chgres_cube.exe    0000000000414F1E  Unknown            Unknown  Unknown
libc-2.17.so       00002B9D828F33D5  __libc_start_main  Unknown  Unknown
chgres_cube.exe    0000000000414E06  Unknown            Unknown  Unknown
longjmp causes uninitialized stack frame : /scratch/01118/tg803972/SMS_Lh3.C768.GFSv15p2.stampede2-skx_intel.G.20200220_151254_03nf9g/bld/chgres_cube.exe terminated
longjmp causes uninitialized stack frame : /scratch/01118/tg803972/SMS_Lh3.C768.GFSv15p2.stampede2-skx_intel.G.20200220_151254_03nf9g/bld/chgres_cube.exe terminated
======= Backtrace: =========
/lib64/libc.so.6(__fortify_fail+0x37)[0x2ac9250769e7]
/lib64/libc.so.6(+0x1178fd)[0x2ac9250768fd]
/lib64/libc.so.6(__longjmp_chk+0x29)[0x2ac925076859]
/scratch/01118/tg803972/SMS_Lh3.C768.GFSv15p2.stampede2-skx_intel.G.20200220_151254_03nf9g/bld/chgres_cube.exe[0x9df0e2]
longjmp causes uninitialized stack frame : /scratch/01118/tg803972/SMS_Lh3.C768.GFSv15p2.stampede2-skx_intel.G.20200220_151254_03nf9g/bld/chgres_cube.exe terminated

GeorgeGayno-NOAA commented 4 years ago

It may be running out of memory. Can you run across more nodes?

climbfuji commented 4 years ago

Can you send me the location of your run directory on cheyenne, please? I can take a look. At first glance it seems this isn't an issue with the chgres build (because C384 works), but I'll have to look closer. Thanks!

uturuncoglu commented 4 years ago

It is failing on Cheyenne as follows:

DEFINE INPUT GRID OBJECT FOR INPUT GRIB2 DATA.
OPEN AND INVENTORY GRIB2 FILE: /glade/p/cesmdata/cseg/ufs_inputdata/icfiles/201908/20190829/gfs_4_20190829_0000_000.grb2
DEFINE INPUT GRID OBJECT FOR INPUT GRIB2 DATA.
OPEN AND INVENTORY GRIB2 FILE: /glade/p/cesmdata/cseg/ufs_inputdata/icfiles/201908/20190829/gfs_4_20190829_0000_000.grb2
FATAL ERROR: READING GRIB2 FILE IOSTAT IS: 0
MPT ERROR: Rank 180(g:180) received signal SIGSEGV(11).
Process ID: 15583, Host: r11i1n28, Program: /glade/p/ral/jntp/GMTB/tools/NCEPLIBS-ufs-v1.0.0.beta02/intel-19.0.5/mpt-2.19/bin/chgres_cube.exe
MPT Version: HPE MPT 2.19  02/23/19 05:30:09

MPT: --------stack traceback-------

FATAL ERROR: READING GRIB2 FILE IOSTAT IS: 0
FATAL ERROR: READING GRIB2 FILE IOSTAT IS: 0
FATAL ERROR: READING GRIB2 FILE IOSTAT IS: 0
FATAL ERROR: READING GRIB2 FILE IOSTAT IS: 0
FATAL ERROR: READING GRIB2 FILE IOSTAT IS: 0
FATAL ERROR: READING GRIB2 FILE

We tried to run with more nodes, but it does not help. We were also able to process the same file without any problem at a lower resolution.

uturuncoglu commented 4 years ago

@climbfuji

on Stampede2: /scratch/01118/tg803972/SMS_Lh3.C768.GFSv15p2.stampede2-skx_intel.G.20200220_151254_03nf9g

on Cheyenne: /glade/scratch/jedwards/SMS_Lh3.C768.GFSv15p2.cheyenne_intel.GC.20200220_160700_k8hb8w/run

climbfuji commented 4 years ago

The following works just fine on Cheyenne:

rsync -av /glade/scratch/jedwards/SMS_Lh3.C768.GFSv15p2.cheyenne_intel.GC.20200220_160700_k8hb8w/run/ /glade/scratch/heinzell/chgres_crash_c768_jim/

cd /glade/scratch/heinzell/chgres_crash_c768_jim
# Launch interactive job on large-memory node (I did not test regular nodes).
qsub -X -I -l select=1:ncpus=36:mpiprocs=36:mem=109GB -l walltime=01:00:00 -q premium -A P48503002
# once the interactive session has started:
module purge
module load ncarenv/1.3
module load intel/19.0.5
module load ncarcompilers/0.5.0
module load netcdf/4.7.3
module load mpt/2.19
module load cmake/3.16.4

module use -a /glade/p/ral/jntp/GMTB/tools/modulefiles/intel-19.0.5/mpt-2.19
module load  NCEPlibs/1.0.0beta02

mpiexec_mpt -np 36 /glade/p/ral/jntp/GMTB/tools/NCEPLIBS-ufs-v1.0.0.beta02/intel-19.0.5/mpt-2.19/bin/chgres_cube.exe

Please have a look in my directory /glade/scratch/heinzell/chgres_crash_c768_jim/; you'll find the output files there.

climbfuji commented 4 years ago

Update: the same setup crashes when using a standard node with 64GB memory (45GB usable). The large-memory nodes have 128GB of memory (109GB usable). See https://www2.cisl.ucar.edu/resources/computational-systems/cheyenne/running-jobs/submitting-jobs-pbs

jedwards4b commented 4 years ago

I can confirm that running chgres on a large-memory node works, but I also checked that it doesn't work on a regular-memory node. I'm looking into how we can submit successfully without having to request large-memory nodes (trying 18 tasks per node instead of 36).
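
For reference, a minimal sketch of that reduced-density layout, reusing the PBS select syntax shown earlier in this thread; the queue and project code are placeholders:

# two standard nodes, 18 MPI ranks each (36 total), halving per-node memory pressure
qsub -I -l select=2:ncpus=36:mpiprocs=18 -l walltime=00:30:00 -q premium -A <project>
mpiexec_mpt -np 36 ./chgres_cube.exe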

jedwards4b commented 4 years ago

This seems to be a non-scalable memory issue - increasing the number of nodes and decreasing the number of tasks per node does not appear to help. Meanwhile I have confirmed that when running with nemsio data and the same configuration we don't have this issue.

climbfuji commented 4 years ago

The bigmem nodes are usually available fairly quickly, especially if you choose the premium queue. Since the job runs only for two minutes or so, these additional expenses should be ok.

llpcarson commented 4 years ago

A related issue: when the chgres task hung and eventually timed out, the forecast task was deleted, but the "post" task is still in the queue with a Hold status. Is there any way to also delete this job from the batch queue?

Laurie

jedwards4b commented 4 years ago

@llpcarson this seems to be system dependent: some systems will delete the job right away; on others it hangs around for a while. But this has not caused any problems otherwise.
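
If the held job needs to be cleared manually, a minimal sketch, assuming PBS on cheyenne and a placeholder job ID:

qstat -u $USER   # find the ID of the held "post" job (state H)
qdel <jobid>     # remove it from the queue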

@climbfuji We can do this for cheyenne, but it seems like a potential problem for the community.

jedwards4b commented 4 years ago

@climbfuji - even when I use high-memory nodes, chgres hangs if I use more than a single node.

climbfuji commented 4 years ago

This is outside my territory; I hope the chgres developers can say more about this.

arunchawla-NOAA commented 4 years ago

@GeorgeGayno-NOAA any idea why this is happening?

GeorgeGayno-NOAA commented 4 years ago

It is failing on Cheyenne as follows:

DEFINE INPUT GRID OBJECT FOR INPUT GRIB2 DATA.
OPEN AND INVENTORY GRIB2 FILE: /glade/p/cesmdata/cseg/ufs_inputdata/icfiles/201908/20190829/gfs_4_20190829_0000_000.grb2
DEFINE INPUT GRID OBJECT FOR INPUT GRIB2 DATA.
OPEN AND INVENTORY GRIB2 FILE: /glade/p/cesmdata/cseg/ufs_inputdata/icfiles/201908/20190829/gfs_4_20190829_0000_000.grb2
FATAL ERROR: READING GRIB2 FILE IOSTAT IS: 0
MPT ERROR: Rank 180(g:180) received signal SIGSEGV(11).
Process ID: 15583, Host: r11i1n28, Program: /glade/p/ral/jntp/GMTB/tools/NCEPLIBS-ufs-v1.0.0.beta02/intel-19.0.5/mpt-2.19/bin/chgres_cube.exe
MPT Version: HPE MPT 2.19  02/23/19 05:30:09

MPT: --------stack traceback-------

FATAL ERROR: READING GRIB2 FILE IOSTAT IS: 0
FATAL ERROR: READING GRIB2 FILE IOSTAT IS: 0
FATAL ERROR: READING GRIB2 FILE IOSTAT IS: 0
FATAL ERROR: READING GRIB2 FILE IOSTAT IS: 0
FATAL ERROR: READING GRIB2 FILE IOSTAT IS: 0
FATAL ERROR: READING GRIB2 FILE

We tried to run with more nodes, but it does not help. We were also able to process the same file without any problem at a lower resolution.

An error reading the grib2 file should not be related to the model resolution. And it should not be a memory issue. Was the grib2 file accidentally deleted?
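
One quick way to rule that out is to confirm the file is still present and yields a clean inventory; a minimal sketch, assuming the wgrib2 utility is available on the system:

ls -l /glade/p/cesmdata/cseg/ufs_inputdata/icfiles/201908/20190829/gfs_4_20190829_0000_000.grb2
# short inventory; errors here would indicate a corrupt or truncated file
wgrib2 -s /glade/p/cesmdata/cseg/ufs_inputdata/icfiles/201908/20190829/gfs_4_20190829_0000_000.grb2 | head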

jedwards4b commented 4 years ago

@GeorgeGayno-NOAA: no, the file is fine; we confirmed this a number of times. It's clearly a memory issue, because if you run on a single high-memory node it works, but on a single standard-memory node it does not and is killed by the system's memory monitor tool. What confuses me is that when you run across multiple nodes it does not work, regardless of whether those nodes are high-memory or not. In this case it is not killed by the memory monitor tool, but just hangs until the wallclock runs out.

llpcarson commented 4 years ago

Jim - Just FYI, my "post" job is still in the queue on cheyenne in a "H" state, after the chgres job timed out (and the dependent forecast job was appropriately killed). So, this second-level dependency doesn't seem to be handled when the first job crashes (or hangs and is killed).

Laurie

jedwards4b commented 4 years ago

@llpcarson it's a cheyenne "feature" - it has no adverse effect on subsequent usage of the system.

arunchawla-NOAA commented 4 years ago

@jedwards4b @mark-a-potts can you check if we can use the grib2 fields to run a high-resolution case on other platforms like Stampede or Hera? @GeorgeGayno-NOAA does not have access to Cheyenne, so we are a bit at a loss as to why this is happening.

jedwards4b commented 4 years ago

Ufuk and I are looking at stampede, but I think the issue may have something to do with the non-standard MPI library on cheyenne (MPT). I plan to look into that when cheyenne is returned to service.
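
A quick way to confirm which MPI library the chgres executable is actually linked against (a sketch, assuming a standard Linux toolchain):

ldd /glade/p/ral/jntp/GMTB/tools/NCEPLIBS-ufs-v1.0.0.beta02/intel-19.0.5/mpt-2.19/bin/chgres_cube.exe | grep -i mpi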

GeorgeGayno-NOAA commented 4 years ago

@jedwards4b @mark-a-potts can you check if we can use the grib2 fields to run a high-resolution case on other platforms like Stampede or Hera? @GeorgeGayno-NOAA does not have access to Cheyenne, so we are a bit at a loss as to why this is happening.

I tested chgres on Hera using the head of 'develop' (I don't know how to compile the public release branch on Hera). It worked for C768/L65 using both 0.25 and 0.5-degree grib2 data. I used two nodes, six tasks per node. If someone can compile the release branch for me, I can repeat my tests.

climbfuji commented 4 years ago

@GeorgeGayno-NOAA the public release version of chgres is here:

/scratch1/BMC/gmtb/software/NCEPLIBS-ufs-v1.0.0.beta03/intel-18.0.5.274/impi-2018.0.4/

These are the instructions to load/use:

module load intel/18.0.5.274
module load impi/2018.0.4
module load netcdf/4.7.0
module use -a /scratch1/BMC/gmtb/software/modulefiles/generic
module load cmake/3.16.3

module use -a /scratch1/BMC/gmtb/software/modulefiles/intel-18.0.5.274/impi-2018.0.4
module load NCEPlibs/1.0.0beta03

Thanks for testing this!

GeorgeGayno-NOAA commented 4 years ago

I was able to get your compiled public release chgres to run on Hera: C768/L65 using both 0.25- and 0.5-degree grib2 data, two nodes, six tasks per node. Here is the script I used: /scratch1/NCEPDEV/da/George.Gayno/ufs_utils.git/chgres.grib2/run.public.rlease.ksh. Here is the namelist I used: /scratch1/NCEPDEV/da/George.Gayno/ufs_utils.git/chgres.grib2/config.C768.nml.
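
For orientation, a minimal sketch of that two-node, six-tasks-per-node layout, assuming Slurm on Hera; the time limit and account are placeholders, and this is a sketch rather than the contents of the actual script above:

#!/bin/ksh
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=6
#SBATCH --time=00:30:00
#SBATCH --account=<project>
# 12 MPI ranks total (2 nodes x 6 tasks), matching the layout described above
srun -n 12 ./chgres_cube.exe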

jedwards4b commented 4 years ago

I think this may be a cheyenne-specific issue; perhaps we should table it until that system is back in service.

arunchawla-NOAA commented 4 years ago

@jedwards4b Did we try this again on Cheyenne now that it is back? Otherwise can we mark it as a known issue and close this ticket?

jedwards4b commented 4 years ago

Yes, I think that's what we need to do.

rsdunlapiv commented 4 years ago

@jedwards4b what do we need to do to test the C768 on Cheyenne?

lbnance commented 4 years ago

You might want to check with Phil: based on the test spreadsheet https://docs.google.com/spreadsheets/d/1eD5pnPf8g_-atyR7r2hlKyHSyhOC95M28qKU8Jls_-c/edit#gid=0 it looks like he has run some tests on cheyenne for this configuration.

jedwards4b commented 4 years ago

We are marking it as a known problem on cheyenne - if you want to use C768 you should either download nemsio data or manually set chgres to run on a high-memory node.
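
A minimal sketch of the manual high-memory request, mirroring the recipe earlier in this thread; assuming PBS on cheyenne, with a placeholder project code:

# single 128GB (109GB usable) large-memory node
qsub -I -l select=1:ncpus=36:mpiprocs=36:mem=109GB -l walltime=01:00:00 -q premium -A <project>
mpiexec_mpt -np 36 ./chgres_cube.exe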

pjpegion commented 4 years ago

@lbnance My successful run at C768 was on hera, not cheyenne.