ufs-community / ufs-mrweather-app

UFS Medium-Range Weather Application

Hanging on Cheyenne ... #190

Closed uturuncoglu closed 3 years ago

uturuncoglu commented 4 years ago

@climbfuji @ligiabernardet I am having trouble with the model on Cheyenne: it hangs while reading static input files such as global_shdmin.0.144x0.144.grb for resolutions > C96. This was also the case with the new buildlib, so I do not think it is related to the build. Have you ever experienced the same problem? This was also reported previously in https://github.com/ufs-community/ufs-mrweather-app/issues/184#issuecomment-688622866. Do we need to increase the resources used by the model? For example, C192 is hanging/failing without any particular error, and I am using the following configuration options,

ntiles = 6
layout = 4, 6
write_groups: 1
write_tasks_per_group: 36

and 180 processors in total.
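For reference, the 180-task total follows directly from those settings (a quick sanity check in shell, assuming one MPI task per PE and no threading):

# compute tasks: ntiles * layout_x * layout_y = 6 * 4 * 6 = 144
# write tasks:   write_groups * write_tasks_per_group = 1 * 36 = 36
echo $(( 6*4*6 + 1*36 ))   # prints 180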

ligiabernardet commented 4 years ago

@uturuncoglu I have not received any reports of the model hanging. @llpcarson Any insight wrt hanging on Cheyenne?

uturuncoglu commented 4 years ago

@ligiabernardet it is strange. I updated buildlib and I am waiting to resolve this issue. Let me know if you see a similar issue.

llpcarson commented 4 years ago

No, I haven't seen this lately on Cheyenne. One thing to check is the processor layout and the job node request. If these don't match, the model will sometimes hang (e.g., the model uses 48 tasks, but the job is submitted with 64).

Laurie
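To illustrate the point about matching the node request to the layout, a hypothetical PBS header for the 180-task configuration above on Cheyenne (36 cores per node); the queue, walltime, executable name, and launch line are illustrative only:

#PBS -l select=5:ncpus=36:mpiprocs=36   # 5 nodes x 36 MPI tasks = 180, matching the model layout
#PBS -l walltime=00:30:00
#PBS -q regular
mpiexec_mpt -np 180 ./ufs_model         # launching with more (or fewer) ranks than the layout expects can hang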

ligiabernardet commented 4 years ago

@ufuk We are waiting on a PR of the updated build so we can merge it onto the release/public-v1 branch and conduct tests.

uturuncoglu commented 4 years ago

@llpcarson those options are consistent. Anyway, I'll make a PR soon and you can test it. All of these strange things happen under my account; maybe something is wrong there. Let's see what you find in your tests.

uturuncoglu commented 4 years ago

@ligiabernardet I created a PR at the app level.

ligiabernardet commented 3 years ago

@uturuncoglu Does it hang all the time or occasionally?

uturuncoglu commented 3 years ago

@ligiabernardet in my recent tests all resolutions failed in the same way except the C96 ones.

ligiabernardet commented 3 years ago

@uturuncoglu Here is a suggestion from @climbfuji: are we using threading? If yes, can we test compiling without OpenMP, or, even easier, run with only one OpenMP thread, and see if this solves the problem?
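A minimal sketch of the run-time option (no recompile), assuming the OpenMP setting is exported in the job script before the model is launched:

export OMP_NUM_THREADS=1   # one OpenMP thread per MPI task; effectively turns threading off at run time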

ligiabernardet commented 3 years ago

@llpcarson is running some tests on Cheyenne. Laurie, let us know what you find out.

uturuncoglu commented 3 years ago

@climbfuji we are not using threading, at least for the following test

/glade/scratch/turuncu/SMS_Lh3.C192.GFSv15p2.cheyenne_intel.20200909_155451_1sg3p2

and it still hangs/fails when reading the file.

@ligiabernardet thanks. I hope I am the only one who has this issue.

llpcarson commented 3 years ago

Ufuk, Ligia -

I ran the default MRW case at C96, C384 and C768, and all 3 ran: grib2 input, threaded (4), 20190829.

I can try running the CIME reg-tests next (that's what that case is, correct?)

Laurie

uturuncoglu commented 3 years ago

@llpcarson That is great! Yes, if you can run the full test suite that would be great. Once you run it (if you run without specifying a compiler, such as --xml-compiler intel, it will run both the Intel and GNU tests), please let me know the directory and I can double-check the results. Thanks for your help.
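For reference, a sketch of the create_test invocation being described; the test category is a placeholder and the exact selection options may differ in the app's workflow:

cd cime/scripts
./create_test --xml-machine cheyenne --xml-category <category> --xml-compiler intel   # Intel tests only
./create_test --xml-machine cheyenne --xml-category <category>                        # both Intel and GNU test lists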

llpcarson commented 3 years ago

Partial results to report:

All of the C96, C192 and C384 jobs have completed successfully. Three of the C768 jobs crashed in chgres_cube (and so the forecast jobs were killed due to the dependency failure). Five of the C768 cases are still in the batch queue waiting to run (these ran chgres_cube successfully).

Will let you know when the C768 jobs start running...

uturuncoglu commented 3 years ago

@llpcarson I was having problems with C768 on Cheyenne as well. Is this on Cheyenne? Probably a couple of them will pass and a couple of them will fail. We might need to increase the allocated resources for C768 because it is not stable at this point. What do you think, @GeorgeGayno-NOAA?

GeorgeGayno-NOAA commented 3 years ago

@llpcarson I was having problems with C768 on Cheyenne as well. Is this on Cheyenne? Probably a couple of them will pass and a couple of them will fail. We might need to increase the allocated resources for C768 because it is not stable at this point. What do you think, @GeorgeGayno-NOAA?

Is the model hanging or chgres_cube? I am more familiar with the latter.

uturuncoglu commented 3 years ago

@GeorgeGayno-NOAA I think CHGRES is failing for C768. We are using 6 nodes with 6 cores per node, as you suggested. It runs in some cases and fails in others, so not every C768 run is failing.

llpcarson commented 3 years ago

On cheyenne:

Yes, chgres_cube is failing (seg-fault) for some of the C768 cases (but not all).

The model/forecast jobs are still waiting in the queue (the ones that had a successful chgres_cube)

Laurie

uturuncoglu commented 3 years ago

@llpcarson you could still check the run/INPUT folder of one of the cases to see the CHGRES-generated files. If they are there, the model will pick them up and run. I hope it won't hang. What about the other resolutions? Did you see any hanging issue with the model?
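A quick way to do that check (a sketch; the file names are those typically produced by chgres_cube for this workflow and may differ):

ls -l run/INPUT/
# if chgres_cube succeeded, expect gfs_ctrl.nc, gfs_data.tile1.nc ... gfs_data.tile6.nc
# and sfc_data.tile1.nc ... sfc_data.tile6.nc to be present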

GeorgeGayno-NOAA commented 3 years ago

On cheyenne: Yes, chgres_cube is failing (seg-fault) for some of the C768 cases (but not all). The model/forecast jobs are still waiting in the queue (the ones that had a successful chgres_cube).

Are the failures happening with certain input data, like grib2 or nemsio?

uturuncoglu commented 3 years ago

@GeorgeGayno-NOAA The default input type is GRIB2, and the test suite uses that one.

llpcarson commented 3 years ago

Yes, the chgres_cube run worked for the cases that are waiting to run the model, and failed for others; all C768. All of the other resolutions ran without issue (at least I think so!)

Rundir is: /glade/scratch/carson/ufs/*
App dir is: /glade/scratch/carson/ufs/mrw.test/ufs-mrweather-app

The logfile from chgres_cube shows:

uturuncoglu commented 3 years ago

@llpcarson I could not find /glade/scratch/carson/ufs/. Is that path correct? Yes, the error is strange: it shows that the file is missing or corrupted, but all the cases use the same file. Did you also run the GNU tests?

llpcarson commented 3 years ago

Yes, I ran both GNU and Intel. Each had failures and successes for chgres_cube. Here's one of the run dirs with a failure:

/glade/scratch/carson/ufs/SMS_Lh3_D.C768.GFSv15p2.cheyenne_gnu.G.20200911_091828_ou9in9/run

Does the _D part refer to a debug-mode compile? (just curious)

climbfuji commented 3 years ago

If it does, then this only applies to the model I guess, because chgres_cube is compiled as part of NCEPLIBS, which compiles in "production" mode.

I looked at your run directory:

cat chgres_cube.200911-105345.log ...

 - CALL FieldScatter FOR INPUT GRID LONGITUDE.
 - CALL FieldScatter FOR INPUT GRID LATITUDE.
0  0x2b4035126aff in ???
1  0x2b40357ae9bb in ???
0  0x2ab63cbdbaff in ???
1  0x2ab63d2639bb in ???
0  0x2ab63cbdbaff in ???
1  0x2ab63d2639bb in ???
0  0x2ab63cbdbaff in ???
1  0x2ab63d2639bb in ???
0  0x2b2d04fcdaff in ???
1  0x2b2d056559bb in ???
0  0x2b2d04fcdaff in ???
1  0x2b2d056559bb in ???
0  0x2b2d04fcdaff in ???
1  0x2b2d056559bb in ???
0  0x2b2d04fcdaff in ???
1  0x2b2d056559bb in ???
0  0x2b2d04fcdaff in ???
0  0x2b2d04fcdaff in ???
1  0x2b2d056559bb in ???
1  0x2b2d056559bb in ???
MPT ERROR: MPI_COMM_WORLD rank 21 has terminated without calling MPI_Finalize()
        aborting job
MPT: Received signal 11

I also checked PET21.ESMF_LogFile for the MPI rank that reported the crash (first), but there is no useful information in the file.

Let me compile chgres_cube manually with debugging flags on, then copy your run directory and run the preprocessing step manually.

llpcarson commented 3 years ago

I just re-ran the reg-test for the C768 cases only, and all 8 tests ran chgres_cube without error (the forecast/model runs are still in the queue). Very frustrating!

And, unfortunately, even with a failed run directory, a re-run (with a simple qsub script) completes without error.

Will check back later tonight to see if any of the model runs hang/finish :)

uturuncoglu commented 3 years ago

@llpcarson D is debug mode. I am not sure about the options that are changed, but I could check if you need.

uturuncoglu commented 3 years ago

Yes, I ran both GNU and Intel. Each had failures and successes for chgres_cube. Here's one of the run dirs with a failure: /glade/scratch/carson/ufs/SMS_Lh3_D.C768.GFSv15p2.cheyenne_gnu.G.20200911_091828_ou9in9/run

Does the _D part refer to a debug-mode compile? (just curious)

Yes, I checked your directory and it seems there is no build error, but all the C768 tests failed due to the failure in CHGRES.

uturuncoglu commented 3 years ago

@llpcarson yes, in some cases if you run it again, CHGRES processes without any problem. I am not sure, but it could be the node allocation on Cheyenne. It might be nice to check on the other platforms.

climbfuji commented 3 years ago

I don't know if this is a red herring, but there might be some offending entries in cime/config/ufs/machines/config_machines.xml that we shouldn't use (and never use for the UFS model anyway):

      <modules compiler="intel">
        <command name="load">intel/19.0.5</command>
        <command name="load">mkl</command>
      </modules>
      <modules compiler="gnu">
        <command name="load">gnu/8.3.0</command>
        <command name="load">openblas/0.3.6</command>
      </modules>

For Intel, we don't use mkl at all. Similarly, for GNU, we don't use blas.

I compiled chgres_cube.exe manually twice for GNU, using the existing NCEPLIBS, with the standard release/prod flags and with the debug flags. I then ran chgres_cube.exe manually on a copy of @llpcarson's failed GNU run dir

/glade/scratch/carson/ufs/SMS_Lh3_D.C768.GFSv15p2.cheyenne_gnu.G.20200911_091828_ou9in9/run

and it went through every single time, for both prod and debug. It also worked every time I tried for the existing chgres_cube.exe of the NCEPLIBS ufs-v1.1.0 installation. My test directory is

/glade/scratch/heinzell/SMS_Lh3_D.C768.GFSv15p2.cheyenne_gnu.G.20200911_091828_ou9in9/run

and the job submission script inside this directory is called job_card (i.e. I did qsub job_card).

Most importantly, I changed the number of nodes to 3 and used 12 tasks on each of them, i.e. I effectively halved the amount of memory available per task, and it worked fine every single time I tried. So I don't think it has to do with memory.

My environment has a stack size limit of 300000; I think this is the default, at least I don't remember changing it.

I also tried a version of chgres_cube.exe that was compiled with OpenMP and I used 2 OpenMP threads, which worked as well.

Running out of ideas right now, except for trying to remove mkl and blas.
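In PBS terms, the 3-node / 12-task configuration described above amounts to something like the following (a sketch; the actual job_card may differ):

#PBS -l select=3:ncpus=36:mpiprocs=12   # 3 nodes, 12 MPI tasks per node (36 tasks total)
mpiexec_mpt -np 36 ./chgres_cube.exe    # same 36-task total as before, but half the memory per task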

uturuncoglu commented 3 years ago

@climbfuji thanks for your help and all the tests. As @llpcarson mentioned before, the issue is not persistent; it works in some cases and fails in others. Anyway, I will remove the mkl and blas options from the machine file, run the full test suite on Cheyenne, and update you through this issue. @jedwards4b what do you think? Could mkl and blas cause the issue?

uturuncoglu commented 3 years ago

@climbfuji BTW, it seems that blas is not used by CIME for the GNU compiler; I could not find any entry in machines/config_compilers.xml. @jedwards4b can you confirm? I removed loading mkl and also updated machines/config_compilers.xml for Intel, commenting out all mkl-related entries. Testing now.

uturuncoglu commented 3 years ago

@climbfuji BTW, how can loading those modules affect chgres_cube? We are using the pre-installed NCEPLIBS. Do you think loading mkl could have a run-time effect on the pre-installed chgres_cube?

uturuncoglu commented 3 years ago

@climbfuji I ran the full test suite with the updated code (no mkl) and the model still hangs. The following is one example,

/glade/scratch/turuncu/SMS_Lh3.C192.GFSv15p2.cheyenne_intel.20200911_214506_127zo6

in this case CHGRES ran without any problem, but the model crashed without any error while trying to read the /glade/p/cesmdata/cseg/ufs_inputdata/global/fix/fix_am.v20191213/global_slmask.t1534.3072.1536.grb file. Again, all resolutions except the C96 ones are failing under my account. The ulimit -s command shows unlimited in my case. CHGRES could also process the raw input for all C768 cases. I am not sure this is because of not loading mkl.

climbfuji commented 3 years ago

@climbfuji I ran the full test suite with the updated code (no mkl) and the model still hangs. The following is one example,

/glade/scratch/turuncu/SMS_Lh3.C192.GFSv15p2.cheyenne_intel.20200911_214506_127zo6

in this case CHGRES ran without any problem, but the model crashed without any error while trying to read the /glade/p/cesmdata/cseg/ufs_inputdata/global/fix/fix_am.v20191213/global_slmask.t1534.3072.1536.grb file. Again, all resolutions except the C96 ones are failing under my account. The ulimit -s command shows unlimited in my case. CHGRES could also process the raw input for all C768 cases. I am not sure this is because of not loading mkl.

It is possible that loading mkl does something to the environment so that a shared library is picked up from a different location. The mkl module does a whole bunch of things, among them:

-- Add libraries and headers to the environment
prepend_path("LD_LIBRARY_PATH", tbbpath)
prepend_path("LD_LIBRARY_PATH", libpath)

If a shared library that is linked into an executable exists in those paths with the same name, it will be picked up from there.
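One way to check whether that is happening (a sketch; the path to the executable is a placeholder):

module load mkl
ldd /path/to/chgres_cube.exe | sort > libs_with_mkl.txt
module unload mkl
ldd /path/to/chgres_cube.exe | sort > libs_without_mkl.txt
diff libs_with_mkl.txt libs_without_mkl.txt   # any difference means a library is resolved from the mkl paths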

Can you try to set ulimit -s 300000? Some applications - although very few - have trouble with an unlimited stack size, because they assume that all available memory is for the stack and nothing is left for the heap.
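A minimal sketch of that change, assuming it is placed in the job script ahead of the launch line:

ulimit -s 300000   # stack size in kilobytes, instead of "unlimited"
ulimit -s          # confirm the new limit before launching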

climbfuji commented 3 years ago

@climbfuji thanks for your help and all the tests. As @llpcarson mentioned before, the issue is not persistent; it works in some cases and fails in others. Anyway, I will remove the mkl and blas options from the machine file, run the full test suite on Cheyenne, and update you through this issue. @jedwards4b what do you think? Could mkl and blas cause the issue?

I wanted to reply to this comment, too. I repeated the chgres_cube.exe runs several times for each test, because you said the failures were intermittent. It worked for me all the time.

climbfuji commented 3 years ago

Three of the 19 regression tests failed for me on cheyenne.intel in my latest attempt, at least one of them in chgres_cube.exe for C768. I see the following in the stacktrace:

MPT: #7  0x00000000005337cd in error_handler (string=...,
MPT:     rc=<error reading variable: Cannot access memory at address 0x0>,
MPT:     .tmp.STRING.len_V$7=55327936)
MPT:     at /glade/p/ral/jntp/GMTB/tools/NCEPLIBS-ufs-v1.1.0/intel-19.0.5/mpt-2.19/src/NCEPLIBS/UFS_UTILS/sorc/chgres_cube.fd/utils.f90:11
MPT: #8  0x00000000004ed732 in model_grid::define_input_grid_gfs_grib2 (localpet=0,
MPT:     npets=<error reading variable: Cannot access memory at address 0x0>)
MPT:     at /glade/p/ral/jntp/GMTB/tools/NCEPLIBS-ufs-v1.1.0/intel-19.0.5/mpt-2.19/src/NCEPLIBS/UFS_UTILS/sorc/chgres_cube.fd/model_grid.F90:640

#7 is not the actual reason for the failure - but still a bug in chgres_cube.exe:

 print*,"- FATAL ERROR: ", string
 print*,"- IOSTAT IS: ", rc
 call mpi_abort

This should be:

 print*,"- FATAL ERROR: ", string
 print*,"- IOSTAT IS: ", rc
 call mpi_abort(mpi_comm_world, 999)

The actual error comes from reading a grib2 file (sorc/chgres_cube.fd/model_grid.F90:640):

 rc = grb2_inq(the_file,inv_file,':PRES:',':surface:',nx=i_input, ny=j_input, &
    lat=lat4, lon=lon4)
 if (rc /= 1) call error_handler("READING GRIB2 FILE", rc)

climbfuji commented 3 years ago

I ran the cheyenne.intel CIME regression tests with the following modifications and got exactly one failure (see below); all other 18 tests passed.

diff --git a/config/ufs/machines/config_compilers.xml b/config/ufs/machines/config_compilers.xml
index 4c8fad749..114c21c8b 100644
--- a/config/ufs/machines/config_compilers.xml
+++ b/config/ufs/machines/config_compilers.xml
@@ -277,14 +277,6 @@ using a fortran linker.
   <SCXX> icpc </SCXX>
   <SFC> ifort </SFC>
   <SLIBS>
-    <append MPILIB="mpich"> -mkl=cluster </append>
-    <append MPILIB="mpich2"> -mkl=cluster </append>
-    <append MPILIB="mvapich"> -mkl=cluster </append>
-    <append MPILIB="mvapich2"> -mkl=cluster </append>
-    <append MPILIB="mpt"> -mkl=cluster </append>
-    <append MPILIB="openmpi"> -mkl=cluster </append>
-    <append MPILIB="impi"> -mkl=cluster </append>
-    <append MPILIB="mpi-serial"> -mkl </append>
   </SLIBS>
   <SUPPORTS_CXX>TRUE</SUPPORTS_CXX>
 </compiler>
diff --git a/config/ufs/machines/config_machines.xml b/config/ufs/machines/config_machines.xml
index 1b53603ea..9bb28d6aa 100644
--- a/config/ufs/machines/config_machines.xml
+++ b/config/ufs/machines/config_machines.xml
@@ -111,11 +111,9 @@ This allows using a different mpirun command to launch unit tests
       </modules>
       <modules compiler="intel">
         <command name="load">intel/19.0.5</command>
-       <command name="load">mkl</command>
       </modules>
       <modules compiler="gnu">
         <command name="load">gnu/8.3.0</command>
-        <command name="load">openblas/0.3.6</command>
       </modules>
       <modules mpilib="mpt" compiler="gnu">
        <command name="load">mpt/2.19</command>
@@ -161,17 +159,21 @@ This allows using a different mpirun command to launch unit tests
       <env name="MPI_USE_ARRAY">false</env>
     </environment_variables>
     <environment_variables>
-      <env name="ESMF_RUNTIME_PROFILE">ON</env>
+      <!-- <env name="ESMF_RUNTIME_PROFILE">ON</env>
       <env name="ESMF_RUNTIME_PROFILE_OUTPUT">SUMMARY</env>
       <env name="OMP_NUM_THREADS">1</env>
       <env name="OMP_STACKSIZE">1024M</env>
       <env name="MPI_TYPE_DEPTH">16</env>
       <env name="MPI_IB_CONGESTED">1</env>
-      <env name="MPI_USE_ARRAY"/>
+      <env name="MPI_USE_ARRAY"/> -->
+      <env name="MPI_TYPE_DEPTH">20</env>
+      <env name="OMP_STACKSIZE">512M</env>
+      <env name="OMP_NUM_THREADS">1</env>
+      <env name="ESMF_RUNTIME_COMPLIANCECHECK">OFF:depth=4</env>
     </environment_variables>
-    <resource_limits>
-      <resource name="RLIMIT_STACK">-1</resource>
-    </resource_limits>
+    <!-- <resource_limits>
+      <resource name="RLIMIT_STACK">300000</resource>
+    </resource_limits> -->
   </machine>

   <machine MACH="gaea">

The following test failed in chgres_cube.exe with the same segmentation fault as in the previous comment: SMS_Lh3_D.C768.GFSv16beta.cheyenne_intel:

MPT: #7  0x00000000005337cd in error_handler (string=...,
MPT:     rc=<error reading variable: Cannot access memory at address 0x0>,
MPT:     .tmp.STRING.len_V$7=55327936)
MPT:     at /glade/p/ral/jntp/GMTB/tools/NCEPLIBS-ufs-v1.1.0/intel-19.0.5/mpt-2.19/src/NCEPLIBS/UFS_UTILS/sorc/chgres_cube.fd/utils.f90:11
MPT: #8  0x00000000004ed732 in model_grid::define_input_grid_gfs_grib2 (localpet=0,
MPT:     npets=<error reading variable: Cannot access memory at address 0x0>)
MPT:     at /glade/p/ral/jntp/GMTB/tools/NCEPLIBS-ufs-v1.1.0/intel-19.0.5/mpt-2.19/src/NCEPLIBS/UFS_UTILS/sorc/chgres_cube.fd/model_grid.F90:640

@GeorgeGayno-NOAA as noted in my previous comment https://github.com/ufs-community/ufs-mrweather-app/issues/190#issuecomment-692109267, #7 is not the actual reason for the failure, but still a bug in chgres_cube.exe.

uturuncoglu commented 3 years ago

@climbfuji @GeorgeGayno-NOAA is there any update on this issue? There could be a memory leak in the grib2-related functions.

GeorgeGayno-NOAA commented 3 years ago

Three of the 19 regression tests failed for me on cheyenne.intel in my latest attempt, at least one of them in chgres_cube.exe for C768. I see the following in the stacktrace:

MPT: #7  0x00000000005337cd in error_handler (string=...,
MPT:     rc=<error reading variable: Cannot access memory at address 0x0>,
MPT:     .tmp.STRING.len_V$7=55327936)
MPT:     at /glade/p/ral/jntp/GMTB/tools/NCEPLIBS-ufs-v1.1.0/intel-19.0.5/mpt-2.19/src/NCEPLIBS/UFS_UTILS/sorc/chgres_cube.fd/utils.f90:11
MPT: #8  0x00000000004ed732 in model_grid::define_input_grid_gfs_grib2 (localpet=0,
MPT:     npets=<error reading variable: Cannot access memory at address 0x0>)
MPT:     at /glade/p/ral/jntp/GMTB/tools/NCEPLIBS-ufs-v1.1.0/intel-19.0.5/mpt-2.19/src/NCEPLIBS/UFS_UTILS/sorc/chgres_cube.fd/model_grid.F90:640

#7 is not the actual reason for the failure - but still a bug in chgres_cube.exe:

 print*,"- FATAL ERROR: ", string
 print*,"- IOSTAT IS: ", rc
 call mpi_abort

This should be:

 print*,"- FATAL ERROR: ", string
 print*,"- IOSTAT IS: ", rc
 call mpi_abort(mpi_comm_world, 999)

The actual error comes from reading a grib2 file (sorc/chgres_cube.fd/model_grid.F90:640):

 rc = grb2_inq(the_file,inv_file,':PRES:',':surface:',nx=i_input, ny=j_input, &
    lat=lat4, lon=lon4)
 if (rc /= 1) call error_handler("READING GRIB2 FILE", rc)

It is trying to read the surface pressure field from the grib2 file. Is the grib2 file corrupt? Is the pressure field missing? Was the inv_file created ok ("./chgres.inv")?
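Those checks can be done by hand with wgrib2 (a sketch; the grib2 file name is a placeholder for whichever input the failing case uses):

wgrib2 -s <input_file.grib2> > chgres.inv     # regenerating the inventory should succeed and produce output
grep ":PRES:" chgres.inv | grep ":surface:"   # the surface pressure record must be present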

uturuncoglu commented 3 years ago

@GeorgeGayno-NOAA I checked the input files and tried using new ones before, and it still gave the error, so it seems it is not related to the file itself. Also, we are getting similar problems on different platforms. @ligiabernardet am I right? The chgres namelist file should also be fine, because the same script generates the namelist and we have passing C768 test cases. I have no failed case right now, but if somebody else does it would be nice to check it. @climbfuji do you have a failed case that we could use to check the chgres namelist file?

ligiabernardet commented 3 years ago

@llpcarson do you have a failed chgres_cube case that we can use to check a) the namelist and b) the GRIB2 inventory ./chgres.inv?

@uturuncoglu I confirm that we also have a failure of chgres_cube C768 on Orion. Results from other platforms:

llpcarson commented 3 years ago

There's a set of cases on cheyenne here: /glade/scratch/carson/ufs/mrw.test/stack/

Fails: SMS_Lh3_D.C768.GFSv16beta.cheyenne_intel.20200915_090650_qga1v1/run/

Runs: SMS_Lh3_D.C768.GFSv15p2.cheyenne_intel.20200915_090650_qga1v1/run/

Both chgres.inv files are identical. Both namelist files are identical.

Laurie

uturuncoglu commented 3 years ago

@ligiabernardet thanks for the update. I am not sure where the source of the problem is. Since we also have cases that run, I suspect chgres; it might have a memory leak, etc.

ligiabernardet commented 3 years ago

@GeorgeGayno-NOAA @climbfuji @uturuncoglu @arunchawla-NOAA @climbfuji @rsdunlapiv @llpcarson Do we have any other hypothesis or idea of what to try to get chgres_cube to work in CIME consistently?

Cheyenne: Occasional crashes of chgres_cube C768 when reading the GRIB2 file
Jet: C768 passed (only 1 run tested - it takes more than a day in the queue due to reservations, so it is hard to do many runs)
Hera: C768 passed (only 1 run tested)
Orion: 1/19 tests that are part of the RT crashed in chgres_cube (#194)
Stampede: waiting for RT results
Gaea: waiting for C768 results

GeorgeGayno-NOAA commented 3 years ago

@GeorgeGayno-NOAA @climbfuji @uturuncoglu @arunchawla-NOAA @climbfuji @rsdunlapiv @llpcarson Do we have any other hypothesis or idea of what to try to get chgres_cube to work in CIME consistently?

Cheyenne: Occasional crashes of chgres_cube C768 when reading the GRIB2 file
Jet: C768 passed (only 1 run tested - it takes more than a day in the queue due to reservations, so it is hard to do many runs)
Hera: C768 passed (only 1 run tested)
Orion: 1/19 tests that are part of the RT crashed in chgres_cube (#194)
Stampede: waiting for RT results
Gaea: waiting for C768 results

Is it always failing at the same spot (model_grid.F90 line 640)? And does the failure occur randomly? Do the RT tests run in sequence or simultaneously?

climbfuji commented 3 years ago

@ligiabernardet @GeorgeGayno-NOAA @climbfuji @uturuncoglu @arunchawla-NOAA @jedwards4b @rsdunlapiv @llpcarson

I just got a successful run of all regression tests on Cheyenne with Intel. This is what I did:

PRs:

climbfuji commented 3 years ago

Ok, here we go ... just got one failure with Intel 18.0.5 on Cheyenne in my second round of tests (when running both Intel and GNU tests with the same command). Super annoying. Will see how the rest works out.

climbfuji commented 3 years ago

@uturuncoglu is there a way to force the tests to run serially, i.e. only one regression test running at a time?

climbfuji commented 3 years ago

@uturuncoglu another question: how do I change the default MPI job size for chgres in CIME? I want the regression tests to run on a different number of nodes with a different number of tasks per node, still 36 tasks in total for C768. Thanks ...