ufs-community / ufs-mrweather-app

UFS Medium-Range Weather Application

C768 RT failed on orion #194

Closed panll closed 3 years ago

panll commented 3 years ago

There are several failures for C768 tests:

1) SMS_Lh3.C768r.HAFS.orion_intel and ERS_Lh11.C768r.HAFS.orion_intel:

No input file/s found locally! Need files under ufs_inputdata/regional/bcond/2019090118

2) SMS_Lh3_D.C768.GFSv16beta.orion_intel failed at chgres:

ERROR: Command: 'sbatch --time 12:00:00 -q batch --account gmtb --dependency=afterok:380631 .case.test --skip-preview-namelist' failed with error 'sbatch: error: QOSMaxWallDurationPerJobLimit sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)' from dir '/work/noaa/gmtb/lpan/test/09112020/scratch/SMS_Lh3_D.C768.GFSv16beta.orion_intel.G.20200911_235539_bp53ie'

3) SMS_Lh3_D.C768.GFSv15p2.orion_intel (Overall: FAIL) details:

ERROR: Command: 'sbatch --time 12:00:00 -q batch --account gmtb --dependency=afterok:380630 .case.test --skip-preview-namelist' failed with error 'sbatch: error: QOSMaxWallDurationPerJobLimit sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)' from dir '/work/noaa/gmtb/lpan/test/09112020/scratch/SMS_Lh3_D.C768.GFSv15p2.orion_intel.G.20200911_235539_bp53ie'

The working directory can be found at /work/noaa/gmtb/lpan/test/09112020/scratch

climbfuji commented 3 years ago

If orion is the same as other RDHPC platforms, the maximum walltime is 8h.
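For illustration, here is a minimal Python sketch of the comparison that the scheduler is effectively making when it rejects the job with QOSMaxWallDurationPerJobLimit. The 8-hour limit is taken from the comment above and is an assumption, not a value read from Orion's QOS configuration; the function names are hypothetical.

```python
def walltime_to_seconds(walltime: str) -> int:
    """Convert an HH:MM:SS string (as passed to `sbatch --time`) to seconds."""
    hours, minutes, seconds = (int(part) for part in walltime.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Assumed limit, based on the comment above; not verified against Orion's QOS settings.
MAX_WALLTIME = "08:00:00"


def exceeds_limit(requested: str, limit: str = MAX_WALLTIME) -> bool:
    """Return True if the requested walltime is longer than the limit."""
    return walltime_to_seconds(requested) > walltime_to_seconds(limit)


# The failing submissions in the report requested 12:00:00.
print(exceeds_limit("12:00:00"))
print(exceeds_limit("08:00:00"))
```

If the assumption holds, the 12:00:00 request in the reported sbatch commands is over the limit, while a request of exactly 08:00:00 is not.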

ligiabernardet commented 3 years ago

@Linlin Pan Did you follow the instructions in the "RT Instructions" tab of the spreadsheet, which say you must use --xml-category prealpha_p1, p2, p3? If you do not, all tests will be launched. This causes two problems: HAFS tests are launched (which we do not want for the MRW App), and too many jobs end up in the queue. Please look at the instructions for RT tests on Orion, and see if you need to rerun the test.

panll commented 3 years ago

Thanks, Ligia! I'll double check that. @ligiabernardet

uturuncoglu commented 3 years ago

@ligiabernardet prealpha_p1, p2, p3 is only for Stampede; the other platforms need to use prealpha only.

ligiabernardet commented 3 years ago

OK, my bad. @panll: we updated the instructions today for all platforms (beyond Stampede). It is necessary to add --xml-category prealpha, otherwise HAFS tests will be triggered. Sorry that the instructions did not include that last week. Please try again including --xml-category prealpha.

panll commented 3 years ago

Thanks, Ligia and ufuk! @ligiabernardet @uturuncoglu

panll commented 3 years ago

Yes, the file testlist.xml needs to be changed for Orion @climbfuji

uturuncoglu commented 3 years ago

@panll @ligiabernardet I adjusted the wallclock time for the C768 debug tests to be consistent with the maximum wallclock time on Orion. The maximum wallclock time in the tests is now 8 hours. Those changes will be available in the next update, but until then src/model/FV3/cime/cime_config/testlist.xml can be edited and 12:00:00 can be changed to 08:00:00.
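As a rough sketch of that manual edit, the Python snippet below caps any HH:MM:SS value above 8 hours in a block of text. The sample line is hypothetical and only stands in for a wallclock entry; the actual layout of testlist.xml is not shown in this thread. The pattern matches any HH:MM:SS string, so the result should be reviewed before it is written back to a real file.

```python
import re

# Matches HH:MM:SS strings such as 12:00:00.
WALLTIME = re.compile(r"\b(\d{1,3}):(\d{2}):(\d{2})\b")


def cap_walltimes(text: str, max_hours: int = 8) -> str:
    """Replace every HH:MM:SS value longer than max_hours with max_hours:00:00."""

    def cap(match: re.Match) -> str:
        hours, minutes, seconds = (int(group) for group in match.groups())
        if hours * 3600 + minutes * 60 + seconds > max_hours * 3600:
            return f"{max_hours:02d}:00:00"
        return match.group(0)  # already within the limit, leave unchanged

    return WALLTIME.sub(cap, text)


# Hypothetical sample line, not copied from the real testlist.xml.
sample = '<option name="wallclock">12:00:00</option>'
print(cap_walltimes(sample))
```

To apply this to the file itself, one could read it with pathlib.Path(...).read_text(), pass the contents through cap_walltimes, and compare the result with the original before saving.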

panll commented 3 years ago

Thanks, ufuk! @uturuncoglu @ligiabernardet I reran the tests with 8 hours; all passed except one test: SMS_Lh3.C768.GFSv16beta.orion_intel. This test crashed at chgres_cube:

ligiabernardet commented 3 years ago

@climbfuji @uturuncoglu Do you have any suggestions as to how the environment (stack size etc.) could be modified for Orion?

uturuncoglu commented 3 years ago

@panll I don't think setting the stack size will help. It could be the same issue that is explained here: https://github.com/ufs-community/ufs-mrweather-app/issues/190.

panll commented 3 years ago

Thanks for checking that! @uturuncoglu

arunchawla-NOAA commented 3 years ago

This should be OK with the changes to ufs_utils.