ufs-community / ufs-weather-model

UFS Weather Model

Cheyenne LMOD issue after last maintenance #1470

Closed jkbk2004 closed 1 year ago

jkbk2004 commented 1 year ago

Description

Lmod Warning: MODULEPATH directory: "/glade/scratch/jongkim/pr-dev-gnu/jongkim/FV3_RT/rt_57077/control_p8" has too many non-modulefiles (149). Please make sure that modulefiles are in their own directory and not mixed in with non-modulefiles (e.g. source code)

To Reproduce:

A good portion of the regression tests run OK, but some fail with both Intel and GNU. The failed cases are:

cpld_control_p8 001 failed in run_test
cpld_control_ciceC_p8 002 failed in run_test
cpld_control_c192_p8 003 failed in run_test
cpld_control_noaero_p8 004 failed in run_test
cpld_control_nowave_noaero_p8 005 failed in run_test
cpld_debug_p8 006 failed in run_test
cpld_debug_noaero_p8 007 failed in run_test
cpld_control_noaero_p8_agrid 008 failed in run_test
cpld_control_c48 009 failed in run_test
cpld_warmstart_c48 010 failed in run_test
control_c384gdas 019 failed in run_test
control_p8 025 failed in run_test
control_p8_lndp 026 failed in run_test
control_p8_rrtmgp 027 failed in run_test
merra2_thompson 028 failed in run_test
rap_rrtmgp 034 failed in run_test
control_csawmg 044 failed in run_test
control_csawmgt 045 failed in run_test
control_wam 047 failed in run_test
control_csawmg_debug 055 failed in run_test
control_csawmgt_debug 056 failed in run_test
control_debug_p8 059 failed in run_test
rap_rrtmgp_debug 069 failed in run_test
control_wam_debug 073 failed in run_test
hafs_regional_atm_ocn 083 failed in run_test
hafs_regional_atm_ocn_wav 085 failed in run_test
hafs_global_multiple_4nests_atm 089 failed in run_test
hafs_regional_storm_following_1nest_atm_ocn 092 failed in run_test
hafs_regional_storm_following_1nest_atm_ocn_wav 093 failed in run_test
control_atmwav 110 failed in run_test
atmaero_control_p8 111 failed in run_test
atmaero_control_p8_rad 112 failed in run_test
atmaero_control_p8_rad_micro 113 failed in run_test
regional_atmaq 114 failed in run_test

Additional context

Need to follow up with SRW team and Cheyenne CISL help desk

jkbk2004 commented 1 year ago

During the maintenance, Lmod was updated from 8.1.7 to 8.7.13: /glade/u/apps/ch/modulefiles/default/localinit/localinit.sh

DeniseWorthen commented 1 year ago

@jkbk2004 is there a quick fix I can use? I've been using Cheyenne as my main development platform for the unstructured mesh for waves, since Hera is too slow.

jkbk2004 commented 1 year ago

@DeniseWorthen can you run source /glade/scratch/jongkim/localinit.sh and see how it goes? I reverted to 8.1.7. I am talking to the CISL help desk. They suggest setting up the modulefiles (lua) a different way, but I need time to test that option.

uturuncoglu commented 1 year ago

@DeniseWorthen @jkbk2004 I also created a ticket about it last week, but I have not received a response yet. I'll update you when I do.

uturuncoglu commented 1 year ago

I got the following response from support. It seems that the issue is related to the newer version of Lmod, and we need to support those cases as well.


Browsing your github issue, the structure of /glade/scratch/jongkim/pr-dev-gnu/jongkim/FV3_RT/rt_57077/control_p8 seems to be the issue. Lmod wants all the files in the search path to be module files, whereas that's not the case.

could you simply create a 'modulefiles' subdirectory and place the .lua files there, then point to that path via MODULEPATH?

I'm sure your NOAA machines will show the same issue when Lmod is upgraded.

benkirk commented 1 year ago

Hi, I hope it's OK to weigh in here:

Yes, we upgraded Lmod on Cheyenne, actually at the request of NOAA/EPIC.

What you are seeing is a message from Lmod that apparently was introduced in v8.7.3: https://github.com/TACC/Lmod/blob/main/README.new

The issue stems from having module and non-module files in the same directory. Lmod prefers to have modulefiles isolated, so I'd recommend creating a './modulefiles' subdirectory, placing the lua files within, and then using that path in your MODULEPATH

Regards,

-Ben
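
A minimal sketch of that suggestion, for illustration only. It assumes a Bash job script; RUNDIR and the file names (modules.fv3.lua, input.nml, job_card) are hypothetical stand-ins for a regression-test run directory, not taken from the actual fix.

```shell
#!/usr/bin/env bash
# Sketch: keep Lua modulefiles apart from the other run-directory files.
set -eu

# Hypothetical stand-in for a run directory that mixes file types.
RUNDIR="$(mktemp -d)"
touch "$RUNDIR/modules.fv3.lua" "$RUNDIR/input.nml" "$RUNDIR/job_card"

# Create a dedicated subdirectory and move the Lua modulefiles into it.
mkdir "$RUNDIR/modulefiles"
mv "$RUNDIR"/*.lua "$RUNDIR/modulefiles/"

# Point MODULEPATH at the subdirectory instead of the run directory itself.
export MODULEPATH="$RUNDIR/modulefiles${MODULEPATH:+:$MODULEPATH}"
```

In a real job card, module use "$RUNDIR/modulefiles" would have the same effect as the export line, since module use prepends a directory to MODULEPATH.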

jkbk2004 commented 1 year ago

@DeniseWorthen @benkirk's suggestion clearly makes sense. I am looking into the script.

jkbk2004 commented 1 year ago

@DeniseWorthen @uturuncoglu take a look at the fix on Cheyenne: PR to Mathew's #1454

DeniseWorthen commented 1 year ago

@jkbk2004 I modified the job_card in one of my sandboxes as you showed and the job is now running.

DeniseWorthen commented 1 year ago

@jkbk2004 Your fix does not allow you to re-run in the same sandbox. When I tried, I got two errors. First, the modulefiles directory already existed. Then, once I commented out the mkdir line, it failed because there were no *.lua files to move.

jkbk2004 commented 1 year ago

@DeniseWorthen I missed the line that checks for an existing modulefiles directory. Can you take a look at lines 16-19?
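
For reference, one way to make the move step safe to re-run in the same sandbox is sketched below. This is not the content of lines 16-19 of the fix, which are not shown in this thread; the function name isolate_modulefiles, RUNDIR, and the file names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: a modulefiles move step that tolerates being run twice.
set -eu

# Hypothetical helper: move any *.lua files in $1 into $1/modulefiles.
isolate_modulefiles() {
  local rundir="$1"
  local f
  # -p: no error if the directory is left over from an earlier run.
  mkdir -p "$rundir/modulefiles"
  for f in "$rundir"/*.lua; do
    # With no match the glob stays literal, so test before moving.
    [ -e "$f" ] || continue
    mv "$f" "$rundir/modulefiles/"
  done
}

# Hypothetical stand-in for a sandbox run directory.
RUNDIR="$(mktemp -d)"
touch "$RUNDIR/modules.fv3.lua" "$RUNDIR/job_card"

isolate_modulefiles "$RUNDIR"   # first run: moves the Lua file
isolate_modulefiles "$RUNDIR"   # re-run: nothing left to move, no error
```

Using mkdir -p covers the first error reported above (the directory already exists), and the existence test inside the loop covers the second (no *.lua files left to move).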