ufs-community / ufs-weather-model

UFS Weather Model

UWM failed on HERA ROCKY #2211

Closed jiandewang closed 1 month ago

jiandewang commented 1 month ago

Description

I was testing updated MOM6 code in UWM and got an unexpected failure, so I switched back to the develop branch with no changes and got the same error.

EXTCDE MPI_ABORT, IEXIT= 52

see error information at /scratch1/NCEPDEV/stmp2/Jiande.Wang/FV3_RT/rt_73092/cpld_control_p8_mixedmode_intel

To Reproduce:

Clone today's UWM (hash c54e9863) and run one of the S2S jobs. In my case I ran "cpld_control_p8_mixedmode_intel".

Additional context

Output

jiandewang commented 1 month ago

I am repeating the test on Orion. The jobs are running now, and none have died so far.

DeniseWorthen commented 1 month ago

@jiandewang That error message is coming from WW3 I believe. Can you check what you have in log.ww3?

jkbk2004 commented 1 month ago

@jiandewang Can you re-run? WW3_input_data_20220624 has been recovered on Hera.

DusanJovic-NOAA commented 1 month ago

I also see this error when I run cpld_control_p8 with the current develop branch:

180:  *** WAVEWATCH III ERROR IN W3IOGR : 
180:      ERROR IN READING FROM mod_def.ww3 FILE
180:      IOSTAT =   67     MOD DEF FILE WAS GENERATED WITH A DIFFERENT            
180:      WW3 VERSION OR USING A DIFFERENT SWITCH FILE.          
180:      MAKE SURE WW3_GRID IS COMPILED WITH SAME SWITCH        
180:      AS WW3_SHEL OR WW3_MULTI, RUN WW3_GRID AGAIN           
180:      AND THEN TRY AGAIN THE PROGRAM YOU JUST USED.          
180: 
180: 
180: 
180: EXTCDE MPI_ABORT, IEXIT=    52
180: 

/scratch1/NCEPDEV/stmp2/Dusan.Jovic/FV3_RT/rt_1182930/cpld_control_p8_intel
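
For anyone triaging a similar failure, the key fields in the excerpt above can be pulled out of a run log with a few regular expressions. This is only a minimal sketch: the function name `summarize_ww3_error` is hypothetical, and the patterns are inferred solely from the log lines quoted in this thread, so other WW3 versions or error types may be formatted differently.

```python
import re

# Patterns inferred only from the log excerpt quoted in this thread (an assumption).
WW3_ERROR = re.compile(r"WAVEWATCH III ERROR IN (\w+)")
IOSTAT = re.compile(r"IOSTAT\s*=\s*(\d+)")
IEXIT = re.compile(r"IEXIT=\s*(\d+)")


def summarize_ww3_error(log_text):
    """Return the first WW3 routine name, IOSTAT and IEXIT found in log_text.

    Any field that does not appear in the log is left as None.
    """
    summary = {"routine": None, "iostat": None, "iexit": None}
    patterns = {"routine": WW3_ERROR, "iostat": IOSTAT, "iexit": IEXIT}
    for line in log_text.splitlines():
        for key, pattern in patterns.items():
            match = pattern.search(line)
            if match and summary[key] is None:
                value = match.group(1)
                # The routine stays a string; the two codes become integers.
                summary[key] = value if key == "routine" else int(value)
    return summary


sample = """\
180:  *** WAVEWATCH III ERROR IN W3IOGR :
180:      ERROR IN READING FROM mod_def.ww3 FILE
180:      IOSTAT =   67     MOD DEF FILE WAS GENERATED WITH A DIFFERENT
180: EXTCDE MPI_ABORT, IEXIT=    52
"""
print(summarize_ww3_error(sample))
```

A non-None `routine` alongside the mod_def.ww3 message points at the wave input data rather than at the component being tested, which is what happened here.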

DeniseWorthen commented 1 month ago

@jiandewang @DusanJovic-NOAA That is why the original WW3-input data needs to be retained. Input data should never be overwritten. Only adding is allowable.

DeniseWorthen commented 1 month ago

@jkbk2004 Please make sure that everyone on your team understands the importance of NOT overwriting input data.

jkbk2004 commented 1 month ago

We always backed up. @zach1221 @FernandoAndrade-NOAA FYI

JessicaMeixner-NOAA commented 1 month ago

@jiandewang just confirming that it is an input error. Was there a reason that the WW3 input data was over-written? We add a specific date/time stamp so that we can version control the input and not over-write it.
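
The add-only, date-stamped convention described here could be enforced by a small staging helper along these lines. This is a sketch under stated assumptions, not the project's actual tooling: `stage_input_data` and its parameters are hypothetical names, and the only thing taken from this thread is the `<name>_<date>` directory pattern (as in WW3_input_data_20220624).

```python
import shutil
from pathlib import Path


def stage_input_data(source_dir, input_root, base_name, stamp):
    """Copy source_dir to input_root/<base_name>_<stamp> without ever overwriting.

    Hypothetical helper: if the stamped directory already exists, it refuses
    to touch it, so new data must be staged under a new date stamp.
    """
    target = Path(input_root) / f"{base_name}_{stamp}"
    if target.exists():
        raise FileExistsError(
            f"{target} already exists; stage the new data under a new date stamp"
        )
    # copytree creates the target directory itself and copies the whole tree.
    shutil.copytree(source_dir, target)
    return target
```

Calling it twice with the same stamp raises `FileExistsError` on the second call instead of replacing the files that existing baselines depend on.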

jkbk2004 commented 1 month ago

My fault! The input directory names were switched back and forth.

jiandewang commented 1 month ago

Jobs are running normally now. Closing this issue.

jiandewang commented 1 month ago

The same problem happened on C5; the same fix is needed there.

DeniseWorthen commented 1 month ago

@jiandewang It looks to me like the files are OK on Gaea. Are you sure your rt didn't fail on Gaea because of https://github.com/ufs-community/ufs-weather-model/issues/2198? The fix for that will be coming in with the WW3 PR today, but until then you need to modify this part of rt.sh:

STMP=/gpfs/f5/epic/scratch
PTMP=/gpfs/f5/epic/scratch

jiandewang commented 1 month ago

@DeniseWorthen Yes, I changed those two lines; otherwise my job could not be submitted. See my rundir: /gpfs/f5/nggps_emc/scratch/Jiande.Wang/ptmp/Jiande.Wang/FV3_RT/rt_235629/cpld_control_p8_intel/out

180:  *** WAVEWATCH III ERROR IN W3IOGR :
180:      ERROR IN READING FROM mod_def.ww3 FILE
180:      IOSTAT =   67     MOD DEF FILE WAS GENERATED WITH A DIFFERENT
180:      WW3 VERSION OR USING A DIFFERENT SWITCH FILE.
180:      MAKE SURE WW3_GRID IS COMPILED WITH SAME SWITCH
180:      AS WW3_SHEL OR WW3_MULTI, RUN WW3_GRID AGAIN
180:      AND THEN TRY AGAIN THE PROGRAM YOU JUST USED.
180:
180:
180:
180: EXTCDE MPI_ABORT, IEXIT=    52

The UWM is based on yesterday's commit. My UWM: /gpfs/f5/nggps_emc/scratch/Jiande.Wang/MOM6-update/NCAR-20230913/ufs-weather-model

jkbk2004 commented 1 month ago

@jiandewang Sorry about the interruption. WW3_input_data_20220624 is restored, and the test looks like it is running OK. Can you check again?

PASS -- COMPILE 's2swa_32bit_intel' [22:16, 20:32]
PASS -- TEST 'cpld_control_p8_mixedmode_intel' [11:23, 07:30](3070 MB)

jiandewang commented 1 month ago

Thanks for the quick action. Let me re-launch my job. I have an NCAR MOM6 PR for which I really want to give them a black-and-white answer before my A/L.

jiandewang commented 1 month ago

Works fine now. Closing.